[jira] [Created] (PARQUET-1117) ParquetRecordWriter does not provide interfaces like getRowCount() and getRawDataSize(), unlike org.apache.orc.Writer
liyunzhang_intel created PARQUET-1117:
--------------------------------------

             Summary: ParquetRecordWriter does not provide interfaces like getRowCount() and getRawDataSize(), unlike org.apache.orc.Writer
                 Key: PARQUET-1117
                 URL: https://issues.apache.org/jira/browse/PARQUET-1117
             Project: Parquet
          Issue Type: Bug
            Reporter: liyunzhang_intel

Hive with ORC can update statistics like rowCount and rawDataSize after loading data into a table. Hive with Parquet cannot, and requires an analyze command like "analyze table xxx compute statistics noscan" to update these two statistics. The reason is that the ParquetRecordWriter used in Hive does not provide interfaces like getRowCount() and getRawDataSize(), while org.apache.orc.Writer provides these [two interfaces|https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Writer.java#L68]. Does anyone know how to get rowCount and rawDataSize from ParquetRecordWriter?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
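One possible workaround while the writer lacks these getters: read the statistics back from the Parquet footer after the writer closes, since the row count and total uncompressed byte size are recorded per row group there. Below is a minimal sketch against the parquet-hadoop footer API (the class name and argument handling are illustrative, and whether getTotalByteSize() is an acceptable stand-in for ORC's rawDataSize is an open question):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.hadoop.ParquetFileReader;
  import org.apache.parquet.hadoop.metadata.BlockMetaData;
  import org.apache.parquet.hadoop.metadata.ParquetMetadata;

  public class ParquetFileStats {
    public static void main(String[] args) throws Exception {
      // Read only the footer; no data pages are scanned ("noscan"-style).
      ParquetMetadata footer =
          ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));

      long rowCount = 0;
      long rawDataSize = 0;
      for (BlockMetaData block : footer.getBlocks()) {
        rowCount += block.getRowCount();          // rows in this row group
        rawDataSize += block.getTotalByteSize();  // uncompressed bytes in this row group
      }
      System.out.println("rowCount=" + rowCount + " rawDataSize=" + rawDataSize);
    }
  }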
Re: Compression test data
For anyone who would also like to test the compression codecs, I’ve uploaded a copy of parquet-cli that can read and write zstd, lz4, and brotli to my Apache public folder:

  http://home.apache.org/~blue/

There’s also a copy of hadoop-common that has all the codec bits for testing zstd. LZ4 should be supported by default, and brotli is built into the parquet-cli Jar. If you want to build the brotli-codec that the Jar uses, the project is here:

  https://github.com/rdblue/brotli-codec

All you need to do is add the hadoop-common Jar to your Hadoop install, copy over the native libs, and run the Parquet CLI like this:

  alias parquet='hadoop jar parquet-cli-0.2.0.jar org.apache.parquet.cli.Main'
  parquet help

rb

On Wed, Sep 27, 2017 at 5:04 PM, Ryan Blue wrote:
> Hi everyone,
>
> I ran some tests using 4 of our large tables to compare compression
> codecs. I tested gzip, brotli, lz4, and zstd, all with the default
> configuration. You can find the raw data and summary tables/graphs in this
> spreadsheet:
>
> https://docs.google.com/spreadsheets/d/1MAPrKHJn1li4MEbtQ9-T1Myu-AI0AshTPSC6C0ttuIw/edit?usp=sharing
>
> For the test, I used parquet-cli to convert Avro data to Parquet using
> each compression format. Run times come from the `time` utility, so this is
> an end-to-end test, not just time spent in the compression algorithm.
> Still, the overhead was the same across all runs for a given table.
>
> I ran the tests on my laptop. I had more physical memory available than
> the maximum size of the JVM, so I don't think paging was an issue. Data was
> read from and written to my local SSD. I wrote an output file for each
> compression codec and table 5 times.
>
> I'm also attaching some sanitized summary information for a row group in
> each table.
>
> Everyone should be able to comment on the results using that link.
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
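If you want to confirm which codec a given output file was actually written with, the footer records it per column chunk. A small sketch, assuming the parquet-hadoop metadata classes:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.hadoop.ParquetFileReader;
  import org.apache.parquet.hadoop.metadata.BlockMetaData;
  import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
  import org.apache.parquet.hadoop.metadata.ParquetMetadata;

  public class ShowCodecs {
    public static void main(String[] args) throws Exception {
      ParquetMetadata footer = ParquetFileReader.readFooter(
          new Configuration(), new Path(args[0]));
      for (BlockMetaData block : footer.getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          // Each column chunk carries the codec it was compressed with.
          System.out.println(column.getPath() + ": " + column.getCodec());
        }
      }
    }
  }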
Compression test data
Hi everyone,

I ran some tests using 4 of our large tables to compare compression codecs. I tested gzip, brotli, lz4, and zstd, all with the default configuration. You can find the raw data and summary tables/graphs in this spreadsheet:

  https://docs.google.com/spreadsheets/d/1MAPrKHJn1li4MEbtQ9-T1Myu-AI0AshTPSC6C0ttuIw/edit?usp=sharing

For the test, I used parquet-cli to convert Avro data to Parquet using each compression format. Run times come from the `time` utility, so this is an end-to-end test, not just time spent in the compression algorithm. Still, the overhead was the same across all runs for a given table.

I ran the tests on my laptop. I had more physical memory available than the maximum size of the JVM, so I don't think paging was an issue. Data was read from and written to my local SSD. I wrote an output file for each compression codec and table 5 times.

I'm also attaching some sanitized summary information for a row group in each table.

Everyone should be able to comment on the results using that link.

rb

--
Ryan Blue
Software Engineer
Netflix

Table One:
Row group 0:  count: 422735  312.85 B records  start: 4  total: 126.125 MB

  column                      type    encodings  count     avg size  nulls
  other_properties.map.key    BINARY  G _ R      17326411  0.09 B    0
  other_properties.map.value  BINARY  G _ R_ F   17326411  7.37 B    0
  event_utc_ms                INT64   G _ R      422735    1.44 B    0
  hostname                    BINARY  G _ R      422735    0.09 B    0
  another_map.map.key         BINARY  G _ R      813517    0.03 B    9
  another_map.map.value       BINARY  G _ R_ F   813517    2.81 B    9

Table Three:
Row group 0:  count: 2153  59.883 kB records  start: 4  total: 125.907 MB

  column                      type    encodings  count  avg size   nulls
  other_properties.map.key    BINARY  G _ R      47366  0.01 B     0
  other_properties.map.value  BINARY  G _ R_ F   47366  4.83 B     0
  event_utc_ms                INT64   G _        2153   2.08 B     0
  hostname                    BINARY  G _ R      2153   1.50 B     0
  another_map.map.key         BINARY  G _        2153   0.02 B     2153
  another_map.map.value       BINARY  G _        2153   0.02 B     2153
  column 1                    INT64   G _        2153   0.02 B     2153
  column 2                    INT64   G _        2153   5.71 B     0
  column 3                    BINARY  G _ R      2153   0.80 B     0
  column 4                    BINARY  G _        2153   0.02 B     2153
  column 5                    INT64   G _        2153   2.67 B     0
  column 6                    BINARY  G _        2153   0.02 B     2153
  column 7                    BINARY  G _ R      2153   0.05 B     0
  column 8                    BINARY  G _        2153   0.02 B     2153
  column 9                    BINARY  G _        2153   0.02 B     2153
  column 10                   BINARY  G _ R      2153   0.79 B     0
  column 11                   BINARY  G _        2153   15.600 kB  0
  column 12                   BINARY  G _        2153   18.22 B    0
  column 13                   INT32   G _ R      2153   0.13 B     0
  column 14                   BINARY  G _ R      2153   0.18 B     0
  column 15                   BINARY  G _        2153   41.811 kB  0
  column 16                   BINARY  G _        2153   2.337 kB   0

Table Two:
Row group 0:  count: 443955  278.25 B records  start: 4  total: 117.809 MB

  column                      type    encodings  count     avg size  nulls
  other_properties.map.key    BINARY  G _ R      11514004  0.10 B    0
  other_properties.map.value  BINARY  G _ R_ F   11514004  5.99 B    0
  event_utc_ms                INT64   G _ R      443955    2.49 B    0
  hostname                    BINARY  G _ R      443955    0.78 B    0
  column 1                    BINARY  G _ R      443955    0.10 B    266638
  column 2                    BINARY  G _ R      443955    0.01 B    443194
  column 3                    BINARY  G _        443955    0.00 B    443955
  column 4                    BINARY  G _ R      443955    0.00 B    443194
  column 5                    BINARY  G _ R      443955    0.69 B    267399
  column 6                    BINARY  G _ R      443955    0.18 B    266638
  column 7                    BINARY  G _ R      443955    0.13 B    43730
  column 8                    BINARY  G _ R      443955    0.43 B    43730
  column 9                    BINARY  G _        443955    0.00 B    443955
  column 10                   BINARY  G _        443955    0.00 B    443955
  column 11                   BINARY  G _ R      443955    0.54 B    44491
  column 12
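For anyone reproducing this conversion in code rather than through parquet-cli, roughly the same write path can be exercised with AvroParquetWriter. A sketch, assuming an Avro schema and records are already in hand; note that the newer codec enum values (BROTLI, LZ4, ZSTD) require a parquet-mr and Hadoop build that actually supports them:

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.avro.AvroParquetWriter;
  import org.apache.parquet.hadoop.ParquetWriter;
  import org.apache.parquet.hadoop.metadata.CompressionCodecName;

  public class WriteWithCodec {
    // Writes the given records to a Parquet file compressed with `codec`,
    // e.g. CompressionCodecName.GZIP.
    static void write(Schema schema, Iterable<GenericRecord> records,
                      Path out, CompressionCodecName codec) throws Exception {
      try (ParquetWriter<GenericRecord> writer =
               AvroParquetWriter.<GenericRecord>builder(out)
                   .withSchema(schema)
                   .withCompressionCodec(codec)
                   .build()) {
        for (GenericRecord record : records) {
          writer.write(record);
        }
      }
    }
  }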
[jira] [Created] (PARQUET-1116) Add Yetus InterfaceAudience annotations to Parquet
Zoltan Ivanfi created PARQUET-1116:
--------------------------------------

             Summary: Add Yetus InterfaceAudience annotations to Parquet
                 Key: PARQUET-1116
                 URL: https://issues.apache.org/jira/browse/PARQUET-1116
             Project: Parquet
          Issue Type: Improvement
            Reporter: Zoltan Ivanfi

Parquet should use [Yetus InterfaceAudience annotations|https://yetus.apache.org/documentation/in-progress/audience-annotations-apidocs/org/apache/yetus/audience/package-summary.html] to specify its API's intended [audience|https://yetus.apache.org/documentation/in-progress/audience-annotations-apidocs/org/apache/yetus/audience/InterfaceAudience.html] (public/private) and optionally its [stability|https://yetus.apache.org/documentation/in-progress/audience-annotations-apidocs/org/apache/yetus/audience/InterfaceStability.html] (unstable/evolving/stable), both for the benefit of its users and to ensure that no breaking changes happen in the public API.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
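To make the proposal concrete, an annotated class could look like the following; ExampleReader is made up for illustration, but the annotations are the actual Yetus ones:

  import org.apache.yetus.audience.InterfaceAudience;
  import org.apache.yetus.audience.InterfaceStability;

  // Public-facing API: downstream projects may use it, but its shape
  // may still change between minor releases.
  @InterfaceAudience.Public
  @InterfaceStability.Evolving
  public class ExampleReader {
    // ...
  }

  // Internal helper: no compatibility guarantees for outside callers.
  @InterfaceAudience.Private
  class ExampleReaderInternals {
    // ...
  }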
[jira] [Created] (PARQUET-1115) Prevent users from misusing parquet-tools merge
Zoltan Ivanfi created PARQUET-1115:
--------------------------------------

             Summary: Prevent users from misusing parquet-tools merge
                 Key: PARQUET-1115
                 URL: https://issues.apache.org/jira/browse/PARQUET-1115
             Project: Parquet
          Issue Type: Improvement
            Reporter: Zoltan Ivanfi
            Assignee: Zoltan Ivanfi

To prevent users from using {{parquet-tools merge}} in scenarios where its use is not practical, we should describe its limitations in the help text of this command. Additionally, we should add a warning to the output of the merge command if the sizes of the original row groups are below a threshold.

Reasoning: Many users are tempted to use the new {{parquet-tools merge}} functionality because they want to achieve good performance, and historically that has been associated with large Parquet files. In practice, however, Hive performance won't change significantly after using {{parquet-tools merge}}, and Impala performance will be much worse. The reason is that good performance comes not from large files but from large row groups (up to the HDFS block size). {{parquet-tools merge}} does not merge row groups; it just places them one after the other. It was intended for Parquet files that are already arranged in row groups of the desired size. When used to merge many small files, the resulting file will still contain small row groups, losing most of the advantages of larger files (the only one that remains is that a single HDFS operation can read them).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
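A sketch of what the proposed warning could look like, driven by footer metadata; the threshold value and method name here are invented for illustration, not part of parquet-tools:

  import java.util.List;
  import org.apache.parquet.hadoop.metadata.BlockMetaData;

  final class MergeWarning {
    // Assumed threshold; a real patch would likely key this off the
    // HDFS block size rather than a constant.
    private static final long ROW_GROUP_WARN_THRESHOLD = 64L * 1024 * 1024;

    static void warnIfSmallRowGroups(List<BlockMetaData> rowGroups) {
      for (BlockMetaData rowGroup : rowGroups) {
        if (rowGroup.getCompressedSize() < ROW_GROUP_WARN_THRESHOLD) {
          System.err.println("Warning: merge concatenates row groups without "
              + "combining them; small input row groups stay small, so reads "
              + "will not get faster.");
          return;
        }
      }
    }
  }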
parquet sync
starting now at: https://meet.google.com/wgv-qske-hzs