[jira] [Created] (PARQUET-1117) ParquetRecordWriter does not provide interface like getRowCount(),getRawDataSize() like org.apache.orc.Writer

2017-09-27 Thread liyunzhang_intel (JIRA)
liyunzhang_intel created PARQUET-1117:
-

 Summary: ParquetRecordWriter does not provide interface like 
getRowCount(),getRawDataSize() like org.apache.orc.Writer  
 Key: PARQUET-1117
 URL: https://issues.apache.org/jira/browse/PARQUET-1117
 Project: Parquet
  Issue Type: Bug
Reporter: liyunzhang_intel


Hive with ORC can update statistics like rowCount and rawDataSize after loading 
data into a table. Hive with Parquet cannot, and needs an analyze command like 
"analyze table xxx compute statistics noscan" to update these two statistics. 
The reason is that the ParquetRecordWriter used in Hive does not provide 
interfaces like getRowCount() and getRawDataSize(), while org.apache.orc.Writer 
provides these [two 
interfaces|https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Writer.java#L68]. 
Does anyone know how to get rowCount and rawDataSize from ParquetRecordWriter?
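
Not an answer to the writer-side question, but a minimal sketch of a possible 
workaround, assuming parquet-hadoop's footer metadata APIs: once the file is 
closed, rowCount and an approximation of rawDataSize can be summed from the row 
group metadata (getTotalByteSize() is the uncompressed row group size, which is 
not exactly ORC's rawDataSize).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class FooterStats {
  public static void main(String[] args) throws Exception {
    // Read the footer of an already-written Parquet file.
    ParquetMetadata footer =
        ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));

    long rowCount = 0;
    long totalByteSize = 0;
    for (BlockMetaData block : footer.getBlocks()) {
      rowCount += block.getRowCount();            // rows in this row group
      totalByteSize += block.getTotalByteSize();  // uncompressed bytes in this row group
    }
    System.out.println("rowCount=" + rowCount + ", totalByteSize=" + totalByteSize);
  }
}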




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: Compression test data

2017-09-27 Thread Ryan Blue
For anyone who would also like to test the compression codecs, I’ve
uploaded a copy of parquet-cli that can read and write zstd, lz4, and
brotli to my Apache public folder:

http://home.apache.org/~blue/

There’s also a copy of hadoop-common that has all the codec bits for
testing zstd. LZ4 should be supported by default, and brotli is built into
the parquet-cli Jar. If you want to build the brotli-codec that the Jar
uses, the project is here:

https://github.com/rdblue/brotli-codec

All you need to do is add the hadoop-common Jar to your Hadoop install,
copy over the native libs, and run the Parquet CLI like this:

alias parquet='hadoop jar parquet-cli-0.2.0.jar org.apache.parquet.cli.Main'
parquet help
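
For reference, a rough sketch of the underlying write path for one of the new 
codecs, assuming a parquet-avro/parquet-mr build whose CompressionCodecName 
enum already includes ZSTD (as in the jar above); the schema, output path, and 
record are placeholders:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ZstdWriteExample {
  public static void main(String[] args) throws Exception {
    // Placeholder one-field schema; real tests would use the table's Avro schema.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

    // Write one record compressed with zstd; swap the codec constant to compare formats.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("file:///tmp/test-zstd.parquet"))
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.ZSTD)
        .build()) {
      GenericRecord record = new GenericData.Record(schema);
      record.put("id", 1L);
      writer.write(record);
    }
  }
}

Swapping the codec constant (GZIP, BROTLI, LZ4, ZSTD) is the only change between runs.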

rb

On Wed, Sep 27, 2017 at 5:04 PM, Ryan Blue  wrote:

> Hi everyone,
>
> I ran some tests using 4 of our large tables to compare compression
> codecs. I tested gzip, brotli, lz4, and zstd, all with the default
> configuration. You can find the raw data and summary tables/graphs in this
> spreadsheet:
>
> https://docs.google.com/spreadsheets/d/1MAPrKHJn1li4MEbtQ9-T1Myu-AI0AshTPSC6C0ttuIw/edit?usp=sharing
>
> For the test, I used parquet-cli to convert Avro data to Parquet using
> each compression format. Run times come from the `time` utility, so this is
> an end-to-end test, not just time spent in the compression algorithm.
> Still, the overhead was the same across all runs for a given table.
>
> I ran the tests on my laptop. I had more physical memory available than
> the maximum size of the JVM, so I don't think paging was an issue. Data was
> read from and written to my local SSD. I wrote an output file for each
> compression codec and table 5 times.
>
> I'm also attaching some sanitized summary information for a row group in
> each table.
>
> Everyone should be able to comment on the results using that link.
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Ryan Blue
Software Engineer
Netflix


Compression test data

2017-09-27 Thread Ryan Blue
Hi everyone,

I ran some tests using 4 of our large tables to compare compression codecs.
I tested gzip, brotli, lz4, and zstd, all with the default configuration.
You can find the raw data and summary tables/graphs in this spreadsheet:

https://docs.google.com/spreadsheets/d/1MAPrKHJn1li4MEbtQ9-T1Myu-AI0AshTPSC6C0ttuIw/edit?usp=sharing

For the test, I used parquet-cli to convert Avro data to Parquet using each
compression format. Run times come from the `time` utility, so this is an
end-to-end test, not just time spent in the compression algorithm. Still,
the overhead was the same across all runs for a given table.

I ran the tests on my laptop. I had more physical memory available than the
maximum size of the JVM, so I don't think paging was an issue. Data was
read from and written to my local SSD. I wrote an output file for each
compression codec and table 5 times.

I'm also attaching some sanitized summary information for a row group in
each table.
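
For anyone who wants to produce a similar per-column summary, here's a rough 
sketch using parquet-hadoop's footer metadata (null counts, which come from 
column statistics, are left out for brevity):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class RowGroupSummary {
  public static void main(String[] args) throws Exception {
    ParquetMetadata footer =
        ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));

    // Summarize the first row group, similar to the tables attached below.
    BlockMetaData rowGroup = footer.getBlocks().get(0);
    System.out.println("count: " + rowGroup.getRowCount()
        + "  total: " + rowGroup.getCompressedSize() + " B (compressed)");

    for (ColumnChunkMetaData column : rowGroup.getColumns()) {
      long values = column.getValueCount();
      double avgSize = values == 0 ? 0 : (double) column.getTotalSize() / values;
      System.out.printf("%-30s %-8s %-20s %10d %10.2f B%n",
          column.getPath(), column.getType(), column.getEncodings(), values, avgSize);
    }
  }
}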

Everyone should be able to comment on the results using that link.

rb

-- 
Ryan Blue
Software Engineer
Netflix
Table One:

Row group 0:  count: 422735  312.85 B records  start: 4  total: 126.125 MB

                             type    encodings  count     avg size  nulls
other_properties.map.key     BINARY  G _ R      17326411  0.09 B    0
other_properties.map.value   BINARY  G _ R_ F   17326411  7.37 B    0
event_utc_ms                 INT64   G _ R      422735    1.44 B    0
hostname                     BINARY  G _ R      422735    0.09 B    0
another_map.map.key          BINARY  G _ R      813517    0.03 B    9
another_map.map.value        BINARY  G _ R_ F   813517    2.81 B    9


Table Three:

Row group 0:  count: 2153  59.883 kB records  start: 4  total: 125.907 MB

                             type    encodings  count  avg size   nulls
other_properties.map.key     BINARY  G _ R      47366  0.01 B     0
other_properties.map.value   BINARY  G _ R_ F   47366  4.83 B     0
event_utc_ms                 INT64   G   _      2153   2.08 B     0
hostname                     BINARY  G _ R      2153   1.50 B     0
another_map.map.key          BINARY  G   _      2153   0.02 B     2153
another_map.map.value        BINARY  G   _      2153   0.02 B     2153
column 1                     INT64   G   _      2153   0.02 B     2153
column 2                     INT64   G   _      2153   5.71 B     0
column 3                     BINARY  G _ R      2153   0.80 B     0
column 4                     BINARY  G   _      2153   0.02 B     2153
column 5                     INT64   G   _      2153   2.67 B     0
column 6                     BINARY  G   _      2153   0.02 B     2153
column 7                     BINARY  G _ R      2153   0.05 B     0
column 8                     BINARY  G   _      2153   0.02 B     2153
column 9                     BINARY  G   _      2153   0.02 B     2153
column 10                    BINARY  G _ R      2153   0.79 B     0
column 11                    BINARY  G   _      2153   15.600 kB  0
column 12                    BINARY  G   _      2153   18.22 B    0
column 13                    INT32   G _ R      2153   0.13 B     0
column 14                    BINARY  G _ R      2153   0.18 B     0
column 15                    BINARY  G   _      2153   41.811 kB  0
column 16                    BINARY  G   _      2153   2.337 kB   0


Table Two:

Row group 0:  count: 443955  278.25 B records  start: 4  total: 117.809 MB

                             type    encodings  count     avg size  nulls
other_properties.map.key     BINARY  G _ R      11514004  0.10 B    0
other_properties.map.value   BINARY  G _ R_ F   11514004  5.99 B    0
event_utc_ms                 INT64   G _ R      443955    2.49 B    0
hostname                     BINARY  G _ R      443955    0.78 B    0
column 1                     BINARY  G _ R      443955    0.10 B    266638
column 2                     BINARY  G _ R      443955    0.01 B    443194
column 3                     BINARY  G   _      443955    0.00 B    443955
column 4                     BINARY  G _ R      443955    0.00 B    443194
column 5                     BINARY  G _ R      443955    0.69 B    267399
column 6                     BINARY  G _ R      443955    0.18 B    266638
column 7                     BINARY  G _ R      443955    0.13 B    43730
column 8                     BINARY  G _ R      443955    0.43 B    43730
column 9                     BINARY  G   _      443955    0.00 B    443955
column 10                    BINARY  G   _      443955    0.00 B    443955
column 11                    BINARY  G _ R      443955    0.54 B    44491
column 12

[jira] [Created] (PARQUET-1116) Add Yetus InterfaceAudience annotations to Parquet

2017-09-27 Thread Zoltan Ivanfi (JIRA)
Zoltan Ivanfi created PARQUET-1116:
--

 Summary: Add Yetus InterfaceAudience annotations to Parquet
 Key: PARQUET-1116
 URL: https://issues.apache.org/jira/browse/PARQUET-1116
 Project: Parquet
  Issue Type: Improvement
Reporter: Zoltan Ivanfi


Parquet should use [Yetus InterfaceAudience 
annotations|https://yetus.apache.org/documentation/in-progress/audience-annotations-apidocs/org/apache/yetus/audience/package-summary.html] 
to specify the intended 
[audience|https://yetus.apache.org/documentation/in-progress/audience-annotations-apidocs/org/apache/yetus/audience/InterfaceAudience.html] 
(public/private) and optionally the 
[stability|https://yetus.apache.org/documentation/in-progress/audience-annotations-apidocs/org/apache/yetus/audience/InterfaceStability.html] 
(unstable/evolving/stable) of its APIs in code annotations, both for the benefit 
of its users and to ensure that no breaking changes happen in the public API.
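
As a hedged illustration (the class below is a placeholder, not an actual 
Parquet type), annotated code would look roughly like this, assuming the 
org.apache.yetus:audience-annotations dependency:

import org.apache.yetus.audience.InterfaceAudience;
import org.apache.yetus.audience.InterfaceStability;

// Placeholder class, only to illustrate the annotations.
@InterfaceAudience.Public
@InterfaceStability.Evolving
public class ExamplePublicApi {

  // Internal helpers could be marked as project-private instead:
  @InterfaceAudience.Private
  static class InternalHelper {
  }
}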



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PARQUET-1115) Prevent users from misusing parquet-tools merge

2017-09-27 Thread Zoltan Ivanfi (JIRA)
Zoltan Ivanfi created PARQUET-1115:
--

 Summary: Prevent users from misusing parquet-tools merge
 Key: PARQUET-1115
 URL: https://issues.apache.org/jira/browse/PARQUET-1115
 Project: Parquet
  Issue Type: Improvement
Reporter: Zoltan Ivanfi
Assignee: Zoltan Ivanfi


To prevent users from using {{parquet-tools merge}} in scenarios where its use 
is not practical, we should describe its limitations in the help text of this 
command. Additionally, we should add a warning to the output of the merge 
command if the size of the original row groups is below a threshold.

Reasoning:

Many users are tempted to use the new {{parquet-tools merge}} functionality 
because they want to achieve good performance, and historically that has been 
associated with large Parquet files. However, in practice Hive performance 
won't change significantly after using {{parquet-tools merge}}, and Impala 
performance will be much worse. The reason is that good performance comes not 
from large files but from large row groups (up to the HDFS block size).

However, {{parquet-tools merge}} does not merge row groups; it just places them 
one after the other. It was intended to be used for Parquet files that are 
already arranged in row groups of the desired size. When used to merge many 
small files, the resulting file will still contain small row groups and one 
loses most of the advantages of larger files (the only one that remains is that 
it takes a single HDFS operation to read them).
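
A possible sketch of the proposed warning, assuming parquet-hadoop's footer 
APIs; the threshold value and method name are hypothetical, not existing 
parquet-tools code:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;

public class MergeWarning {
  // Hypothetical threshold: warn when input row groups average under 64 MB.
  private static final long SMALL_ROW_GROUP_THRESHOLD = 64L * 1024 * 1024;

  static void warnIfRowGroupsAreSmall(Configuration conf, List<Path> inputs) throws Exception {
    long totalBytes = 0;
    long rowGroups = 0;
    for (Path input : inputs) {
      for (BlockMetaData block : ParquetFileReader.readFooter(conf, input).getBlocks()) {
        totalBytes += block.getCompressedSize();
        rowGroups += 1;
      }
    }
    if (rowGroups > 0 && totalBytes / rowGroups < SMALL_ROW_GROUP_THRESHOLD) {
      System.err.println("Warning: input row groups are small; merge concatenates row "
          + "groups without combining them, so the merged file will not read faster.");
    }
  }
}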



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


parquet sync

2017-09-27 Thread Julien Le Dem
starting now at:
https://meet.google.com/wgv-qske-hzs