Re: parquet sync

Julien Le Dem Thu, 28 Sep 2017 10:11:57 -0700

Parquet Sync Sept 27 2017:
Attendance and agenda:
Lars (Cloudera Impala):
 - Parquet page index status
Zoltan (Cloudera Impala):
 - vectorization
 - api annotation (Private/Public)
Ryan (Netflix):
 - logical types commit
 - Compression tests
Wes (TwoSigma):
 - Compression C++
Julien:
 - testing parquet files: JSON and Parquet.
Jim (Cloudera)


Notes:
Page Index status:
 - need feedback on PR: https://github.com/apache/parquet-format/pull/63
   Action: Julien, Marcel Review
Vectorization:
 - https://issues.apache.org/jira/browse/PARQUET-131
   original discussion in Parquet, which stalled.
 - https://issues.apache.org/jira/browse/HIVE-14815
   Hive vectorized Parquet read.
 - Use annotations to clarify the state of an API.
   - Zoltan to open a JIRA for the annotations.
   - need to reopen the vectorized reader discussion. Follow up on PARQUET-131.
Logical types:
 - action: need to review PR: https://github.com/apache/parquet-format/pull/51
Compression tests:
 - Ryan: used parquet-cli with the 4 largest/most expensive tables
   => some are big maps of k/v pairs, others are features/structured
   - ran each test 5 times and averaged the results.
   - will send a spreadsheet with results for brotli/zstandard/lz4
   - brotli/zstandard look like winners: need more extensive tests
   - brotli level 5 seems to be a good tradeoff between compression cost and size
   - lz4: quickest compression time but largest output
   - zstandard: a bit faster and a bit smaller than brotli
   - uses:
     - jbrotli: embedded native library in jar
     - zstd: zlibnative path. packaged in Ubuntu
 - action: Ryan to clean up and send out the report
 - Wes: C++ speed: gzip, snappy, lz4, zstd
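
For anyone who wants to reproduce the "run 5 times and average" approach on their own data, here is a minimal sketch. It is not Ryan's parquet-cli setup: it compresses a raw byte buffer rather than Parquet pages, and it uses the Python standard-library codecs (zlib, bz2, lzma) as stand-ins because brotli, zstandard and lz4 need third-party packages. The names benchmark, CODECS and SAMPLE are made up for illustration.

```python
import bz2
import lzma
import time
import zlib


def benchmark(codecs, data, runs=5):
    """Compress `data` `runs` times per codec; report average time and output size."""
    results = {}
    for name, compress in codecs.items():
        elapsed = []
        size = 0
        for _ in range(runs):
            start = time.perf_counter()
            out = compress(data)
            elapsed.append(time.perf_counter() - start)
            size = len(out)
        results[name] = {
            "avg_seconds": sum(elapsed) / runs,
            "compressed_bytes": size,
        }
    return results


# Stand-in codecs (assumption): swap in brotli/zstandard/lz4 bindings if installed.
CODECS = {
    "zlib-6": lambda b: zlib.compress(b, 6),
    "bz2-9": lambda b: bz2.compress(b, 9),
    "lzma": lzma.compress,
}

# Hypothetical sample: repetitive key/value text, loosely like a map column.
SAMPLE = b"key=value;feature=1.0;" * 5000

if __name__ == "__main__":
    for name, r in benchmark(CODECS, SAMPLE).items():
        print(f"{name}: {r['avg_seconds']:.4f}s avg, {r['compressed_bytes']} bytes")
```

Averaging over several runs smooths out timing noise; the compressed size is the same on every run, so only the last one is kept.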

Parquet files for tests:
 - Impala has a repository of files for tests: https://github.com/apache/incubator-impala/tree/master/testdata
 - old compat test repo: https://github.com/Parquet/parquet-compatibility
 - have a repository of test files.
 - action: Lars to open a JIRA.

parquet-tools merge command:
  - merge command: puts row groups one after another.
  - need a JIRA to add a comment on how this works (concatenates existing row groups without combining them into larger ones)
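
To make the merge behavior concrete, here is a toy model of it. This is not the parquet-tools code: RowGroup, ParquetFile and merge_by_concatenation are hypothetical names, and a row group is reduced to just its row count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    num_rows: int  # stand-in for a real row group's metadata and data


@dataclass
class ParquetFile:
    row_groups: List[RowGroup]


def merge_by_concatenation(files: List[ParquetFile]) -> ParquetFile:
    """Append each input's row groups in order; never combine them into larger ones."""
    merged: List[RowGroup] = []
    for f in files:
        merged.extend(f.row_groups)
    return ParquetFile(merged)


if __name__ == "__main__":
    a = ParquetFile([RowGroup(10), RowGroup(20)])
    b = ParquetFile([RowGroup(5)])
    out = merge_by_concatenation([a, b])
    # Three small row groups remain; they are not rewritten into one group of 35 rows.
    print([rg.num_rows for rg in out.row_groups])
```

The point of the comment requested above is that merging many small files this way yields one file that still has many small row groups.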
