Parquet Sync Sept 27 2017

Attendance and agenda:
- Lars (Cloudera Impala):
  - Parquet page index status
- Zoltan (Cloudera Impala):
  - vectorization
  - API annotation (Private/Public)
- Ryan (Netflix):
  - logical types commit
  - compression tests
- Wes (TwoSigma):
  - compression C++
- Julien:
  - testing Parquet files: JSON and Parquet
- Jim (Cloudera)

Notes:

Page Index status:
- need feedback on PR: https://github.com/apache/parquet-format/pull/63
- Action: Julien, Marcel to review

Vectorization:
- https://issues.apache.org/jira/browse/PARQUET-131 original discussion in Parquet, which stalled.
- https://issues.apache.org/jira/browse/HIVE-14815 Hive vectorized Parquet read.
- Use annotations to clarify the state of an API
  - Zoltan to open JIRA: annotations.
- need to reopen the vectorized reader discussion. Follow up on PARQUET-131.

Logical types:
- Action: need to review PR: https://github.com/apache/parquet-format/pull/51

Compression tests:
- Ryan: used parquet-cli with the 4 largest/most expensive tables => some are big maps of k/v pairs, others are features/structured
  - ran 5 times + average.
  - will send spreadsheet with results for brotli/zstandard/lz4
  - brotli/zstandard look like winners: need more extensive tests
  - brotli level 5 seems to be a good tradeoff between compression cost and size
  - lz4: quickest compression time but largest output
  - zstandard: a bit faster and a bit smaller than brotli
  - uses:
    - jbrotli: embedded native library in jar
    - zstd: zlibnative path. packaged in Ubuntu
  - Action: Ryan to clean up and send out report
- Wes: C++ speed: gzip, snappy, lz4, zstd

Parquet files for tests:
- Impala has a repository of files for tests: https://github.com/apache/incubator-impala/tree/master/testdata
- old compat test repo: https://github.com/Parquet/parquet-compatibility
- have a repository of files.
- open a JIRA: Lars.

parquet-tools merge command:
- merge command: puts row groups after one another.
- need JIRA to add comment on how this works (concatenates existing row groups without combining them into larger ones)
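
To make the merge behavior noted above concrete, here is a minimal sketch in Python. It is not the parquet-tools implementation: `RowGroup`, `ParquetFile`, `merge_by_concatenation`, and `rewrite_into_larger_groups` are hypothetical names, and a file is modeled only as a list of row-group row counts. It contrasts concatenation (what the notes describe) with a rewrite that would actually produce larger row groups.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group; only the row count is modeled."""
    num_rows: int


@dataclass
class ParquetFile:
    """Hypothetical stand-in for a Parquet file: an ordered list of row groups."""
    row_groups: List[RowGroup]


def merge_by_concatenation(files: List[ParquetFile]) -> ParquetFile:
    """Place the input row groups one after another, unchanged."""
    merged: List[RowGroup] = []
    for f in files:
        merged.extend(f.row_groups)
    return ParquetFile(merged)


def rewrite_into_larger_groups(files: List[ParquetFile], target_rows: int) -> ParquetFile:
    """For contrast only: repack all rows into groups of up to target_rows."""
    total = sum(rg.num_rows for f in files for rg in f.row_groups)
    full, rest = divmod(total, target_rows)
    sizes = [target_rows] * full + ([rest] if rest else [])
    return ParquetFile([RowGroup(n) for n in sizes])


small_files = [
    ParquetFile([RowGroup(100), RowGroup(50)]),
    ParquetFile([RowGroup(75)]),
]

merged = merge_by_concatenation(small_files)
print([rg.num_rows for rg in merged.row_groups])     # [100, 50, 75]

rewritten = rewrite_into_larger_groups(small_files, target_rows=200)
print([rg.num_rows for rg in rewritten.row_groups])  # [200, 25]
```

In this model, merging many small files yields one file with just as many small row groups as before, which is why the notes ask for a comment documenting that behavior.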
