Notes: Parquet Sync Jul 19 2017

Intros, Agenda:
 - Anna, Zoltan (Cloudera Budapest): Column Chunk deprecation (PARQUET-291), type dependent sort orderings
 - Cheng (Intel Shanghai): Parquet Bloom Filter
 - Jim (Cloudera): Bloom Filter
 - Lars (Cloudera Impala):
 - Marcel: Column index design
 - Ryan (Netflix): Bloom Filters, PARQUET-908 (Logical types), Arrow timestamp
 - Pooja (Cloudera):
 - Julien: parquet-mr release, logical types, bloom filter

Bloom Filter: PARQUET-41
 - https://docs.google.com/document/d/1I2UWCQPd-_6uO8gqf4cDSgRJxspd-ykTTnHdTCSnKUM/edit
 - use case: get by id on a given, highly unique column
 - Distinct value count: table property? or end user input?
 - discussion on picking the hash function and how to set the bits:
   - Jim referred to:
     - http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf
     - Block split Bloom filter: https://gist.github.com/jbapple-cloudera/e78460e641967e33d6b68877cff27202
 - Where to store the Bloom filter data: in between row groups.
   - offset and length in the column metadata
 - how do we know the number of distinct values? provided, or figured out on the fly:
   - keep hashes in memory: 8 bytes per distinct hash
 - no Bloom Filter for dictionary encoded columns.
 - UUID columns are a good example.

Column Indices:
 - Pooja update:
   - writing column indices to parquet files:
     - update from design: all offsets written together.
     - Pooja: to update the design doc
     - < 0.1% write overhead
 - Parquet index filter
   - TODO: IO layer to skip pages instead of reading them.

Logical Types:
 - Consensus with new structure
 - Arrow includes the TZ in DateTime. Will use UTC for parquet ts
 - TODO Ryan: get back on the PR. get it ready for commit

parquet-cli:
 - +1 already
 - Ryan to commit

Brotli compression:
 - TODO: feedback from Impala.

Parquet-mr:
 - Patch release.

parquet-thrift:
 - need to upgrade to latest. thrift 0.7 is a pain to compile on recent macOS

type dependent sort:
 - signed comparison for int96?
 - min and max are wrong with exception of min == max
 - interval type: Zoltan to open a jira

Next time: follow up on Column Chunk deprecation (PARQUET-291)

> On Jul 19, 2017, at 9:29 AM, Julien Le Dem <[email protected]> wrote:
>
> https://plus.google.com/hangouts/_/calendar/anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.vtfomsfgpbvjqd8d3kb8hte3j8
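
For readers unfamiliar with the "block split Bloom filter" idea from the Bloom Filter discussion, here is a minimal Python sketch. It is an illustration under stated assumptions, not the design agreed in the sync: the class name, the 256-bit block size (eight 32-bit words), the eight odd multiplier constants, and the use of BLAKE2b as a stand-in 64-bit hash are all choices made for this example (the notes say the hash function was still under discussion).

```python
import hashlib

# Assumption: eight odd 32-bit multipliers, one per word of a block.
SALTS = (0x47B6137B, 0x44974D91, 0x8824AD5B, 0xA2B7289D,
         0x705495C7, 0x2DF1424B, 0x9EFC4947, 0x5C6BFB31)


def hash64(value: bytes) -> int:
    """Stand-in 64-bit hash (assumption: any well-mixed 64-bit hash would do)."""
    return int.from_bytes(hashlib.blake2b(value, digest_size=8).digest(), "little")


class BlockSplitBloomFilter:
    """Bloom filter split into 256-bit blocks; each value touches one block only."""

    def __init__(self, num_blocks: int):
        self.num_blocks = num_blocks
        # Each block is eight 32-bit words, all initially zero.
        self.blocks = [[0] * 8 for _ in range(num_blocks)]

    def _locate(self, value: bytes):
        h = hash64(value)
        # Upper 32 bits pick the block, scaled into [0, num_blocks).
        block_index = ((h >> 32) * self.num_blocks) >> 32
        # Lower 32 bits pick one bit (0..31) in each of the eight words.
        key = h & 0xFFFFFFFF
        masks = [1 << (((key * salt) & 0xFFFFFFFF) >> 27) for salt in SALTS]
        return self.blocks[block_index], masks

    def insert(self, value: bytes) -> None:
        block, masks = self._locate(value)
        for i, mask in enumerate(masks):
            block[i] |= mask

    def might_contain(self, value: bytes) -> bool:
        block, masks = self._locate(value)
        return all(block[i] & mask for i, mask in enumerate(masks))
```

Because every lookup reads a single small block, a probe costs at most one cache line, which is the motivation given in the cache-efficient Bloom filter paper linked above. A filter like this can return false positives but never false negatives.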

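
The "min and max are wrong with exception of min == max" note under type dependent sort is terse. One plausible reading, stated here as an assumption, is that statistics computed with a signed byte-wise comparison disagree with the unsigned ordering readers expect for binary and string data. The small Python sketch below illustrates that effect; the helper name `signed_key` and the sample values are hypothetical.

```python
def signed_key(data: bytes):
    """Reinterpret each byte as a signed 8-bit value, as a signed comparison would."""
    return [b - 256 if b >= 128 else b for b in data]


values = [b"a", "é".encode("utf-8"), b"z"]

# Python compares bytes as unsigned values, the ordering expected for strings.
unsigned_min, unsigned_max = min(values), max(values)

# A signed comparison treats bytes >= 0x80 as negative and reorders them.
signed_min = min(values, key=signed_key)
signed_max = max(values, key=signed_key)

print(unsigned_min, unsigned_max)
print(signed_min, signed_max)
```

Here the unsigned ordering gives min `b"a"` and max `"é"`, while the signed ordering gives min `"é"` and max `b"z"`, so neither bound can be trusted for filtering. When a column chunk holds a single distinct value, both orderings agree, which would explain why min == max is the one safe case.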