Notes:
Parquet Sync Jul 19 2017
Intros, Agenda:
Anna, Zoltan (Cloudera Budapest): Column Chunk deprecation (PARQUET-291), type 
dependent sort orderings
Cheng (Intel Shanghai): Parquet Bloom Filter
Jim (Cloudera): Bloom Filter
Lars (Cloudera Impala): 
Marcel: Column index design
Ryan (Netflix): Bloom Filters, PARQUET-908 (Logical types), Arrow timestamp
Pooja (Cloudera):
Julien: parquet-mr release, logical types, bloom filter  

Bloom Filter: PARQUET-41
 - https://docs.google.com/document/d/1I2UWCQPd-_6uO8gqf4cDSgRJxspd-ykTTnHdTCSnKUM/edit
 - use case: get by id on a column with highly unique values
 - Distinct value count: table property? or end user input?
 - discussion for picking the hash function and how to set the bits: 
    - Jim referred to:
       - http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf
       - Block split Bloom filter: https://gist.github.com/jbapple-cloudera/e78460e641967e33d6b68877cff27202
 - Where to store the Bloom filter data: in between row groups.
   - offset and length in the column metadata
 - how do we know the number of distinct values? provided or figure out on the 
fly:
   - keep hashes in memory: 8 bytes per distinct hash
 - no Bloom Filter for dictionary encoded columns.
 - UUID columns are a good example.
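
A minimal sketch of the block split idea discussed above, to make the sizing and bit-setting questions concrete. Everything here is an assumption for illustration, not the design in the linked doc or gist: the hash (BLAKE2b truncated to 64 bits), the salt constants (arbitrary odd 32-bit numbers), the 256-bit block size, and the names `BlockSplitBloomFilter`, `num_blocks`, and `hash64`. Sizing uses the classic Bloom filter formula, which is only an approximation for a blocked filter.

```python
import hashlib
import math

# Arbitrary odd 32-bit constants, one per word of a block (illustrative only).
SALTS = (0x47B6137B, 0x44974D91, 0x8824AD5B, 0xA2B7289D,
         0x705495C7, 0x2DF1424B, 0x9EFC4947, 0x5C6BFB31)
BITS_PER_BLOCK = 256  # 8 words of 32 bits, assumed block size


def hash64(value: bytes) -> int:
    """Stand-in 64-bit hash; the real choice was still under discussion."""
    return int.from_bytes(hashlib.blake2b(value, digest_size=8).digest(), "little")


def num_blocks(ndv: int, fpp: float) -> int:
    """Blocks needed for ndv distinct values at target false-positive rate fpp.

    Uses the classic estimate m = -n * ln(p) / (ln 2)^2, rounded up to blocks.
    """
    bits = -ndv * math.log(fpp) / (math.log(2) ** 2)
    return max(1, math.ceil(bits / BITS_PER_BLOCK))


class BlockSplitBloomFilter:
    def __init__(self, ndv: int, fpp: float = 0.01):
        self.blocks = [[0] * len(SALTS) for _ in range(num_blocks(ndv, fpp))]

    def _locate(self, value: bytes):
        h = hash64(value)
        # Upper 32 bits pick the block, lower 32 bits pick one bit per word.
        block = self.blocks[(h >> 32) % len(self.blocks)]
        low = h & 0xFFFFFFFF
        masks = [1 << (((low * salt) & 0xFFFFFFFF) >> 27) for salt in SALTS]
        return block, masks

    def add(self, value: bytes) -> None:
        block, masks = self._locate(value)
        for i, mask in enumerate(masks):
            block[i] |= mask

    def might_contain(self, value: bytes) -> bool:
        block, masks = self._locate(value)
        return all(block[i] & mask for i, mask in enumerate(masks))


if __name__ == "__main__":
    bf = BlockSplitBloomFilter(ndv=1000, fpp=0.01)
    bf.add(b"123e4567-e89b-12d3-a456-426614174000")
    print(bf.might_contain(b"123e4567-e89b-12d3-a456-426614174000"))  # True
```

Each lookup touches a single block, which is the cache-friendliness argument in the paper Jim referred to. The distinct value count still has to come from somewhere (table property, user input, or hashes kept in memory while writing), as noted above.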

Column Indices:
  - Pooja update:
  - writing column indices to parquet files:
     - update from design: all offsets written together.
     - Pooja: to update the design doc
     - < .1% write overhead 
  - Parquet index filter
  - TODO: IO layer to skip pages instead of reading them. 
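
To illustrate what skipping pages could look like once column indices are written, here is a rough sketch. It assumes each page has a recorded offset, length, min, and max; the `PageIndexEntry` and `pages_to_read` names are hypothetical and do not reflect the layout in the design doc.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageIndexEntry:
    """Hypothetical per-page index record (not the actual Parquet structure)."""
    offset: int      # byte offset of the page in the file
    length: int      # compressed page size in bytes
    min_value: int   # smallest value in the page
    max_value: int   # largest value in the page


def pages_to_read(index: List[PageIndexEntry], lo: int, hi: int) -> List[PageIndexEntry]:
    """Keep only pages whose [min, max] range overlaps the filter range [lo, hi]."""
    return [p for p in index if p.max_value >= lo and p.min_value <= hi]


if __name__ == "__main__":
    index = [
        PageIndexEntry(offset=0, length=100, min_value=1, max_value=10),
        PageIndexEntry(offset=100, length=100, min_value=11, max_value=20),
        PageIndexEntry(offset=200, length=100, min_value=21, max_value=30),
    ]
    # Filter: 12 <= x <= 15. Only the second page can contain matches.
    for page in pages_to_read(index, 12, 15):
        print(page.offset, page.length)  # 100 100
```

An IO layer that skips pages would seek to the returned offsets rather than reading and discarding the pages in between.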

Logical Types:
 - Consensus with new structure
 - Arrow includes the TZ in DateTime. Will use UTC for parquet ts
 - TODO Ryan: get back on the PR. get it ready for commit

parquet-cli: 
 - +1 already
 - Ryan to commit

Brotli compression:
 - TODO: feedback from Impala.

Parquet-mr:
 - Patch release.

parquet-thrift
 - need to upgrade to latest. thrift 0.7 is a pain to compile on recent macOS

type dependent sort:
 - signed comparison for int96?
   - min and max are wrong with exception of min == max
 - interval type: Zoltan to open a jira
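
A small illustration of why min and max depend on the sort order, using single-byte values. This is only a sketch of the general problem (signed versus unsigned byte-wise comparison), not a description of how any particular writer computed int96 statistics; the helper names are made up for the example.

```python
def signed_key(b: bytes):
    """Compare bytes as signed 8-bit integers (-128..127)."""
    return [x - 256 if x > 127 else x for x in b]


def unsigned_key(b: bytes):
    """Compare bytes as unsigned 8-bit integers (0..255)."""
    return list(b)


values = [bytes([0x01]), bytes([0x7F]), bytes([0x80]), bytes([0xFF])]

signed_min = min(values, key=signed_key)      # 0x80, since it reads as -128
signed_max = max(values, key=signed_key)      # 0x7f, since it reads as 127
unsigned_min = min(values, key=unsigned_key)  # 0x01
unsigned_max = max(values, key=unsigned_key)  # 0xff

print(signed_min.hex(), signed_max.hex())      # 80 7f
print(unsigned_min.hex(), unsigned_max.hex())  # 01 ff

# With a single distinct value, both orderings agree, so min == max is safe.
single = [bytes([0x80])]
print(min(single, key=signed_key) == min(single, key=unsigned_key))  # True
```

Statistics written under one ordering cannot be used for filtering under the other, except in the min == max case noted above.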

Next time:
 follow up on Column Chunk deprecation (PARQUET-291)

> On Jul 19, 2017, at 9:29 AM, Julien Le Dem <[email protected]> wrote:
> 
> https://plus.google.com/hangouts/_/calendar/anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.vtfomsfgpbvjqd8d3kb8hte3j8
