Notes

Ryan (Netflix):
 - Parquet bloom filters
Julien (Dremio):
 - timestamp logical type
 - timestamp unknown ordering
 - pig decimal
Deepak (Vertica):
 - timestamp
 - bloom filter

Bloom filters:
 - Intel came back with good numbers on their bloom filters Pull Request
 - TODO: define the spec to make sure it’s portable
 - we need to minimize the need for tuning:
   - 5% default false positive rate?
   - detect overfilling to increase size automatically
   - keep hashes in memory or rehash values to fix overfilling?
   - possibly HLL for cardinality estimation (but let’s not increase the scope)
 - Ryan will help Intel with their Pull Request
 - Deepak will look into a C++ prototype to confirm portability.

Timestamp logical type:
 - need to reconcile Arrow and Parquet
   - https://issues.apache.org/jira/browse/ARROW-637
   - https://github.com/apache/arrow/blob/3d8b1906ba7b0a6c856e8f3aeb54621489080794/format/Schema.fbs#L117
   - https://github.com/apache/parquet-format/pull/51#discussion_r118303404
 - discrepancy:
   - in Arrow, a timezone in the type means "with timezone"; no timezone means "without timezone"
   - in Parquet we just have a boolean flag that means "with/without timezone"
   - that means the types are incompatible for now.
 - should the timezone field be optional in Arrow, with an explicit "withTimeZone" boolean flag?
 - Julien to send a cross-list email to clarify.

Decimal in Pig: https://github.com/apache/parquet-mr/pull/404
<https://github.com/apache/parquet-mr/pull/404#pullrequestreview-40090544>
 - Ryan to comment regarding the parquet-avro impl

Indexing review to be done next time

On Wed, May 24, 2017 at 10:04 AM, Julien Le Dem <[email protected]> wrote:

> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>
> --
> Julien
>

--
Julien
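
P.S. To make the bloom filter tuning items above concrete, here is a rough sketch of how a 5% default false positive rate could translate into a filter size, and how overfilling might be detected. It uses the textbook formulas for a classic Bloom filter; it is not taken from the Intel Pull Request, and the function names (bloom_parameters, estimated_fpp, is_overfilled) are made up for illustration.

```python
import math


def bloom_parameters(expected_items: int, fpp: float = 0.05) -> tuple[int, int]:
    """Return (num_bits, num_hashes) for a classic Bloom filter.

    Textbook sizing: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    The 5% default mirrors the value floated in the notes; it is an assumption.
    """
    if expected_items <= 0 or not 0.0 < fpp < 1.0:
        raise ValueError("expected_items must be > 0 and fpp must be in (0, 1)")
    num_bits = math.ceil(-expected_items * math.log(fpp) / (math.log(2) ** 2))
    # k must be an integer, so the achieved rate is only approximately fpp.
    num_hashes = max(1, round(num_bits / expected_items * math.log(2)))
    return num_bits, num_hashes


def estimated_fpp(num_bits: int, num_hashes: int, inserted_items: int) -> float:
    """Approximate false positive rate after inserting `inserted_items` values."""
    return (1.0 - math.exp(-num_hashes * inserted_items / num_bits)) ** num_hashes


def is_overfilled(num_bits: int, num_hashes: int, inserted_items: int,
                  target_fpp: float = 0.05, slack: float = 1.5) -> bool:
    """Flag a filter whose estimated rate exceeds the target by a slack factor.

    The slack factor is a hypothetical knob: because k is rounded, a filter
    holding exactly the expected number of items can sit slightly above target.
    """
    return estimated_fpp(num_bits, num_hashes, inserted_items) > target_fpp * slack


if __name__ == "__main__":
    bits, hashes = bloom_parameters(1000)
    print(bits, hashes)
    print(is_overfilled(bits, hashes, 1000), is_overfilled(bits, hashes, 2000))
```

A writer that tracks the number of inserted values could call is_overfilled periodically and, when it fires, allocate a larger filter. That still leaves the open question from the notes: keep the hashes in memory, or rehash the values, to populate the larger filter.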
