Notes

Ryan (Netflix):
 - Parquet bloom filters
Julien (Dremio):
 - timestamp logical type
 - timestamp unknown ordering
 - pig decimal
Deepak (Vertica):
  - timestamp
  - bloom filter

Bloom filters:
 - Intel came back with good numbers on their bloom filter pull request
 - TODO: define the spec to make sure it’s portable
 - we need to minimize the need for tuning:
   - 5% default false positive rate?
   - detect overfilling to increase size automatically
   - keep hashes in memory or rehash values to fix overfilling?
   - possibly HLL for cardinality estimation (but let’s not increase the scope)
 - Ryan will help Intel with their pull request
 - Deepak will look into a C++ prototype to confirm portability.
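
A rough sketch of the sizing and overfill-detection points above, using the standard bloom filter formulas. This is not the code from Intel's pull request; the function names are hypothetical, and the 5% default is only the value floated in the notes.

```python
import math


def bloom_size_bits(n_distinct: int, fpp: float = 0.05) -> int:
    """Bits needed for n_distinct values at false positive probability fpp.

    Uses the textbook formula m = -n * ln(p) / (ln 2)^2.
    """
    if n_distinct <= 0:
        raise ValueError("n_distinct must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return math.ceil(-n_distinct * math.log(fpp) / (math.log(2) ** 2))


def optimal_num_hashes(num_bits: int, n_distinct: int) -> int:
    """Number of hash functions k = (m / n) * ln 2, rounded, at least 1."""
    return max(1, round(num_bits / n_distinct * math.log(2)))


def expected_fpp(num_bits: int, num_hashes: int, n_inserted: int) -> float:
    """Approximate false positive rate after n_inserted distinct values."""
    return (1.0 - math.exp(-num_hashes * n_inserted / num_bits)) ** num_hashes


def is_overfilled(num_bits: int, num_hashes: int, n_inserted: int,
                  target_fpp: float = 0.05) -> bool:
    """True when the estimated rate exceeds the target, i.e. time to grow."""
    return expected_fpp(num_bits, num_hashes, n_inserted) > target_fpp


# Size a filter for an expected 1000 distinct values at the 5% default.
bits = bloom_size_bits(1000, 0.05)
hashes = optimal_num_hashes(bits, 1000)
```

A writer could call is_overfilled as values arrive and grow the filter when it returns True. Growing requires either the retained hashes or a second pass over the values, which is the open question listed above. The check also needs a count of distinct inserted values, which is where a cardinality estimate such as HLL would come in.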

Timestamp logical type:
 - need to reconcile arrow and parquet
   - https://issues.apache.org/jira/browse/ARROW-637
   - https://github.com/apache/arrow/blob/3d8b1906ba7b0a6c856e8f3aeb54621489080794/format/Schema.fbs#L117
   - https://github.com/apache/parquet-format/pull/51#discussion_r118303404
 - discrepancy:
   - In Arrow, a timezone in the type means “with timezone”. No timezone means “without timezone”.
   - In Parquet we just have a boolean flag that means “with/without timezone”.
   - That means the types are incompatible for now.
 - Should the timezone field be optional in Arrow and have an explicit “withTimeZone” boolean flag?
 - Julien to send a cross-list email to clarify.
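
To illustrate the discrepancy, here is a small sketch with hypothetical stand-in types. These are not the real Arrow or Parquet definitions; the class and field names and the "UTC" fallback are assumptions made only for the example. It shows that mapping a timezone string onto a boolean flag and back loses the zone name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArrowTimestamp:
    """Stand-in for the Arrow side: a zone string, or None for 'without'."""
    unit: str
    timezone: Optional[str] = None


@dataclass(frozen=True)
class ParquetTimestamp:
    """Stand-in for the Parquet side: only a with/without flag."""
    unit: str
    with_timezone: bool


def arrow_to_parquet(ts: ArrowTimestamp) -> ParquetTimestamp:
    # Presence of a zone maps to the flag; the zone name itself is dropped.
    return ParquetTimestamp(ts.unit, ts.timezone is not None)


def parquet_to_arrow(ts: ParquetTimestamp,
                     assumed_zone: str = "UTC") -> ArrowTimestamp:
    # The flag alone cannot say which zone, so one has to be assumed.
    return ArrowTimestamp(ts.unit, assumed_zone if ts.with_timezone else None)


original = ArrowTimestamp("ms", "America/Los_Angeles")
round_tripped = parquet_to_arrow(arrow_to_parquet(original))
zone_survived = round_tripped == original

naive = ArrowTimestamp("ms")
naive_survived = parquet_to_arrow(arrow_to_parquet(naive)) == naive
```

In this sketch the “without timezone” case round-trips cleanly, while the “with timezone” case comes back with the assumed zone instead of the original one.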

Decimal in Pig:  https://github.com/apache/parquet-mr/pull/404
<https://github.com/apache/parquet-mr/pull/404#pullrequestreview-40090544>
 - Ryan to comment regarding the parquet-avro implementation

Indexing review to be done next time



On Wed, May 24, 2017 at 10:04 AM, Julien Le Dem <[email protected]> wrote:

> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>
> --
> Julien
>



-- 
Julien
