Re: Parquet sync starting now on hangout

Marcel Kornacker Wed, 08 Mar 2017 14:37:08 -0800

One thing I forgot to bring up: do we care about TIMESTAMP_MILLIS in
addition to TIMESTAMP_MICROS? From a SQL perspective, only the latter
is needed.
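
One way to see why microseconds alone may suffice: assuming both annotations
describe int64 counts from the Unix epoch, every millisecond value maps
exactly onto a microsecond value, while the reverse direction drops precision.
A minimal Python sketch, with hypothetical helper names:

```python
def millis_to_micros(millis: int) -> int:
    """Widen a millisecond count to microseconds; always exact."""
    return millis * 1000


def micros_to_millis(micros: int) -> int:
    """Narrow a microsecond count to milliseconds.

    Uses floor division, so sub-millisecond digits are discarded
    (values before the epoch round toward negative infinity).
    """
    return micros // 1000


if __name__ == "__main__":
    ms = 1489012628000            # an arbitrary millisecond timestamp
    us = 1489012628000123         # the same instant plus 123 microseconds

    # Millis -> micros -> millis returns the original value.
    print(micros_to_millis(millis_to_micros(ms)) == ms)

    # Micros -> millis -> micros loses the trailing 123 microseconds.
    print(millis_to_micros(micros_to_millis(us)) == us)
```

The round trip through microseconds is lossless, so a reader that only
understands microseconds could still represent millisecond data.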


On Wed, Mar 8, 2017 at 1:54 PM, Julien Le Dem <[email protected]> wrote:
> 2. The other thing to look into is HyperLogLog for approximate distinct
> value count. The concepts are similar to Bloom filters.
>
> On Wed, Mar 8, 2017 at 1:39 PM, Ryan Blue <[email protected]> wrote:
>
>> To follow up on the bloom filter discussion: The discussion on PARQUET-41
>> <https://issues.apache.org/jira/browse/PARQUET-41> has a lot of information
>> and context for the bloom filter spreadsheet
>> <https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit?usp=sharing>
>> I mentioned in the sync-up. The main things we need to worry about are:
>>
>> 1. When are bloom filters worth using? Columns with low % unique will
>> already be dictionary-encoded and dictionary filtering has no
>> false-positives.
>> 2. How should Parquet track the % unique for a column to size the bloom
>> filter correctly? 2x overloading results in a 10x increase in
>> false-positives, so this must avoid overloading.
>> 3. How should Parquet set the target false-positive probability? This is
>> related to the number of lookups in queries. 1% FPP with 5 lookups results
>> in 4.9% FPP for a query.
>>
>> I think there was also some analysis of page level vs row-group level bloom
>> filters and using geometrically decreasing FPP (scalable bloom filters).
>>
>> rb
>>
>> On Wed, Mar 8, 2017 at 11:51 AM, Julien Le Dem <[email protected]> wrote:
>>
>> > Notes:
>> >
>> > Attendees/Agenda:
>> > Zoltan (Cloudera, file formats):
>> >   - timestamp types
>> > Ryan (Netflix):
>> >   - timestamp types
>> >   - fix for sorting metadata (min-max)
>> > Deepak (Vertica, parquet-cpp):
>> >   - timestamp
>> > Emily (IBM Spark Technology center)
>> > Greg (Cloudera):
>> >  - timestamp
>> > Lars (Cloudera Impala):
>> >  - min-max (https://github.com/apache/parquet-format/pull/46)
>> > Marcel (Cloudera Impala):
>> >  - timestamp
>> >  - sorting/min max
>> >  - bloom filters
>> > Julien (Dremio):
>> >  - sorting/min max
>> >  - timestamp.
>> >
>> > - Timestamp (2 types):
>> >   - Floating Timestamp
>> >     - ambiguity to the TZ: year/month/day/microseconds is the data stored.
>> >     - timezone-less
>> >     - same binary representation as current Timestamp. Different logical
>> >       annotation.
>> >     - how to store metadata. Same binary format w/wo.
>> >     - action: Ryan to propose a PR on parquet-format
>> >   - Timestamp with Timezone.
>> >     - stored in UTC
>> >     - client side conversion to UTC
>> >     - writer timezone should be stored in the metadata?
>> >   - need to clarify if time can be adjusted.
>> >   - Int96: to be deprecated
>> >     - int64 used instead with logical type.
>> >     - won’t fix int96 ordering. Instead use replacement type.
>> >     - Lars to update the JIRA (PARQUET-323)
>> >   - new binary format: int64 storing actual date (year month day) +
>> >     microseconds since midnight.
>> >     - Marcel to open a JIRA.
>> > - Sorting:
>> >   - Ryan to update the PR (https://github.com/apache/parquet-format/pull/46)
>> > - Bloom filter: (PARQUET-319, PARQUET-41)
>> >   - take analysis from original PR:
>> >     - https://github.com/apache/parquet-mr/pull/215
>> >     - https://github.com/apache/parquet-format/pull/28
>> >   - need to define metadata.
>> > - C++ code reuse between parquet-cpp, impala, …
>> >   - impala team to discuss how they want to do that.
>> > - store page level stats in footer (PARQUET-907)
>> >   - several options:
>> >     - Index Page: similar to an ISAM index. 1 per row group: if ordered
>> >       just maxes and offsets
>> >     - add optional field in footer metadata.
>> >
>> >
>> >
>> > On Wed, Mar 8, 2017 at 10:29 AM, Julien Le Dem <[email protected]> wrote:
>> >
>> > > https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>> > >
>> > > --
>> > > Julien
>> > >
>> >
>> >
>> >
>> > --
>> > Julien
>> >
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
>
> --
> Julien
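
As a rough check on the bloom filter numbers in Ryan's points 2 and 3 above,
here is a minimal Python sketch. It assumes the textbook approximation for a
classic bloom filter, (1 - e^(-k*n/m))^k, and independent lookups. The
function names are illustrative only, and nothing here reflects an actual
Parquet implementation.

```python
import math


def query_fpp(per_lookup_fpp: float, lookups: int) -> float:
    """Chance that at least one of `lookups` independent probes is a false positive."""
    return 1.0 - (1.0 - per_lookup_fpp) ** lookups


def size_for(target_fpp: float, expected_values: int) -> tuple[int, int]:
    """Textbook sizing: bits m and hash count k for a target FPP at n expected values."""
    bits = math.ceil(-expected_values * math.log(target_fpp) / (math.log(2) ** 2))
    hashes = max(1, round(bits / expected_values * math.log(2)))
    return bits, hashes


def bloom_fpp(bits: int, hashes: int, inserted: int) -> float:
    """Approximate FPP of a filter with m bits and k hashes after n insertions."""
    return (1.0 - math.exp(-hashes * inserted / bits)) ** hashes


if __name__ == "__main__":
    # Point 3: per-lookup FPP compounds across the lookups in one query.
    print(f"query FPP, 5 lookups at 1%: {query_fpp(0.01, 5):.4f}")

    # Point 2: size for 1% at n values, then insert 2n values instead.
    n = 1_000_000  # assumed distinct-value count, chosen for illustration
    bits, hashes = size_for(0.01, n)
    nominal = bloom_fpp(bits, hashes, n)
    overloaded = bloom_fpp(bits, hashes, 2 * n)
    print(f"nominal FPP: {nominal:.4f}, at 2x load: {overloaded:.4f}")
    print(f"increase factor: {overloaded / nominal:.1f}")
```

Under these assumptions, five lookups at 1% each come to about 4.9% per
query, and doubling the number of inserted values raises the false-positive
rate by roughly an order of magnitude, which is why the distinct-value
estimate used for sizing matters so much.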
