Re: Parquet sync starting now on hangout

Ryan Blue Wed, 08 Mar 2017 13:40:28 -0800

To follow up on the bloom filter discussion: The discussion on PARQUET-41
<https://issues.apache.org/jira/browse/PARQUET-41> has a lot of information
and context for the bloom filter spreadsheet
<https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit?usp=sharing>
I mentioned in the sync-up. The main things we need to worry about are:


1. When are bloom filters worth using? Columns with low % unique will
already be dictionary-encoded and dictionary filtering has no
false-positives.
2. How should Parquet track the % unique for a column to size the bloom
filter correctly? 2x overloading results in roughly a 10x increase in
false positives, so the sizing must avoid overloading.
3. How should Parquet set the target false-positive probability? This is
related to the number of lookups in queries: a 1% FPP with 5 lookups
results in a 4.9% FPP for the query (1 - 0.99^5).
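
To make points 2 and 3 concrete, here is a rough sketch using the
textbook bloom filter approximations. It is not taken from the
spreadsheet or from any Parquet code; the function names and the
100,000-value column are made up for illustration, and it assumes
independent hash functions and independent lookups.

```python
import math

def optimal_bits_and_hashes(n_items, target_fpp):
    """Textbook sizing: bit count m and hash count k for n distinct values."""
    m = math.ceil(-n_items * math.log(target_fpp) / (math.log(2) ** 2))
    k = max(1, round(m / n_items * math.log(2)))
    return m, k

def expected_fpp(m_bits, k_hashes, n_inserted):
    """Approximate false-positive probability after n_inserted distinct values."""
    return (1.0 - math.exp(-k_hashes * n_inserted / m_bits)) ** k_hashes

def query_fpp(per_lookup_fpp, lookups):
    """Chance that at least one of several independent lookups is a false positive."""
    return 1.0 - (1.0 - per_lookup_fpp) ** lookups

# Hypothetical column: filter sized for 100,000 distinct values at 1% FPP.
m, k = optimal_bits_and_hashes(100_000, 0.01)
sized = expected_fpp(m, k, 100_000)
overloaded = expected_fpp(m, k, 200_000)  # twice as many values as planned

print(f"bits={m} hashes={k}")
print(f"fpp at design load: {sized:.4f}")
print(f"fpp at 2x load:     {overloaded:.4f} ({overloaded / sized:.1f}x)")
print(f"query fpp, 5 lookups at 1% each: {query_fpp(0.01, 5):.4f}")
```

The exact overload penalty depends on the target FPP and the number of
hash functions, so treat the 2x-load ratio printed here as an
order-of-magnitude check rather than a fixed constant.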

I think there was also some analysis of page level vs row-group level bloom
filters and using geometrically decreasing FPP (scalable bloom filters).
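
For reference, the scalable bloom filter idea is that each filter added
as the data grows gets a tighter FPP than the previous one, so the
combined FPP stays bounded by a geometric series. A small sketch of that
arithmetic follows; the starting FPP, the tightening ratio, and the
function names are illustrative assumptions, not values from the earlier
analysis.

```python
def compound_fpp(p0, ratio, stages):
    """FPP of a chain of filters where stage i has FPP p0 * ratio**i.

    A lookup is a false positive if any stage reports a false positive.
    """
    miss_all = 1.0
    for i in range(stages):
        miss_all *= 1.0 - p0 * ratio ** i
    return 1.0 - miss_all

def fpp_upper_bound(p0, ratio):
    """Geometric-series bound on the compound FPP, for any number of stages."""
    return p0 / (1.0 - ratio)

# Illustrative values: first stage at 0.5% FPP, each later stage twice as tight.
for stages in (1, 2, 5, 10):
    print(stages, f"{compound_fpp(0.005, 0.5, stages):.5f}")
print("bound", f"{fpp_upper_bound(0.005, 0.5):.5f}")
```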

rb

On Wed, Mar 8, 2017 at 11:51 AM, Julien Le Dem <[email protected]> wrote:

> Notes:
>
> Attendees/Agenda:
> Zoltan (Cloudera, file formats):
>   - timestamp types
> Ryan (Netflix):
>   - timestamp types
>   - fix for sorting metadata (min-max)
> Deepak (Vertica, parquet-cpp):
>   - timestamp
> Emily (IBM Spark Technology center)
> Greg (Cloudera):
>  - timestamp
> Lars (Cloudera impala):
>  - min-max (https://github.com/apache/parquet-format/pull/46)
> Marcel (Cloudera Impala):
>  - timestamp
>  - sorting/min max
>  - bloom filters
> Julien (Dremio):
>  - sorting/min max
>  - timestamp.
>
> - Timestamp (2 types):
>   - Floating Timestamp
>     - ambiguity to the TZ: year/month/day/microseconds is the data stored.
>     - timezone less
>     - same binary representation as current Timestamp. Different logical
> annotation.
>     - how to store metadata. Same binary format w/wo.
>     - action: Ryan to propose a PR on parquet-format
>   - Timestamp with Timezone.
>     - stored in UTC
>     - client side conversion to UTC
>     - writer timezone should be stored in the metadata?
>   - need to clarify if time can be adjusted.
>   - Int96: to be deprecated
>     - int64 used instead with logical type.
>     - won’t fix int96 ordering. Instead use replacement type.
>     - Lars to update the JIRA (PARQUET-323)
>   - new binary format : int64 storing actual date (year month day) +
> microseconds since midnight.
>     - Marcel to open a JIRA.
> - Sorting:
>   - Ryan to update the PR (
> https://github.com/apache/parquet-format/pull/46)
> - Bloom filter: (PARQUET-319, PARQUET-41)
>   - take analysis from original PR:
>     - https://github.com/apache/parquet-mr/pull/215
>     - https://github.com/apache/parquet-format/pull/28
>   - need to define metadata.
> - C++ code reuse between parquet-cpp, impala, …
>   - impala team to discuss how they want to do that.
> - store page level stats in footer (PARQUET-907)
>   - several options:
>     - Index Page: similar to an ISAM index. 1 per row group: if ordered
> just maxes and offsets
>     - add optional field in footer metadata.
>
>
>
> On Wed, Mar 8, 2017 at 10:29 AM, Julien Le Dem <[email protected]> wrote:
>
> > https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
> >
> > --
> > Julien
> >
>
>
>
> --
> Julien
>



-- 
Ryan Blue
Software Engineer
Netflix
