2. The other thing to look into is HyperLogLog for approximate distinct value counts. The concepts are similar to those behind Bloom filters.
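
To make the sizing concern concrete, here is a rough sketch of the arithmetic behind the overloading and per-query FPP numbers quoted below. It uses the textbook approximation p ≈ (1 - e^(-kn/m))^k and is not taken from parquet-mr or the spreadsheet. The function names and the 100,000 distinct values are made up for illustration; `expected_distinct` stands in for whatever estimate (for example, one from HyperLogLog) the writer would have.

```python
import math


def optimal_bits_per_value(target_fpp):
    """Bits per distinct value that minimize filter size for a target FPP."""
    return -math.log(target_fpp) / (math.log(2) ** 2)


def bloom_fpp(num_bits, num_hashes, num_distinct):
    """Textbook approximation of a Bloom filter's false-positive probability."""
    return (1.0 - math.exp(-num_hashes * num_distinct / num_bits)) ** num_hashes


def query_fpp(per_lookup_fpp, lookups):
    """Chance that at least one of several independent lookups is a false positive."""
    return 1.0 - (1.0 - per_lookup_fpp) ** lookups


# Hypothetical column: the writer expects 100,000 distinct values, targets 1% FPP.
expected_distinct = 100_000
target = 0.01

num_bits = math.ceil(optimal_bits_per_value(target) * expected_distinct)
num_hashes = max(1, round(num_bits / expected_distinct * math.log(2)))

as_sized = bloom_fpp(num_bits, num_hashes, expected_distinct)
# Same filter, but the column actually holds twice as many distinct values.
overloaded = bloom_fpp(num_bits, num_hashes, 2 * expected_distinct)
five_lookups = query_fpp(target, 5)

print(f"bits={num_bits} hashes={num_hashes}")
print(f"FPP as sized:       {as_sized:.4f}")
print(f"FPP at 2x overload: {overloaded:.4f}")
print(f"FPP over 5 lookups: {five_lookups:.4f}")
```

Under this approximation, underestimating the distinct count by 2x costs roughly an order of magnitude in FPP, which is why a cheap, reasonably accurate distinct-count estimate matters before the filter is allocated.
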
On Wed, Mar 8, 2017 at 1:39 PM, Ryan Blue <[email protected]> wrote:

> To follow up on the bloom filter discussion: The discussion on PARQUET-41
> <https://issues.apache.org/jira/browse/PARQUET-41> has a lot of information
> and context for the bloom filter spreadsheet
> <https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit?usp=sharing>
> I mentioned in the sync-up. The main things we need to worry about are:
>
>    1. When are bloom filters worth using? Columns with low % unique will
>    already be dictionary-encoded and dictionary filtering has no
>    false-positives.
>    2. How should Parquet track the % unique for a column to size the bloom
>    filter correctly? 2x overloading results in a 10x increase in
>    false-positives, so this must avoid overloading.
>    3. How should Parquet set the target false-positive probability? This is
>    related to the number of lookups in queries. 1% FPP with 5 lookups results
>    in 4.9% FPP for a query.
>
> I think there was also some analysis of page level vs row-group level bloom
> filters and using geometrically decreasing FPP (scalable bloom filters).
>
> rb
>
> On Wed, Mar 8, 2017 at 11:51 AM, Julien Le Dem <[email protected]> wrote:
>
> > Notes:
> >
> > Attendees/Agenda:
> > Zoltan (Cloudera, file formats):
> >  - timestamp types
> > Ryan (Netflix):
> >  - timestamp types
> >  - fix for sorting metadata (min-max)
> > Deepak (Vertica, parquet-cpp):
> >  - timestamp
> > Emily (IBM Spark Technology center)
> > Greg (Cloudera):
> >  - timestamp
> > Lars (Cloudera impala):
> >  - min-max (https://github.com/apache/parquet-format/pull/46)
> > Marcel (Cl Impala):
> >  - timestamp
> >  - sorting/min max
> >  - bloom filters
> > Julien (Dremio):
> >  - sorting/min max
> >  - timestamp.
> >
> > - Timestamp (2 types):
> >   - Floating Timestamp
> >     - ambiguity to the TZ: year/month/day/microseconds is the data stored.
> >     - timezone less
> >     - same binary representation as current Timestamp. Different logical
> >       annotation.
> >     - how to store metadata. Same binary format w/wo.
> >     - action: Ryan to propose a PR on parquet-format
> >   - Timestamp with Timezone.
> >     - stored in UTC
> >     - client side conversion to UTC
> >     - writer timezone should be stored in the metadata?
> >     - need to clarify if time can be adjusted.
> >   - Int96: to be deprecated
> >     - int64 used instead with logical type.
> >     - won’t fix int96 ordering. Instead use replacement type.
> >     - Lars to update the JIRA (PARQUET-323)
> >     - new binary format : int64 storing actual date (year month day) +
> >       microseconds since midnight.
> >     - Marcel to open a JIRA.
> > - Sorting:
> >   - Ryan to update the the PR (
> >     https://github.com/apache/parquet-format/pull/46)
> > - Bloom filter: (PARQUET-319, PARQUET-41)
> >   - take analysis from original PR:
> >     - https://github.com/apache/parquet-mr/pull/215
> >     - https://github.com/apache/parquet-format/pull/28
> >   - need to define metadata.
> >   - C++ code reuse between parquet-cpp, impala, …
> >     - impala team to discuss how they want to do that.
> > - store page level stats in footer (PARQUET-907)
> >   - several options:
> >     - Index Page: similar to an ISAM index. 1 per row group: if ordered
> >       just maxes and offsets
> >     - add optional field in footer metadata.
> >
> > On Wed, Mar 8, 2017 at 10:29 AM, Julien Le Dem <[email protected]> wrote:
> >
> > > https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
> > >
> > > --
> > > Julien
> >
> > --
> > Julien
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Julien
