One thing I forgot to bring up: do we care about TIMESTAMP_MILLIS in addition to TIMESTAMP_MICROS? From a SQL perspective, only the latter is needed.
On Wed, Mar 8, 2017 at 1:54 PM, Julien Le Dem <[email protected]> wrote:
> 2. The other thing to look into is HyperLogLog for approximate distinct
> value count. Similar concepts to Bloom filters.
>
> On Wed, Mar 8, 2017 at 1:39 PM, Ryan Blue <[email protected]> wrote:
>
>> To follow up on the bloom filter discussion: The discussion on PARQUET-41
>> <https://issues.apache.org/jira/browse/PARQUET-41> has a lot of information
>> and context for the bloom filter spreadsheet
>> <https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit?usp=sharing>
>> I mentioned in the sync-up. The main things we need to worry about are:
>>
>> 1. When are bloom filters worth using? Columns with low % unique will
>>    already be dictionary-encoded and dictionary filtering has no
>>    false-positives.
>> 2. How should Parquet track the % unique for a column to size the bloom
>>    filter correctly? 2x overloading results in a 10x increase in
>>    false-positives, so this must avoid overloading.
>> 3. How should Parquet set the target false-positive probability? This is
>>    related to the number of lookups in queries. 1% FPP with 5 lookups
>>    results in 4.9% FPP for a query.
>>
>> I think there was also some analysis of page level vs row-group level bloom
>> filters and using geometrically decreasing FPP (scalable bloom filters).
>>
>> rb
>>
>> On Wed, Mar 8, 2017 at 11:51 AM, Julien Le Dem <[email protected]> wrote:
>>
>> > Notes:
>> >
>> > Attendees/Agenda:
>> > Zoltan (Cloudera, file formats):
>> >   - timestamp types
>> > Ryan (Netflix):
>> >   - timestamp types
>> >   - fix for sorting metadata (min-max)
>> > Deepak (Vertica, parquet-cpp):
>> >   - timestamp
>> > Emily (IBM Spark Technology center)
>> > Greg (Cloudera):
>> >   - timestamp
>> > Lars (Cloudera impala):
>> >   - min-max (https://github.com/apache/parquet-format/pull/46)
>> > Marcel (Cl Impala):
>> >   - timestamp
>> >   - sorting/min max
>> >   - bloom filters
>> > Julien (Dremio):
>> >   - sorting/min max
>> >   - timestamp.
>> >
>> > - Timestamp (2 types):
>> >   - Floating Timestamp
>> >     - ambiguity to the TZ: year/month/day/microseconds is the data stored.
>> >     - timezone less
>> >     - same binary representation as current Timestamp. Different logical annotation.
>> >     - how to store metadata. Same binary format w/wo.
>> >     - action: Ryan to propose a PR on parquet-format
>> >   - Timestamp with Timezone.
>> >     - stored in UTC
>> >     - client side conversion to UTC
>> >     - writer timezone should be stored in the metadata?
>> >     - need to clarify if time can be adjusted.
>> > - Int96: to be deprecated
>> >   - int64 used instead with logical type.
>> >   - won’t fix int96 ordering. Instead use replacement type.
>> >   - Lars to update the JIRA (PARQUET-323)
>> >   - new binary format : int64 storing actual date (year month day) + microseconds since midnight.
>> >   - Marcel to open a JIRA.
>> > - Sorting:
>> >   - Ryan to update the PR (https://github.com/apache/parquet-format/pull/46)
>> > - Bloom filter: (PARQUET-319, PARQUET-41)
>> >   - take analysis from original PR:
>> >     - https://github.com/apache/parquet-mr/pull/215
>> >     - https://github.com/apache/parquet-format/pull/28
>> >   - need to define metadata.
>> >   - C++ code reuse between parquet-cpp, impala, …
>> >     - impala team to discuss how they want to do that.
>> > - store page level stats in footer (PARQUET-907)
>> >   - several options:
>> >     - Index Page: similar to an ISAM index. 1 per row group: if ordered just maxes and offsets
>> >     - add optional field in footer metadata.
>> >
>> > On Wed, Mar 8, 2017 at 10:29 AM, Julien Le Dem <[email protected]> wrote:
>> >
>> > > https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>> > >
>> > > --
>> > > Julien
>> >
>> > --
>> > Julien
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>
> --
> Julien
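
The two bloom filter figures in Ryan's message (the per-query FPP after several lookups, and the cost of overloading a filter) can be checked with the standard Bloom filter approximation. The Python sketch below is only an illustration: the function names are hypothetical, lookups are assumed to be independent, and the sizing uses the textbook formulas rather than anything defined by Parquet. The overloading factor depends on the target FPP and the number of hash functions, so it need not match the 10x quoted in the thread.

```python
import math


def optimal_bits(n_items, target_fpp):
    """Bits needed to hold n_items at target_fpp (textbook sizing formula)."""
    return math.ceil(-n_items * math.log(target_fpp) / (math.log(2) ** 2))


def optimal_hashes(n_bits, n_items):
    """Number of hash functions that minimizes FPP for this bits-per-item ratio."""
    return max(1, round(n_bits / n_items * math.log(2)))


def bloom_fpp(n_bits, n_hashes, n_items):
    """Approximate false-positive probability after inserting n_items."""
    return (1.0 - math.exp(-n_hashes * n_items / n_bits)) ** n_hashes


def query_fpp(per_lookup_fpp, lookups):
    """Probability that at least one of several independent lookups is a false positive."""
    return 1.0 - (1.0 - per_lookup_fpp) ** lookups


# Point 3: 1% FPP with 5 lookups (assumed independent).
print(f"query FPP: {query_fpp(0.01, 5):.1%}")  # about 4.9%

# Point 2: size a filter for an assumed 1,000,000 distinct values at 1% FPP,
# then insert twice as many values as it was sized for.
expected_items = 1_000_000
bits = optimal_bits(expected_items, 0.01)
hashes = optimal_hashes(bits, expected_items)
designed = bloom_fpp(bits, hashes, expected_items)
overloaded = bloom_fpp(bits, hashes, 2 * expected_items)
print(f"designed FPP:   {designed:.2%}")
print(f"overloaded FPP: {overloaded:.2%}")
print(f"increase:       {overloaded / designed:.1f}x")
```

Under these formulas and a 1% target, doubling the number of inserted values raises the FPP by more than 10x, which is why the distinct-value estimate used for sizing matters so much.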
