Using timestamp_micros requires an extra conversion when working with code that expects millisecond timestamps. That is probably not a strong argument against it, except that we now have data stored in the millis format. Those types were added a while ago: https://issues.apache.org/jira/browse/PARQUET-12
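To make the "extra conversion" concrete, here is a minimal sketch in Python. The function names are hypothetical and not part of any Parquet API, and it assumes both representations are signed 64-bit counts since the Unix epoch.

```python
# Hypothetical helpers; the names are illustrative, not a Parquet API.
MICROS_PER_MILLI = 1_000
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def millis_to_micros(millis: int) -> int:
    """Widen a millisecond timestamp to microseconds (pad by multiplying with 1000)."""
    micros = millis * MICROS_PER_MILLI
    # Python ints are unbounded, so check the int64 storage range explicitly.
    if not INT64_MIN <= micros <= INT64_MAX:
        raise OverflowError(f"{millis} ms does not fit in an int64 of microseconds")
    return micros


def micros_to_millis(micros: int) -> int:
    """Narrow a microsecond timestamp to milliseconds, dropping sub-millisecond digits."""
    # Floor division rounds pre-epoch (negative) values toward earlier times.
    return micros // MICROS_PER_MILLI
```

Widening is lossless within the int64 range, while narrowing discards up to 999 microseconds, so code that expects millis has to apply this step on every read.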
On Thu, Mar 9, 2017 at 6:15 PM, Marcel Kornacker <[email protected]> wrote:

> Timestamp_millis seems like a subset of Timestamp_micros, unless I'm
> missing something: both need 8 bytes of storage, and you can obviously
> pad the former by multiplying with 1000 to arrive at the latter.
> Postgres supports timestamp_micros with a range of 4713BC/294276AD,
> and while dropping to a millisecond resolution will give you a wider
> range of years, I cannot imagine anyone needing that.
>
> Is there a reason why an application that wants to store
> millisecond-resolution timestamps can't simply use timestamp_micros?
>
> On Wed, Mar 8, 2017 at 2:39 PM, Ryan Blue <[email protected]> wrote:
> > TIMESTAMP_MILLIS is a common format for applications that aren't SQL engines
> > and is intended as a way for those apps to mark timestamps. SQL engines
> > would ideally recognize those values and be able to read them.
> >
> > rb
> >
> > On Wed, Mar 8, 2017 at 2:08 PM, Marcel Kornacker <[email protected]> wrote:
> >>
> >> One thing I forgot to bring up: do we care about TIMESTAMP_MILLIS in
> >> addition to TIMESTAMP_MICROS? From SQL perspective, only the latter
> >> is needed.
> >>
> >> On Wed, Mar 8, 2017 at 1:54 PM, Julien Le Dem <[email protected]> wrote:
> >> > 2. The other thing to look into is HyperLogLog for approximate distinct
> >> > value count. Similar concepts than Bloom filters
> >> >
> >> > On Wed, Mar 8, 2017 at 1:39 PM, Ryan Blue <[email protected]>
> >> > wrote:
> >> >
> >> >> To follow up on the bloom filter discussion: The discussion on
> >> >> PARQUET-41 <https://issues.apache.org/jira/browse/PARQUET-41> has a
> >> >> lot of information and context for the bloom filter spreadsheet
> >> >> <https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit?usp=sharing>
> >> >> I mentioned in the sync-up. The main things we need to worry about are:
> >> >>
> >> >> 1. When are bloom filters worth using? Columns with low % unique will
> >> >>    already be dictionary-encoded and dictionary filtering has no
> >> >>    false-positives.
> >> >> 2. How should Parquet track the % unique for a column to size the bloom
> >> >>    filter correctly? 2x overloading results in a 10x increase in
> >> >>    false-positives, so this must avoid overloading.
> >> >> 3. How should Parquet set the target false-positive probability? This is
> >> >>    related to the number of lookups in queries. 1% FPP with 5 lookups
> >> >>    results in 4.9% FPP for a query.
> >> >>
> >> >> I think there was also some analysis of page level vs row-group level
> >> >> bloom filters and using geometrically decreasing FPP (scalable bloom
> >> >> filters).
> >> >>
> >> >> rb
> >> >>
> >> >> On Wed, Mar 8, 2017 at 11:51 AM, Julien Le Dem <[email protected]>
> >> >> wrote:
> >> >>
> >> >> > Notes:
> >> >> >
> >> >> > Attendees/Agenda:
> >> >> > Zoltan (Cloudera, file formats):
> >> >> >  - timestamp types
> >> >> > Ryan (Netflix):
> >> >> >  - timestamp types
> >> >> >  - fix for sorting metadata (min-max)
> >> >> > Deepak (Vertica, parquet-cpp):
> >> >> >  - timestamp
> >> >> > Emily (IBM Spark Technology center)
> >> >> > Greg (Cloudera):
> >> >> >  - timestamp
> >> >> > Lars (Cloudera impala):
> >> >> >  - min-max (https://github.com/apache/parquet-format/pull/46)
> >> >> > Marcel (Cl Impala):
> >> >> >  - timestamp
> >> >> >  - sorting/min max
> >> >> >  - bloom filters
> >> >> > Julien (Dremio):
> >> >> >  - sorting/min max
> >> >> >  - timestamp.
> >> >> >
> >> >> >  - Timestamp (2 types):
> >> >> >    - Floating Timestamp
> >> >> >      - ambiguity to the TZ: year/month/day/microseconds is the data stored.
> >> >> >      - timezone less
> >> >> >      - same binary representation as current Timestamp. Different logical
> >> >> >        annotation.
> >> >> >      - how to store metadata. Same binary format w/wo.
> >> >> >      - action: Ryan to propose a PR on parquet-format
> >> >> >    - Timestamp with Timezone.
> >> >> >      - stored in UTC
> >> >> >      - client side conversion to UTC
> >> >> >      - writer timezone should be stored in the metadata?
> >> >> >      - need to clarify if time can be adjusted.
> >> >> >    - Int96: to be deprecated
> >> >> >      - int64 used instead with logical type.
> >> >> >      - won’t fix int96 ordering. Instead use replacement type.
> >> >> >      - Lars to update the JIRA (PARQUET-323)
> >> >> >    - new binary format : int64 storing actual date (year month day) +
> >> >> >      microseconds since midnight.
> >> >> >      - Marcel to open a JIRA.
> >> >> >  - Sorting:
> >> >> >    - Ryan to update the the PR
> >> >> >      (https://github.com/apache/parquet-format/pull/46)
> >> >> >  - Bloom filter: (PARQUET-319, PARQUET-41)
> >> >> >    - take analysis from original PR:
> >> >> >      - https://github.com/apache/parquet-mr/pull/215
> >> >> >      - https://github.com/apache/parquet-format/pull/28
> >> >> >    - need to define metadata.
> >> >> >  - C++ code reuse between parquet-cpp, impala, …
> >> >> >    - impala team to discuss how they want to do that.
> >> >> >  - store page level stats in footer (PARQUET-907)
> >> >> >    - several options:
> >> >> >      - Index Page: similar to an ISAM index. 1 per row group: if ordered
> >> >> >        just maxes and offsets
> >> >> >      - add optional field in footer metadata.
> >> >> >
> >> >> > On Wed, Mar 8, 2017 at 10:29 AM, Julien Le Dem <[email protected]>
> >> >> > wrote:
> >> >> >
> >> >> > > https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
> >> >> > >
> >> >> > > --
> >> >> > > Julien
> >> >> > >
> >> >> >
> >> >> > --
> >> >> > Julien
> >> >> >
> >> >>
> >> >> --
> >> >> Ryan Blue
> >> >> Software Engineer
> >> >> Netflix
> >> >>
> >> >
> >> > --
> >> > Julien
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>

--
Julien
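As a worked check of the bloom filter arithmetic quoted in the thread, here is a rough sketch in Python. It assumes lookups are independent and uses the textbook approximation (1 - e^(-kn/m))^k for a filter with m bits, k hash functions and n inserted values. The function names are hypothetical, and this is not the model used in the linked spreadsheet.

```python
import math


def query_fpp(per_lookup_fpp: float, lookups: int) -> float:
    """Chance that at least one of `lookups` independent probes is a false positive."""
    return 1.0 - (1.0 - per_lookup_fpp) ** lookups


def bloom_fpp(bits: int, hashes: int, items: int) -> float:
    """Textbook approximation of a bloom filter's false-positive probability."""
    return (1.0 - math.exp(-hashes * items / bits)) ** hashes


def size_for(items: int, target_fpp: float) -> tuple[int, int]:
    """Standard sizing: bits m = -n ln(p) / (ln 2)^2, hashes k = (m / n) ln 2."""
    bits = math.ceil(-items * math.log(target_fpp) / (math.log(2) ** 2))
    hashes = max(1, round(bits / items * math.log(2)))
    return bits, hashes


# 1% FPP per lookup with 5 lookups: 1 - 0.99**5 ~= 0.049, the 4.9% in the thread.
per_query = query_fpp(0.01, 5)

# Size a filter for n values, then load it with 2n to see the cost of overloading.
n = 1_000_000
m, k = size_for(n, 0.01)
as_designed = bloom_fpp(m, k, n)
overloaded = bloom_fpp(m, k, 2 * n)
```

Under these assumptions, doubling the number of inserted values raises the false-positive probability by roughly an order of magnitude. The exact multiplier depends on how the filter was sized, so the 10x figure from the spreadsheet is not reproduced exactly here.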
