Re: Parquet sync starting now on hangout

Julien Le Dem Fri, 10 Mar 2017 17:15:03 -0800

It requires extra conversion when using code expecting millis timestamps.
That's probably not a strong argument against it, except that we now have
data stored in that format.
Those types were added a while ago:
https://issues.apache.org/jira/browse/PARQUET-12
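
A minimal sketch of the conversion being discussed, assuming both types are
int64 counts since the Unix epoch. The function names are hypothetical and
are not part of any Parquet API:

```python
MICROS_PER_MILLI = 1_000
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def millis_to_micros(ts_millis: int) -> int:
    """Widen a millisecond timestamp to microseconds, checking the int64 range."""
    ts_micros = ts_millis * MICROS_PER_MILLI
    if not INT64_MIN <= ts_micros <= INT64_MAX:
        # Millis cover a wider range of years than micros in the same 8 bytes.
        raise OverflowError(f"{ts_millis} ms does not fit in int64 microseconds")
    return ts_micros


def micros_to_millis(ts_micros: int) -> int:
    """Narrow microseconds to milliseconds, flooring toward negative infinity."""
    # Python's // floors, so pre-epoch (negative) values round consistently.
    return ts_micros // MICROS_PER_MILLI
```

Widening is lossless as long as the value stays within int64. Narrowing
drops the sub-millisecond digits, so a reader that expects millis has to
decide how to round.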


On Thu, Mar 9, 2017 at 6:15 PM, Marcel Kornacker <[email protected]> wrote:

> Timestamp_millis seems like a subset of Timestamp_micros, unless I'm
> missing something: both need 8 bytes of storage, and you can obviously
> pad the former by multiplying with 1000 to arrive at the latter.
> Postgres supports timestamp_micros with a range of 4713BC/294276AD,
> and while dropping to a millisecond resolution will give you a wider
> range of years, I cannot imagine anyone needing that.
>
> Is there a reason why an application that wants to store
> millisecond-resolution timestamps can't simply use timestamp_micros?
>
> On Wed, Mar 8, 2017 at 2:39 PM, Ryan Blue <[email protected]> wrote:
> > TIMESTAMP_MILLIS is a common format for applications that aren't SQL
> > engines and is intended as a way for those apps to mark timestamps. SQL
> > engines would ideally recognize those values and be able to read them.
> >
> > rb
> >
> > On Wed, Mar 8, 2017 at 2:08 PM, Marcel Kornacker <[email protected]> wrote:
> >>
> >> One thing I forgot to bring up: do we care about TIMESTAMP_MILLIS in
> >> addition to TIMESTAMP_MICROS? From a SQL perspective, only the latter
> >> is needed.
> >>
> >> > On Wed, Mar 8, 2017 at 1:54 PM, Julien Le Dem <[email protected]> wrote:
> >> > 2. The other thing to look into is HyperLogLog for approximate
> >> > distinct value counts. It is a similar concept to Bloom filters.
> >> >
> >> > On Wed, Mar 8, 2017 at 1:39 PM, Ryan Blue <[email protected]>
> >> > wrote:
> >> >
> >> >> To follow up on the bloom filter discussion: The discussion on
> >> >> PARQUET-41
> >> >> <https://issues.apache.org/jira/browse/PARQUET-41> has a lot of
> >> >> information
> >> >> and context for the bloom filter spreadsheet
> >> >> <https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit?usp=sharing>
> >> >> I mentioned in the sync-up. The main things we need to worry about
> >> >> are:
> >> >>
> >> >> 1. When are bloom filters worth using? Columns with low % unique will
> >> >> already be dictionary-encoded and dictionary filtering has no
> >> >> false-positives.
> >> >> 2. How should Parquet track the % unique for a column to size the
> >> >> bloom filter correctly? 2x overloading results in a 10x increase in
> >> >> false-positives, so this must avoid overloading.
> >> >> 3. How should Parquet set the target false-positive probability? This
> >> >> is
> >> >> related to the number of lookups in queries. 1% FPP with 5 lookups
> >> >> results
> >> >> in 4.9% FPP for a query.
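
The arithmetic behind points 2 and 3 can be checked with a short sketch. It
is written in Python and uses the textbook Bloom filter approximation
(1 - e^(-k*n/m))^k, which is an assumption here: the thread does not say
which sizing formula or hash count produced the 10x figure, and the function
names below are illustrative only.

```python
import math


def query_fpp(per_lookup_fpp: float, lookups: int) -> float:
    """Chance that at least one of `lookups` independent probes is a false positive."""
    return 1.0 - (1.0 - per_lookup_fpp) ** lookups


def optimal_sizing(target_fpp: float) -> tuple[float, int]:
    """Bits per planned value and hash count for a target FPP (textbook formulas)."""
    bits_per_value = -math.log(target_fpp) / (math.log(2) ** 2)
    num_hashes = max(1, round(bits_per_value * math.log(2)))
    return bits_per_value, num_hashes


def bloom_fpp(bits_per_value: float, num_hashes: int, load_factor: float = 1.0) -> float:
    """Approximate FPP when the filter holds load_factor times the planned values."""
    fill = 1.0 - math.exp(-num_hashes * load_factor / bits_per_value)
    return fill ** num_hashes


bits, hashes = optimal_sizing(0.01)
as_planned = bloom_fpp(bits, hashes, load_factor=1.0)
overloaded = bloom_fpp(bits, hashes, load_factor=2.0)

print(f"query FPP for 5 lookups at 1%: {query_fpp(0.01, 5):.4f}")
print(f"FPP as planned: {as_planned:.4f}, at 2x load: {overloaded:.4f}")
```

Under these assumptions, five independent lookups at 1% each compound to
about 4.9% for the query, and a filter sized for 1% that receives twice the
planned number of values ends up with a false-positive rate more than ten
times higher.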
> >> >>
> >> >> I think there was also some analysis of page level vs row-group level
> >> >> bloom
> >> >> filters and using geometrically decreasing FPP (scalable bloom
> >> >> filters).
> >> >>
> >> >> rb
> >> >>
> >> >> On Wed, Mar 8, 2017 at 11:51 AM, Julien Le Dem <[email protected]>
> >> >> wrote:
> >> >>
> >> >> > Notes:
> >> >> >
> >> >> > Attendees/Agenda:
> >> >> > Zoltan (Cloudera, file formats):
> >> >> >   - timestamp types
> >> >> > Ryan (Netflix):
> >> >> >   - timestamp types
> >> >> >   - fix for sorting metadata (min-max)
> >> >> > Deepak (Vertica, parquet-cpp):
> >> >> >   - timestamp
> >> >> > Emily (IBM Spark Technology center)
> >> >> > Greg (Cloudera):
> >> >> >  - timestamp
> >> >> > Lars (Cloudera Impala):
> >> >> >  - min-max (https://github.com/apache/parquet-format/pull/46)
> >> >> > Marcel (Cloudera Impala):
> >> >> >  - timestamp
> >> >> >  - sorting/min max
> >> >> >  - bloom filters
> >> >> > Julien (Dremio):
> >> >> >  - sorting/min max
> >> >> >  - timestamp.
> >> >> >
> >> >> > - Timestamp (2 types):
> >> >> >   - Floating Timestamp
> >> >> >     - ambiguity to the TZ: year/month/day/microseconds is the data
> >> >> >       stored.
> >> >> >     - timezone-less
> >> >> >     - same binary representation as current Timestamp. Different
> >> >> >       logical annotation.
> >> >> >     - how to store metadata. Same binary format w/wo.
> >> >> >     - action: Ryan to propose a PR on parquet-format
> >> >> >   - Timestamp with Timezone.
> >> >> >     - stored in UTC
> >> >> >     - client side conversion to UTC
> >> >> >     - writer timezone should be stored in the metadata?
> >> >> >   - need to clarify if time can be adjusted.
> >> >> >   - Int96: to be deprecated
> >> >> >     - int64 used instead with logical type.
> >> >> >     - won’t fix int96 ordering. Instead use replacement type.
> >> >> >     - Lars to update the JIRA (PARQUET-323)
> >> >> >   - new binary format: int64 storing actual date (year month day)
> >> >> >     + microseconds since midnight.
> >> >> >     - Marcel to open a JIRA.
> >> >> > - Sorting:
> >> >> >   - Ryan to update the PR
> >> >> >     (https://github.com/apache/parquet-format/pull/46)
> >> >> > - Bloom filter: (PARQUET-319, PARQUET-41)
> >> >> >   - take analysis from original PR:
> >> >> >     - https://github.com/apache/parquet-mr/pull/215
> >> >> >     - https://github.com/apache/parquet-format/pull/28
> >> >> >   - need to define metadata.
> >> >> > - C++ code reuse between parquet-cpp, impala, …
> >> >> >   - impala team to discuss how they want to do that.
> >> >> > - store page level stats in footer (PARQUET-907)
> >> >> >   - several options:
> >> >> >     - Index Page: similar to an ISAM index. 1 per row group: if
> >> >> >       ordered, just maxes and offsets
> >> >> >     - add optional field in footer metadata.
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Wed, Mar 8, 2017 at 10:29 AM, Julien Le Dem <[email protected]> wrote:
> >> >> >
> >> >> > > https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
> >> >> > >
> >> >> > > --
> >> >> > > Julien
> >> >> > >
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Julien
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Ryan Blue
> >> >> Software Engineer
> >> >> Netflix
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Julien
> >
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>



-- 
Julien
