Re: Date and time for next Parquet sync

Julien Le Dem Tue, 28 Aug 2018 10:16:58 -0700

Notes:
Anna (Cloudera): Bloom filter update, Iceberg
Gabor, Nandor (Cloudera):


   - Value skipping implementation to be reviewed. Move Java code from
   parquet-format to parquet-mr. PR ready.
   - How users of Parquet can handle timestamps and time zones. Allow for
   writing timestamps in Java. Refactor original type logic into a more
   flexible new original type API.
   - Column indexes and alignment of pages.
   - Limiting the number of records in a page to avoid skewed splits when
   compression is really good.
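
To illustrate the last point, here is a minimal Python sketch (not parquet-mr code) of a page writer that closes a page on either a byte-size threshold or a record-count threshold. The names `PageWriter`, `max_page_bytes`, `max_page_records` and `page_record_counts` are hypothetical and only serve the illustration. The extreme case modelled here is a run of repeated values whose encoded size barely grows, so a size-only limit never triggers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageWriter:
    """Hypothetical page writer; this is NOT the parquet-mr API."""
    max_page_bytes: int
    max_page_records: int
    # Record count of every page that has been closed so far.
    pages: List[int] = field(default_factory=list)
    _records: int = field(default=0, init=False)
    _bytes: int = field(default=0, init=False)

    def write(self, encoded_size: int) -> None:
        """Add one record whose encoded form adds `encoded_size` bytes."""
        self._records += 1
        self._bytes += encoded_size
        # Close the page as soon as EITHER limit is reached.
        if (self._bytes >= self.max_page_bytes
                or self._records >= self.max_page_records):
            self.flush()

    def flush(self) -> None:
        """Close the current page if it holds any records."""
        if self._records:
            self.pages.append(self._records)
        self._records = 0
        self._bytes = 0


def page_record_counts(n_records, bytes_per_record, max_bytes, max_records):
    """Write n_records and return the record count of each resulting page."""
    writer = PageWriter(max_bytes, max_records)
    for _ in range(n_records):
        writer.write(bytes_per_record)
    writer.flush()
    return writer.pages


# Perfectly compressible data (0 extra bytes per record), 1 MiB size limit.
size_only = page_record_counts(100_000, 0, 1 << 20, 10**9)
with_cap = page_record_counts(100_000, 0, 1 << 20, 10_000)
print(len(size_only), len(with_cap))  # prints "1 10"
```

With only the size limit, all 100,000 records end up in a single page; with a cap of 10K records per page (the value floated as a possible default in the notes), the same data is spread over ten evenly sized pages.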

Ryan (Netflix): Iceberg stuff back to Parquet: expression library for push
down. Dictionary and stats based row group filtering.
JunJie (Intel): Bloom filter. Need more reviews. Have a vote on the design
and add it to parquet-format.
Julien (Wework): Encryption.


   - Bloom Filter:
   https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-41
      - Committed utility class to parquet-cpp.
      - Uploaded the benchmark result.
      - Ready to add into the spec.
      - Submit a PR for the parquet reader spec.
      - *Action*: review parquet java utility class.
      https://github.com/apache/parquet-mr/pull/425
   - Encryption:
      - Nandor, Gabor reviewing.
      - APIs to allow pluggable key management.
      - Need to have a proper review of the spec.
      - Need more testing.
   - Column indices:
      - PR to be reviewed: https://github.com/apache/parquet-mr/pull/514
      - Ryan: to review features branch.
   - Moving Java code from parquet-format to parquet-mr:
      - *Action*: review. https://github.com/apache/parquet-mr/pull/517
      - Gets the thrift file from the parquet-format released artifact.
   - Maximum number of records per page:
      - We should add a property with a maximum number of records per page
      and per row group.
      - Need to benchmark to figure out a good default. 10K?
   - Iceberg:
      - Some of the Iceberg code should be in Parquet:
         - Rewrote record reconstruction stack:
            - Reuses page reader and decoder.
            - Then does a triple iterator that returns an entire column in a
            file (iterator of triples).
            - Record reconstruction class that handles everything that the
            current one does but with {list, map} factories:
               - 20% faster to write, 5% faster to read.
               - Easier to write object mappers.
            - Helps with page level skipping.
         - High level abstractions in the Iceberg library:
            - Take an expression and simplify it (not, ...) to run on
            metadata.
            - Take a complex expression and split the part on the
            partition/min/max and the remaining part.
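
The two expression-handling steps in the last bullets can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the Iceberg library's API: the classes `Pred`, `Not`, `And`, `Or` and the functions `push_not_down` and `split` are hypothetical names. It assumes that "simplify (not, ...)" means eliminating NOT via De Morgan's laws, and that splitting happens only across top-level ANDs. Null and NaN semantics are ignored.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Pred:
    column: str
    op: str        # one of "<", "<=", ">", ">=", "==", "!="
    value: object


@dataclass(frozen=True)
class Not:
    child: "Expr"


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


Expr = Union[Pred, Not, And, Or]

# Operator that expresses the negation of each comparison (nulls ignored).
NEGATED = {"<": ">=", "<=": ">", ">": "<=", ">=": "<", "==": "!=", "!=": "=="}


def push_not_down(e, negate=False):
    """Return an equivalent expression without any Not nodes."""
    if isinstance(e, Pred):
        return Pred(e.column, NEGATED[e.op], e.value) if negate else e
    if isinstance(e, Not):
        return push_not_down(e.child, not negate)
    left = push_not_down(e.left, negate)
    right = push_not_down(e.right, negate)
    if isinstance(e, And):
        return Or(left, right) if negate else And(left, right)
    return And(left, right) if negate else Or(left, right)


def columns(e):
    """Set of column names referenced by an expression."""
    if isinstance(e, Pred):
        return {e.column}
    if isinstance(e, Not):
        return columns(e.child)
    return columns(e.left) | columns(e.right)


def conjuncts(e):
    """Flatten nested top-level ANDs into a list of conjuncts."""
    if isinstance(e, And):
        return conjuncts(e.left) + conjuncts(e.right)
    return [e]


def conjoin(parts):
    """AND the parts together; None stands for 'always true'."""
    result = None
    for p in parts:
        result = p if result is None else And(result, p)
    return result


def split(e, metadata_columns):
    """Split e into (part answerable from metadata, remaining part)."""
    parts = conjuncts(e)
    meta = [p for p in parts if columns(p) <= metadata_columns]
    rest = [p for p in parts if not columns(p) <= metadata_columns]
    return conjoin(meta), conjoin(rest)


expr = Not(Or(Pred("day", "<", 5), Pred("x", "==", 1)))
simplified = push_not_down(expr)
meta_part, residual = split(simplified, {"day"})
```

Here `simplified` is `day >= 5 AND x != 1`; `meta_part` is `day >= 5`, which could be checked against partition values or min/max statistics, and `residual` is `x != 1`, which still has to be evaluated on the rows themselves.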






On Mon, Aug 27, 2018 at 4:56 AM, Nandor Kollar <nkol...@cloudera.com.invalid> wrote:

> Yes, CEST.
>
> On Mon, Aug 27, 2018 at 1:01 PM, Uwe L. Korn <uw...@xhochy.com> wrote:
> > Hello Nandor,
> >
> > probably I can make this time. Just a timezone question: Is it 6pm CET
> > or 6pm CEST? I guess the latter.
> >
> > See http://timesched.pocoo.org/?date=2018-08-28&tz=central-europe-standard-time!,pacific-standard-time&range=1080,1140
> >
> > Uwe
> >
> > On Mon, Aug 27, 2018, at 12:20 PM, Nandor Kollar wrote:
> >> Hi All,
> >>
> >> As discussed on the last Parquet sync, I propose to have another meeting
> >> on August 28th, at 6pm CET / 9 am PST to discuss those topics which we
> >> didn't have time for on the sync on August 15th, and of course any new
> >> topics too.
> >>
> >> Sorry for the late notice, feel free to propose another time slot if it
> >> is not suitable for you! Calendar entry to follow.
> >>
> >> Regards,
> >> Nandor
>
