Thanks to the dozens of folks who have found time to read the design
Google doc since the last Parquet sync.

Now that the traffic peak at the doc is over, I'll be handling the overlap
with the new Encryption.md file. Maintaining two versions in parallel is
becoming difficult and is unnecessary, so the overlapping part will
be removed from the Google doc. The Encryption.md
<https://github.com/apache/parquet-format/pull/101/files> (formatted here
<https://github.com/ggershinsky/parquet-format/blob/p1232-encryption-docs/Encryption.md>)
and the current Thrift file
<https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
together provide a technically accurate description, down to a single byte,
of the encryption format and the writer/reader protocol. You can leave new
comments on the document pull request.

Old comments are still available in the Google doc; press the comments
button for the Dec'17 to Aug'18 comment history. You can also read the
review comments on the pull requests, merged (94
<https://github.com/apache/parquet-format/pull/94>, 103
<https://github.com/apache/parquet-format/pull/103>, 104
<https://github.com/apache/parquet-format/pull/104> in parquet-format, 463
<https://github.com/apache/parquet-cpp/pull/463>, 464
<https://github.com/apache/parquet-cpp/pull/464> in parquet-cpp) and open (
95 <https://github.com/apache/parquet-format/pull/95>*, 471
<https://github.com/apache/parquet-mr/pull/471>, 472
<https://github.com/apache/parquet-mr/pull/472> in parquet-mr and 475
<https://github.com/apache/parquet-cpp/pull/475> in parquet-cpp).

Besides the comment history, the Google doc will keep the API description
("Usage samples" section). The sample code is in Java, but the same API is
available in the C++ Parquet version (thanks to Tham Ha for the hard work on
this!).

Cheers, Gidon.



On Wed, Aug 29, 2018 at 12:41 PM Nandor Kollar <nkol...@cloudera.com.invalid>
wrote:

> Hi all,
>
> Yesterday we talked about the status of the columnar encryption, and
> agreed that before anything related to it gets released, we need a
> reviewed spec. Actually, Gidon has already opened a PR for this:
> https://github.com/apache/parquet-format/pull/101, it is based on the
> design doc (
> https://docs.google.com/document/d/1T89G7xR0zHFV1f2pjTO28jtfVm8qoNVGEJQ70Rsk-bY/edit
> )
> written by him. Julien, Ryan what do you think - is there anything
> else needed?
>
> Regards,
> Nandor
>
> On Tue, Aug 28, 2018 at 7:16 PM, Julien Le Dem
> <julien.le...@wework.com.invalid> wrote:
> > Notes:
> > Anna (Cloudera): Bloom filter update, Iceberg
> > Gabor, Nandor (Cloudera):
> >
> >    - Value skipping implementation to be reviewed. Move Java code from
> >    parquet-format to parquet-mr. PR ready
> >    - How can users of Parquet handle timestamps and TZs. Allow for writing
> >    timestamp in java. Refactor original type logic to more flexible new
> >    original type api.
> >    - Column indexes and alignment of pages
> >    - Limiting the number of records in a page to avoid skewed splits when
> >    compression is really good.
> >
> > Ryan (Netflix): Iceberg stuff back to Parquet: expression library for push
> > down. Dictionary and stats based row group filtering.
> > JunJie (Intel): Bloom filter. Need more reviews. Have a vote on the design
> > and add it to parquet-format.
> > Julien (Wework): Encryption.
> >
> >
> >    - Bloom Filter:
> >    https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-41
> >    <https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-41?filter=allopenissues>
> >       - Committed utility class to parquet-cpp
> >       - Uploaded the benchmark result.
> >       - Ready to add into the spec.
> >       - Submit a PR for the parquet reader spec.
> >       - *Action*: review parquet java utility class.
> >       https://github.com/apache/parquet-mr/pull/425
> >    - Encryption:
> >       - Nandor, Gabor reviewing.
> >       - APIs to allow pluggable key management.
> >       - Need to have a proper review of the spec.
> >       - Need more testing
> >    - Column indices:
> >       - PR to be reviewed: https://github.com/apache/parquet-mr/pull/514
> >       - Ryan: to review features branch
> >    - Moving java code from parquet-format to parquet-mr:
> >       - Action: review. https://github.com/apache/parquet-mr/pull/517
> >       - Gets the thrift file from the parquet-format released artifact.
> >    - Maximum number of records per page:
> >       - We should add a property with a maximum number of records per page
> >       and per row group.
> >       - Need to benchmark to figure out a good default. 10K?
> >    - Iceberg:
> >       - Some of the iceberg code should be in Parquet:
> >          - Rewrote record reconstruction stack
> >             - Reuses page reader and decoder
> >             - Then does a triple iterator that returns an entire column in a
> >             file (iterator of triples)
> >             - Record reconstruction class that handles everything that the
> >             current one does but with {list, map} factories
> >                - 20% faster to write, 5% faster to read
> >                - Easier to write object mappers
> >             - Helps with page level skipping.
> >          - High level abstractions in the iceberg library:
> >             - Take an expression and simplify it (not, ...) to run on
> >             metadata
> >             - Take a complex expression and split the part on the
> >             partition/min/max and the remaining part.
> >
> >
> >
> >
> >
> >
> > On Mon, Aug 27, 2018 at 4:56 AM, Nandor Kollar
> > <nkol...@cloudera.com.invalid> wrote:
> >
> >> Yes, CEST.
> >>
> >> On Mon, Aug 27, 2018 at 1:01 PM, Uwe L. Korn <uw...@xhochy.com> wrote:
> >> > Hello Nandor,
> >> >
> >> > probably I can make this time. Just a timezone question: Is it 6pm CET
> >> > or 6pm CEST? I guess the latter.
> >> >
> >> > See http://timesched.pocoo.org/?date=2018-08-28&tz=central-europe-standard-time!,pacific-standard-time&range=1080,1140
> >> >
> >> > Uwe
> >> >
> >> > On Mon, Aug 27, 2018, at 12:20 PM, Nandor Kollar wrote:
> >> >> Hi All,
> >> >>
> >> >> As discussed at the last Parquet sync, I propose to have another meeting
> >> >> on August 28th, at 6pm CET / 9 am PST to discuss the topics which we
> >> >> didn't have time for at the sync on August 15th, and of course any new
> >> >> topics too.
> >> >>
> >> >> Sorry for the late notice, feel free to propose another time slot if it
> >> >> is not suitable for you! Calendar entry to follow.
> >> >>
> >> >> Regards,
> >> >> Nandor
> >>
>
