Re: Parquet sync meeting minutes

Zoltan Ivanfi Fri, 17 Aug 2018 09:51:43 -0700

Hi,

Sorry, that was an error on my side: I suggested that Nandor add a TLDR
section with this title. I agree with your comment, Wes; "outcome" would
have been a better choice of word than "decision".


Br,

Zoltan

On Fri, Aug 17, 2018 at 6:36 PM Wes McKinney <[email protected]> wrote:

> hi Nandor,
>
> A fine detail, and I may be wrong, but I don't think decisions can
> technically be made on a call because time zones do not permit
> everyone to join always and not all collaborators are comfortable
> having live discussions in English. See [1].
>
> You can present the consensus of the participants in the call summary
> and others in the community have an opportunity to provide feedback.
> The "decision" is therefore one based on lazy consensus thereafter, if
> there are no objections or follow-up discussion.
>
> - Wes
>
> [1]: https://www.apache.org/foundation/how-it-works.html#management
>
> On Fri, Aug 17, 2018 at 8:38 AM, Nandor Kollar
> <[email protected]> wrote:
> > Topics discussed and decisions (meeting held on 2018 August 15th, at
> > 6pm CET / 9 am PST):
> >
> > - Aligning page row boundaries between different columns: Debated,
> > please follow up
> > - Remove Java specific code from parquet-format: Accepted
> > - Column encryption: Please review
> > - Parquet-format release: Scope accepted
> > - C++ mono-repo: Please vote
> >
> >
> >
> > Aligning page row boundaries between different columns (Gabor)
> > --------------------------------------------------------------
> >
> > Background: In the existing specification of column indexes, page
> > boundaries are not aligned between different columns with respect to
> > row count.
> >
> > Gabor implemented this logic; interested parties can review the code
> > here:
> > - https://github.com/apache/parquet-mr/pull/509
> > - https://github.com/apache/parquet-mr/commits/column-indexes
> >
> > Main takeaways from the implementation:
> >
> > - Index filtering logic as currently specified is overcomplicated.
> > - It may become a maintenance burden and results in a steep learning
> > curve for onboarding new developers.
> > - It cannot be made transparent: vectorized readers (Hive, Spark) have
> > to implement similar logic.
> >
> > Suggestion:
> >
> > - Align page row boundaries between different columns, i.e. the n-th
> > page of every column should contain the same number of rows.
> > - Filtering logic would be a lot simpler.
> > - Vectorized readers will get index-based filtering without any change
> > required on their side.
> >
> > Response:
> > - Ryan doesn't recommend it. Performance numbers?
> > - Discuss offline or on dev mailing list
> > - Timeline for reaching decision? Within a week. (Gabor already has a
> > working implementation.)
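
To illustrate why aligned boundaries would simplify filtering, here is a
minimal Python sketch. It is not the parquet-mr implementation: the page
layout, the row counts and the helper names (page_row_ranges,
pages_overlapping) are all made up for illustration, and it assumes each
column's pages are described only by the index of their first row.

```python
from typing import List, Tuple

RowRange = Tuple[int, int]  # half-open range [start, end)


def page_row_ranges(first_rows: List[int], total_rows: int,
                    pages: List[int]) -> List[RowRange]:
    """Translate page indexes of one column into the row ranges they cover."""
    ranges = []
    for page in pages:
        start = first_rows[page]
        end = first_rows[page + 1] if page + 1 < len(first_rows) else total_rows
        ranges.append((start, end))
    return ranges


def pages_overlapping(first_rows: List[int], total_rows: int,
                      row_ranges: List[RowRange]) -> List[int]:
    """Return the pages of a column that overlap any of the given row ranges."""
    selected = []
    for page in range(len(first_rows)):
        start, end = page_row_ranges(first_rows, total_rows, [page])[0]
        if any(start < r_end and r_start < end for r_start, r_end in row_ranges):
            selected.append(page)
    return selected


# Hypothetical row group with 100 rows and two columns.
TOTAL_ROWS = 100
col_a_first_rows = [0, 40, 70]      # column A: 3 pages
col_b_first_rows = [0, 25, 50, 75]  # column B: 4 pages, not aligned with A

# Assume the index lookup on column A kept only its page 1.
matching_a_pages = [1]

# Unaligned case: go through row ranges to find the pages of column B.
ranges = page_row_ranges(col_a_first_rows, TOTAL_ROWS, matching_a_pages)
unaligned_b_pages = pages_overlapping(col_b_first_rows, TOTAL_ROWS, ranges)

# Aligned case: the n-th page of every column holds the same rows, so the
# page indexes selected for column A apply to column B unchanged.
aligned_b_pages = list(matching_a_pages)

print(ranges, unaligned_b_pages, aligned_b_pages)
```

In the unaligned case the reader still has to skip rows 25..39 and 70..74
inside the selected pages of column B, which is the extra logic every
vectorized reader would need; in the aligned case the selected page
indexes can be reused as they are.
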
> >
> >
> >
> > Remove Java specific code from parquet-format (Nandor)
> > ------------------------------------------------------
> >
> > Background: Parquet-format contains a few Java classes. Earlier no
> > changes were required in these, but this has changed in recent
> > features, especially with the new column encryption feature, which
> > would add substantial new code.
> >
> > Suggestion (Nandor): Instead of cluttering parquet-format further with
> > java-specific code, move these classes to parquet-mr and deprecate
> > them in parquet-format.
> >
> > What is the motivation behind the status quo? Julien: We may need a
> > different Thrift version in the parquet-thrift binding than in the
> > parquet files themselves. If we move these classes to parquet-mr, we
> > should shade Thrift. Additionally, a Thrift compiler is currently
> > needed only for parquet-format, not parquet-mr; this will change.
> > Gabor: Dockerization may help.
> >
> > Julien: We could merge the two repos altogether as well. Gabor: This,
> > however, would move the specification into the Java implementation,
> > which would be against the cross-language ideology, so let's keep the
> > separate repo for the format. Zoltan: Other language bindings should
> > also consider using it directly instead of copying parquet.thrift into
> > their source code.
> >
> >
> >
> > Column encryption (Gidon)
> > -------------------------
> >
> > Under development:
> > - Key management API (doesn’t provide E2E key management) (PARQUET-1373)
> > - Anonymization and data masking (PARQUET-1376)
> >
> > Java PRs under review:
> > - https://github.com/apache/parquet-mr/pull/471
> > - https://github.com/apache/parquet-mr/pull/472
> >
> > C++ PR:
> > - https://github.com/apache/parquet-cpp/pull/475
> >
> >
> > We need more testing (both unit tests and interop tests between Java
> > and C++).
> >
> >
> >
> > Parquet-format release (Zoltan)
> > -------------------------------
> >
> > Suggested scope (Zoltan):
> > - Column encryption
> > - Nanosecond precision
> > - Anything else?
> >
> > Discussion:
> > - Nothing else to add.
> > - Wes welcomes nanosecond precision; it will be needed in parquet-cpp
> > as well.
> >
> >
> >
> > C++ mono-repo: merging Arrow and parquet-cpp (Wes)
> > --------------------------------------------------
> >
> >
> > Background: duplicated CI system and codebase, circular dependencies
> > between libraries
> >
> > Suggestion (Wes): move parquet-cpp into the Arrow codebase. Details
> > can be read here:
> > https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E
> >
> >
> > Resolution: No objections, but no final decision either; please vote
> > on the Parquet list:
> > https://lists.apache.org/thread.html/53f77f9f1f04b97709a0286db1b73a49b7f1541d8f8b2cb32db5c922@%3Cdev.parquet.apache.org%3E
>