Re: Parquet sync meeting minutes

Wes McKinney Fri, 17 Aug 2018 09:36:24 -0700

hi Nandor,

A fine detail, and I may be wrong, but I don't think decisions can
technically be made on a call, because time zones do not always permit
everyone to join and not all collaborators are comfortable having live
discussions in English. See [1].


You can present the consensus of the call participants in the call
summary, and others in the community then have an opportunity to provide
feedback. The "decision" is therefore one based on lazy consensus
thereafter, if there are no objections or follow-up discussion.

- Wes

[1]: https://www.apache.org/foundation/how-it-works.html#management

On Fri, Aug 17, 2018 at 8:38 AM, Nandor Kollar
<[email protected]> wrote:
> Topics discussed and decisions (meeting held on 2018 August 15th, at
> 6pm CET / 9 am PST):
>
> - Aligning page row boundaries between different columns: Debated,
> please follow-up
> - Remove Java specific code from parquet-format: Accepted
> - Column encryption: Please review
> - Parquet-format release: Scope accepted
> - C++ mono-repo: Please vote
>
>
>
> Aligning page row boundaries between different columns (Gabor)
> --------------------------------------------------------------
>
> Background: In the existing specification of column indexes, page
> boundaries are not aligned between different columns with respect to
> row count.
>
> Gabor: implemented this logic, interested parties can review the code here:
> - https://github.com/apache/parquet-mr/pull/509
> - https://github.com/apache/parquet-mr/commits/column-indexes
>
> Main takeaway from implementation:
>
> - Index filtering logic as currently specified is overcomplicated.
> - May become a maintenance burden and results in a steep learning curve
> when onboarding new developers.
> - Cannot be made transparent: vectorized readers (Hive, Spark) have
> to implement similar logic.
>
> Suggestion:
>
> - Align page row boundaries between different columns, i.e. the n-th
> page of every column should contain the same number of rows.
> - Filtering logic would be a lot simpler.
> - Vectorized readers will get index-based filtering without any change
> required on their side.
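
To illustrate the suggestion above, here is a minimal sketch (not the
parquet-mr implementation) of why aligned page boundaries simplify
index-based filtering. It assumes each column exposes a sorted list with
the first row index of each page; the function name and the sample
numbers are hypothetical.

```python
from bisect import bisect_right


def pages_for_row_ranges(first_rows, row_ranges):
    """Return indices of the pages that overlap any half-open [start, end) row range.

    first_rows[i] is the index of the first row stored in page i
    (sorted ascending, first_rows[0] == 0). Hypothetical helper.
    """
    pages = set()
    for start, end in row_ranges:
        if start >= end:
            continue  # empty range selects nothing
        first = bisect_right(first_rows, start) - 1    # page holding row `start`
        last = bisect_right(first_rows, end - 1) - 1   # page holding row `end - 1`
        pages.update(range(first, last + 1))
    return sorted(pages)


# A filter on column A keeps only its page 1, i.e. rows [100, 200).
selected_rows = [(100, 200)]

# Unaligned: column B has different page boundaries, so the row range has
# to be translated, and rows outside [100, 200) must be skipped after reading.
unaligned_b = [0, 150, 250]
print(pages_for_row_ranges(unaligned_b, selected_rows))   # [0, 1]

# Aligned: the n-th page of every column covers the same rows, so the
# page indices selected on column A can be reused for column B as-is.
aligned_b = [0, 100, 200, 300]
print(pages_for_row_ranges(aligned_b, selected_rows))     # [1]
```

In the unaligned case a reader also has to drop the surplus rows from the
pages it loads (rows 0-99 and 200-249 here); in the aligned case whole
pages are either kept or skipped.
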
>
> Response:
> - Ryan doesn't recommend it. Performance numbers?
> - Discuss offline or on dev mailing list
> - Timeline for reaching decision? Within a week. (Gabor already has a
> working implementation.)
>
>
>
> Remove Java specific code from parquet-format (Nandor)
> ------------------------------------------------------
>
> Background: Parquet-format contains a few Java classes. Earlier no
> changes were required in these, but this has changed in recent
> features, especially with the new column encryption feature, which
> would add substantial new code.
>
> Suggestion (Nandor): Instead of cluttering parquet-format further with
> java-specific code, move these classes to parquet-mr and deprecate
> them in parquet-format.
>
> What is the motivation behind the status quo? Julien: We may need a
> different Thrift version in the parquet-thrift binding than in the
> parquet files themselves. If we move these classes to parquet-mr, we
> should shade Thrift. Additionally, a Thrift compiler is currently only
> needed for parquet-format, not for parquet-mr; this will change. Gabor:
> Dockerization may help.
>
> Julien: We could merge the two repos altogether as well. Gabor: This,
> however, would move the specification into the Java implementation,
> which would be against the cross-language ideology, so let's keep the
> separate repo for the format. Zoltan: Other language bindings should
> also consider using it directly instead of copying parquet.thrift into
> their source code.
>
>
>
> Column encryption (Gidon)
> -------------------------
>
> Under development:
> - Key management API (doesn’t provide E2E key management) (PARQUET-1373)
> - Anonymization and data masking (PARQUET-1376)
>
> Java PRs under review:
> - https://github.com/apache/parquet-mr/pull/471
> - https://github.com/apache/parquet-mr/pull/472
>
> C++ PR:
> - https://github.com/apache/parquet-cpp/pull/475
>
>
> We need more testing (both unit tests and interop tests between Java and C++).
>
>
>
> Parquet-format release (Zoltan)
> -------------------------------
>
> Suggested scope (Zoltan):
> - Column encryption
> - Nanosec precision
> - Anything else?
>
> Discussion:
> - Nothing else to add.
> - Wes welcomes the nanosecond precision; it will be needed in parquet-cpp as well.
>
>
>
> C++ mono-repo: merging Arrow and parquet-cpp (Wes)
> --------------------------------------------------
>
>
> Background: duplicated CI system and codebase, circular dependencies
> between libraries
>
> Suggestion (Wes): move parquet-cpp into arrow codebase. Details can be
> read here: 
> https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E
>
>
> Resolution: No objections but no final decision either, vote on the
> parquet list: 
> https://lists.apache.org/thread.html/53f77f9f1f04b97709a0286db1b73a49b7f1541d8f8b2cb32db5c922@%3Cdev.parquet.apache.org%3E
