Topics discussed and decisions (meeting held on August 15th, 2018, at
6 pm CET / 9 am PST):

- Aligning page row boundaries between different columns: Debated,
please follow up
- Remove Java specific code from parquet-format: Accepted
- Column encryption: Please review
- Parquet-format release: Scope accepted
- C++ mono-repo: Please vote



Aligning page row boundaries between different columns (Gabor)
--------------------------------------------------------------

Background: In the existing specification of column indexes, page
boundaries are not aligned between different columns with respect to
row count.

Gabor: implemented this logic, interested parties can review the code here:
- https://github.com/apache/parquet-mr/pull/509
- https://github.com/apache/parquet-mr/commits/column-indexes

Main takeaway from implementation:

- Index filtering logic as currently specified is overcomplicated.
- May become a maintenance burden and result in a steep learning curve
when onboarding new developers.
- Cannot be made transparent: vectorized readers (Hive, Spark) have
to implement similar logic.
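
To illustrate the point about unaligned pages, here is a minimal Python
sketch. It is not the parquet-mr implementation; the helper
pages_overlapping, the column names and the page boundaries are all
hypothetical. It only shows that one set of matching rows maps to
different page indexes in each column.

```python
def pages_overlapping(first_rows, row_count, row_ranges):
    """Return indexes of pages that overlap any [start, stop) row range.

    first_rows[i] is the index of the first row stored in page i.
    """
    selected = []
    for page, first in enumerate(first_rows):
        # A page ends where the next one starts, or at the end of the row group.
        end = first_rows[page + 1] if page + 1 < len(first_rows) else row_count
        if any(start < end and first < stop for start, stop in row_ranges):
            selected.append(page)
    return selected


# Hypothetical row group of 100 rows; each column has its own page boundaries.
first_rows_by_column = {
    "a": [0, 40, 80],
    "b": [0, 25, 50, 75],
}

# Assume the index on column "a" says only rows 40..79 can match.
matching_rows = [(40, 80)]

pages_to_read = {
    name: pages_overlapping(first_rows, 100, matching_rows)
    for name, first_rows in first_rows_by_column.items()
}
print(pages_to_read)  # {'a': [1], 'b': [1, 2, 3]}
```

In this example the pages selected for column "b" cover rows 25..99, so
a reader still has to skip the rows outside 40..79 to keep the columns
in sync. This is the kind of logic each vectorized reader would have to
reimplement.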

Suggestion:

- Align page row boundaries between different columns, i.e. the n-th
page of every column should contain the same number of rows.
- Filtering logic would be a lot simpler.
- Vectorized readers will get index-based filtering without any change
required on their side.
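
A minimal Python sketch of why aligned boundaries simplify filtering.
This is an illustration under stated assumptions, not Gabor's
implementation: the helper matching_pages, the column names and the
min/max statistics are hypothetical.

```python
def matching_pages(page_min_max, lower, upper):
    """Return indexes of pages whose [min, max] may hold a value in [lower, upper]."""
    return [
        i
        for i, (page_min, page_max) in enumerate(page_min_max)
        if page_min <= upper and page_max >= lower
    ]


# Hypothetical min/max statistics for the pages of the filtered column "a".
stats_a = [(0, 9), (10, 19), (20, 29)]
selected = matching_pages(stats_a, 12, 15)

# With aligned boundaries, page n covers the same rows in every column,
# so the page indexes selected once apply to all columns unchanged.
columns = ["a", "b", "c"]
pages_to_read = {name: selected for name in columns}
print(pages_to_read)  # {'a': [1], 'b': [1], 'c': [1]}
```

No translation from row ranges to per-column pages is needed here, and
a reader that simply consumes whole pages stays in sync across columns.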

Response:
- Ryan doesn't recommend it. Performance numbers?
- Discuss offline or on dev mailing list
- Timeline for reaching decision? Within a week. (Gabor already has a
working implementation.)



Remove Java specific code from parquet-format (Nandor)
------------------------------------------------------

Background: Parquet-format contains a few Java classes. Previously
these required no changes, but recent features do change them,
especially the new column encryption feature, which would add
substantial new code.

Suggestion (Nandor): Instead of cluttering parquet-format further with
Java-specific code, move these classes to parquet-mr and deprecate
them in parquet-format.

What is the motivation behind the status quo? Julien: We may need a
different Thrift version in the parquet-thrift binding than in the
parquet files themselves. If we move these classes to parquet-mr, we
should shade Thrift. Additionally, a Thrift compiler is currently
needed only for parquet-format, not for parquet-mr; this would change.
Gabor: Dockerization may help.

Julien: We could also merge the two repos altogether. Gabor: This,
however, would move the specification into the Java implementation,
which would go against the cross-language ideology, so let's keep the
separate repo for the format. Zoltan: Other language bindings should
also consider using it directly instead of copying parquet.thrift into
their source code.



Column encryption (Gidon)
-------------------------

Under development:
- Key management API (doesn’t provide E2E key management) (PARQUET-1373)
- Anonymization and data masking (PARQUET-1376)

Java PRs under review:
- https://github.com/apache/parquet-mr/pull/471
- https://github.com/apache/parquet-mr/pull/472

C++ PR:
- https://github.com/apache/parquet-cpp/pull/475


We need more testing (both unit tests and interop tests between Java and C++).



Parquet-format release (Zoltan)
-------------------------------

Suggested scope (Zoltan):
- Column encryption
- Nanosecond precision
- Anything else?

Discussion:
- Nothing else to add.
- Wes welcomes the nanosecond precision; it will be needed in parquet-cpp as well.



C++ mono-repo: merging Arrow and parquet-cpp (Wes)
--------------------------------------------------


Background: duplicated CI system and codebase, circular dependencies
between libraries

Suggestion (Wes): move parquet-cpp into arrow codebase. Details can be
read here: 
https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E


Resolution: No objections but no final decision either, vote on the
parquet list: 
https://lists.apache.org/thread.html/53f77f9f1f04b97709a0286db1b73a49b7f1541d8f8b2cb32db5c922@%3Cdev.parquet.apache.org%3E
