Topics discussed and decisions (meeting held on 2018 August 15th, at 6pm CET / 9 am PST):
- Aligning page row boundaries between different columns: Debated, please follow up
- Remove Java specific code from parquet-format: Accepted
- Column encryption: Please review
- Parquet-format release: Scope accepted
- C++ mono-repo: Please vote

Aligning page row boundaries between different columns (Gabor)
--------------------------------------------------------------

Background: In the existing specification of column indexes, page boundaries are not aligned between different columns with respect to row count.

Gabor implemented this logic; interested parties can review the code here:
- https://github.com/apache/parquet-mr/pull/509
- https://github.com/apache/parquet-mr/commits/column-indexes

Main takeaways from the implementation:
- Index filtering logic as currently specified is overcomplicated.
- May become a maintenance burden and results in a steep learning curve for onboarding new developers.
- Cannot be made transparent: vectorized readers (Hive, Spark) have to implement similar logic.

Suggestion:
- Align page row boundaries between different columns, i.e. the n-th page of every column should contain the same number of rows.
- Filtering logic would be a lot simpler.
- Vectorized readers will get index-based filtering without any change required on their side.

Response:
- Ryan doesn't recommend it. Performance numbers?
- Discuss offline or on dev mailing list.
- Timeline for reaching decision? Within a week. (Gabor already has a working implementation.)

Remove Java specific code from parquet-format (Nandor)
------------------------------------------------------

Background: Parquet-format contains a few Java classes. Earlier no changes were required in these, but this has changed in recent features, especially with the new column encryption feature, which would add substantial new code.

Suggestion (Nandor): Instead of cluttering parquet-format further with Java-specific code, move these classes to parquet-mr and deprecate them in parquet-format.

What is the motivation behind the status quo?
- Julien: We may need a different Thrift version in the parquet-thrift binding than in the Parquet files themselves. If we move these classes to parquet-mr, we should shade Thrift. Additionally, a Thrift compiler is currently needed only for parquet-format, not for parquet-mr; this will change.
- Gabor: Dockerization may help.
- Julien: We could also merge the two repos altogether.
- Gabor: This, however, would move the specification into the Java implementation, which would be against the cross-language ideology, so let's keep the separate repo for the format.
- Zoltan: Other language bindings should also consider using it directly instead of copying parquet.thrift into their source code.

Column encryption (Gidon)
-------------------------

Under development:
- Key management API (doesn’t provide E2E key management) (PARQUET-1373)
- Anonymization and data masking (PARQUET-1376)

Java PRs under review:
- https://github.com/apache/parquet-mr/pull/471
- https://github.com/apache/parquet-mr/pull/472

C++ PR:
- https://github.com/apache/parquet-cpp/pull/475

We need more testing (both unit tests and interop tests between Java and C++).

Parquet-format release (Zoltan)
-------------------------------

Suggested scope (Zoltan):
- Column encryption
- Nanosecond precision
- Anything else?

Discussion:
- Nothing else to add.
- Wes welcomes the nanosecond precision; it will be needed in parquet-cpp as well.

C++ mono-repo: merging Arrow and parquet-cpp (Wes)
--------------------------------------------------

Background: duplicated CI system and codebase, circular dependencies between libraries.

Suggestion (Wes): move parquet-cpp into the Arrow codebase.

Details can be read here:
https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E

Resolution: No objections but no final decision either, vote on the parquet list:
https://lists.apache.org/thread.html/53f77f9f1f04b97709a0286db1b73a49b7f1541d8f8b2cb32db5c922@%3Cdev.parquet.apache.org%3E
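
Illustration for the page row boundary topic above: the following is a minimal Python sketch, not the parquet-mr implementation from the linked PR. The helper names (`page_starts`, `pages_for_rows`) and the page sizes are hypothetical and chosen only for illustration. It shows why aligned boundaries would simplify index filtering: with unaligned pages, the row ranges selected on one column have to be mapped back to the overlapping pages of every other column, whereas with aligned pages the same page indices apply to all columns.

```python
from bisect import bisect_right
from itertools import accumulate


def page_starts(row_counts):
    """First row index of each page, given the number of rows per page."""
    return [0] + list(accumulate(row_counts))[:-1]


def pages_for_rows(row_counts, row_ranges):
    """Indices of pages overlapping any half-open [start, end) row range.

    Assumes every range lies within the column's total row count.
    """
    starts = page_starts(row_counts)
    total = sum(row_counts)
    pages = set()
    for start, end in row_ranges:
        if start >= end:
            continue  # empty range selects nothing
        first = bisect_right(starts, start) - 1
        last = bisect_right(starts, min(end, total) - 1) - 1
        pages.update(range(first, last + 1))
    return sorted(pages)


# Hypothetical layout: a filter on column "a" keeps only its page 1,
# i.e. rows [100, 200).
selected_rows = [(100, 200)]

# Unaligned: column "b" has different page boundaries, so the row range
# must be translated into b's own pages (and extra rows skipped later).
unaligned_b = [150, 50, 100]
print(pages_for_rows(unaligned_b, selected_rows))  # [0, 1]

# Aligned: the n-th page of every column covers the same rows, so the
# page index found for "a" can be reused directly for "b".
aligned_b = [100, 100, 100]
print(pages_for_rows(aligned_b, selected_rows))  # [1]
```

In the unaligned case the reader of column "b" has to fetch pages 0 and 1 and then discard the rows outside the selected range; in the aligned case page 1 alone is enough and no per-column row bookkeeping is needed.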
