Wes McKinney created PARQUET-478:
------------------------------------
Summary: Reassembly algorithms for nested in-memory columnar
memory layout
Key: PARQUET-478
URL: https://issues.apache.org/jira/browse/PARQUET-478
Project: Parquet
Issue Type: New Feature
Components: parquet-cpp
Reporter: Wes McKinney
I plan to use parquet-cpp primarily in conjunction with columnar data
structures.
Specifically, interpreting repetition / definition levels requires:
* Computing null bits / bytes for each logical level of the nested tree (group,
array, primitive leaf)
* Computing implied array sizes for each repeated group (according to 1, 2, or
3-level array encoding)
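
To make the two computations above concrete, here is a minimal sketch for one
specific case: an optional list of optional int32 values in the 3-level
encoding (max definition level 3, max repetition level 1). The ListColumn
struct and the ReassembleOptionalList function are hypothetical names chosen
for illustration and are not part of parquet-cpp. The sketch uses null bytes
rather than bits, and assumes well-formed levels that start with repetition
level 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical output container (not a parquet-cpp type).
struct ListColumn {
  std::vector<uint8_t> list_valid;     // one byte per record: 1 = list is non-null
  std::vector<int32_t> offsets;        // records + 1 entries; sizes are differences
  std::vector<uint8_t> element_valid;  // one byte per element slot: 1 = non-null
};

// Assumed schema:
//   optional group a (LIST) { repeated group list { optional int32 element; } }
// Definition levels: 0 = list is null, 1 = list is empty,
//                    2 = element is null, 3 = element is present.
// Repetition levels: 0 = first entry of a new record, 1 = same list continues.
ListColumn ReassembleOptionalList(const std::vector<int16_t>& def_levels,
                                  const std::vector<int16_t>& rep_levels) {
  assert(def_levels.size() == rep_levels.size());
  ListColumn out;
  out.offsets.push_back(0);
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    const int16_t def = def_levels[i];
    if (rep_levels[i] == 0) {
      // A new record starts: record list nullness and open a new offset slot.
      out.list_valid.push_back(def >= 1 ? 1 : 0);
      out.offsets.push_back(out.offsets.back());
    }
    if (def >= 2) {
      // An element slot exists (null or not) and extends the current list.
      out.element_valid.push_back(def == 3 ? 1 : 0);
      ++out.offsets.back();
    }
  }
  return out;
}
```

For the records [1, null, 2], null, [], [3], the definition levels are
{3, 2, 3, 0, 1, 3} and the repetition levels are {0, 1, 1, 0, 0, 0}. The sketch
then yields list_valid = {1, 0, 1, 1}, offsets = {0, 3, 3, 3, 4}, and
element_valid = {1, 0, 1, 1}.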
The results of this reconstruction will simply be C arrays accompanied by the
parquet-cpp logical schema, which makes it easy to adapt them to different
in-memory columnar layouts.
As for implementation, it would make sense to proceed first with functional
unit tests of the reassembly algorithms using repetition / definition levels
declared in the test suite as C++ vectors -- otherwise it's going to be too
tedious trying to produce valid Parquet test data files which explore all of
the different edge cases.
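
As a rough illustration of that testing style, the sketch below declares
levels as C++ vectors in a small table of edge cases. It assumes a bare
`repeated int32` field (1-level encoding, max definition level 1, max
repetition level 1). RepeatedFieldSizes, LevelCase, EdgeCases, and
RunEdgeCases are hypothetical names, and plain comparisons stand in for
whatever test framework the suite ends up using.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical function under test: per-record lengths for a bare
// `repeated int32` field. Definition level 0 marks an empty record,
// definition level 1 marks a present value; repetition level 0 starts
// a new record.
std::vector<int32_t> RepeatedFieldSizes(const std::vector<int16_t>& def_levels,
                                        const std::vector<int16_t>& rep_levels) {
  assert(def_levels.size() == rep_levels.size());
  std::vector<int32_t> sizes;
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (rep_levels[i] == 0) sizes.push_back(0);  // new record begins
    assert(!sizes.empty());                      // levels must start with rep 0
    if (def_levels[i] == 1) ++sizes.back();      // a value extends the record
  }
  return sizes;
}

// One test case: input levels plus the expected implied array sizes.
struct LevelCase {
  std::string name;
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
  std::vector<int32_t> expected_sizes;
};

// Edge cases that would be tedious to produce as real Parquet files.
std::vector<LevelCase> EdgeCases() {
  return {
      {"no records", {}, {}, {}},
      {"single empty record", {0}, {0}, {0}},
      {"all records empty", {0, 0, 0}, {0, 0, 0}, {0, 0, 0}},
      {"mixed", {1, 1, 0, 1}, {0, 1, 0, 0}, {2, 0, 1}},
  };
}

// Returns true only if every case produces its expected sizes.
bool RunEdgeCases() {
  for (const LevelCase& c : EdgeCases()) {
    if (RepeatedFieldSizes(c.def_levels, c.rep_levels) != c.expected_sizes) {
      return false;
    }
  }
  return true;
}
```

Adding a new edge case is then a one-line change to the table, with no Parquet
file needed.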
Several other teams (Spark, Drill, Parquet-Java) are currently working on
related efforts along these lines, so we can engage when appropriate to
collaborate on algorithms and nuances of this approach to avoid unnecessary
code churn / bugs.