[jira] [Updated] (PARQUET-478) Reassembly algorithms for Arrow in-memory columnar memory layout

Wes McKinney (JIRA) Thu, 18 Feb 2016 14:32:45 -0800

     [ 
https://issues.apache.org/jira/browse/PARQUET-478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wes McKinney updated PARQUET-478:
---------------------------------
    Description: 
I plan to use parquet-cpp primarily in conjunction with columnar data 
structures (http://arrow.apache.org). 

Specifically, interpreting repetition / definition levels requires:

* Computing null bits / bytes for each logical level of the nested tree (group, 
array, primitive leaf)
* Computing implied array sizes for each repeated group (according to 1, 2, or 
3-level array encoding)
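
To make the two computations above concrete, here is a minimal sketch for one 
specific case: a single optional list of optional int32 values in the 3-level 
encoding. The names ListColumn and ReassembleList are hypothetical and are not 
part of parquet-cpp; the sketch assumes well-formed levels (the first 
repetition level is 0) and handles only one level of nesting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: plain arrays, independent of any columnar library.
struct ListColumn {
  std::vector<bool> list_valid;     // one entry per record: false means null list
  std::vector<int32_t> offsets;     // list i spans [offsets[i], offsets[i + 1])
  std::vector<bool> element_valid;  // one entry per element slot: false means null
};

// Assumed schema (3-level encoding, max definition level 3, max repetition
// level 1):
//   optional group a (LIST) { repeated group list { optional int32 element; } }
// Definition levels: 0 = null list, 1 = empty list, 2 = null element, 3 = value.
ListColumn ReassembleList(const std::vector<int16_t>& def_levels,
                          const std::vector<int16_t>& rep_levels) {
  assert(def_levels.size() == rep_levels.size());
  ListColumn out;
  out.offsets.push_back(0);
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    const int16_t def = def_levels[i];
    if (rep_levels[i] == 0) {
      // Repetition level 0 starts a new record, hence a new list slot.
      out.list_valid.push_back(def >= 1);
      out.offsets.push_back(out.offsets.back());
    }
    if (def >= 2) {
      // The entry occupies an element slot (null or present) in the current list.
      out.element_valid.push_back(def == 3);
      ++out.offsets.back();
    }
  }
  return out;
}
```

For the records [1, null, 2], null, [], [3], the definition levels are 
{3, 2, 3, 0, 1, 3} and the repetition levels are {0, 1, 1, 0, 0, 0}. The sketch 
then yields list validity {true, false, true, true}, offsets {0, 3, 3, 3, 4}, 
and element validity {true, false, true, true}. The implied size of list i is 
offsets[i + 1] - offsets[i].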

The results of this reconstruction will simply be C arrays accompanied by the 
parquet-cpp logical schema; this makes it easy to adapt to different 
in-memory columnar layouts. 

As for implementation, it would make sense to start with functional unit 
tests of the reassembly algorithms, using repetition / definition levels 
declared in the test suite as C++ vectors -- otherwise it will be too tedious 
to produce valid Parquet test data files that explore all of the different 
edge cases.
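
As an illustration of what such vector-driven tests might look like, the sketch 
below declares a few edge cases as plain C++ vectors and checks them against a 
small function under test. Everything here is hypothetical (LevelCase, 
EdgeCases, ListLengths, FailingCases); it assumes a single optional list of 
optional values in the 3-level encoding, where definition level 0 is a null 
list, 1 an empty list, and 2 or 3 an element slot.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical function under test: implied length of each list, with -1
// marking a null list.
std::vector<int32_t> ListLengths(const std::vector<int16_t>& def_levels,
                                 const std::vector<int16_t>& rep_levels) {
  std::vector<int32_t> lengths;
  for (std::size_t i = 0; i < def_levels.size() && i < rep_levels.size(); ++i) {
    if (rep_levels[i] == 0) lengths.push_back(def_levels[i] == 0 ? -1 : 0);
    if (def_levels[i] >= 2 && !lengths.empty()) ++lengths.back();
  }
  return lengths;
}

// One test case: levels declared inline, no Parquet file needed.
struct LevelCase {
  std::string name;
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
  std::vector<int32_t> expected_lengths;
};

std::vector<LevelCase> EdgeCases() {
  return {
      {"no records", {}, {}, {}},
      {"single null list", {0}, {0}, {-1}},
      {"single empty list", {1}, {0}, {0}},
      {"list holding one null element", {2}, {0}, {1}},
      {"mixed records", {3, 2, 3, 0, 1, 3}, {0, 1, 1, 0, 0, 0}, {3, -1, 0, 1}},
  };
}

// Returns the names of the cases whose result differs from the expectation.
std::vector<std::string> FailingCases() {
  std::vector<std::string> failed;
  for (const LevelCase& c : EdgeCases()) {
    if (ListLengths(c.def_levels, c.rep_levels) != c.expected_lengths) {
      failed.push_back(c.name);
    }
  }
  return failed;
}
```

A real test suite would likely wrap each case in its test framework's macros; 
the point of the sketch is only that each edge case is a few short vectors.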

Several other teams (Spark, Drill, Parquet-Java) are currently working on 
related efforts along these lines, so we can engage when appropriate to 
collaborate on algorithms and nuances of this approach to avoid unnecessary 
code churn / bugs. 

  was:
I plan to use parquet-cpp primarily in conjunction with columnar data 
structures. 

Specifically, interpreting repetition / definition levels requires:

* Computing null bits / bytes for each logical level of the nested tree (group, 
array, primitive leaf)
* Computing implied array sizes for each repeated group (according to 1, 2, or 
3-level array encoding)

The results of this reconstruction will simply be C arrays accompanied by the 
parquet-cpp logical schema; this makes it easy to adapt to different 
in-memory columnar layouts. 

As for implementation, it would make sense to start with functional unit 
tests of the reassembly algorithms, using repetition / definition levels 
declared in the test suite as C++ vectors -- otherwise it will be too tedious 
to produce valid Parquet test data files that explore all of the different 
edge cases.

Several other teams (Spark, Drill, Parquet-Java) are currently working on 
related efforts along these lines, so we can engage when appropriate to 
collaborate on algorithms and nuances of this approach to avoid unnecessary 
code churn / bugs. 

        Summary: Reassembly algorithms for Arrow in-memory columnar memory 
layout  (was: Reassembly algorithms for nested in-memory columnar memory layout)

> Reassembly algorithms for Arrow in-memory columnar memory layout
> ----------------------------------------------------------------
>
>                 Key: PARQUET-478
>                 URL: https://issues.apache.org/jira/browse/PARQUET-478
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> I plan to use parquet-cpp primarily in conjunction with columnar data 
> structures (http://arrow.apache.org). 
> Specifically, interpreting repetition / definition levels requires:
> * Computing null bits / bytes for each logical level of the nested tree 
> (group, array, primitive leaf)
> * Computing implied array sizes for each repeated group (according to 1, 2, 
> or 3-level array encoding)
> The results of this reconstruction will simply be C arrays accompanied by the 
> parquet-cpp logical schema; this makes it easy to adapt to different 
> in-memory columnar layouts. 
> As for implementation, it would make sense to start with functional unit 
> tests of the reassembly algorithms, using repetition / definition levels 
> declared in the test suite as C++ vectors -- otherwise it will be too tedious 
> to produce valid Parquet test data files that explore all of the different 
> edge cases.
> Several other teams (Spark, Drill, Parquet-Java) are currently working on 
> related efforts along these lines, so we can engage when appropriate to 
> collaborate on algorithms and nuances of this approach to avoid unnecessary 
> code churn / bugs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
