[
https://issues.apache.org/jira/browse/PARQUET-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney moved ARROW-3949 to PARQUET-1526:
----------------------------------------------
Component/s: (was: C++)
parquet-cpp
Workflow: patch-available, re-open possible (was: jira)
Key: PARQUET-1526 (was: ARROW-3949)
Project: Parquet (was: Apache Arrow)
> [C++] parquet cpp - improve examples
> ------------------------------------
>
> Key: PARQUET-1526
> URL: https://issues.apache.org/jira/browse/PARQUET-1526
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Rajeshwar Agrawal
> Priority: Minor
>
> It would be great to have examples of using the parquet arrow high-level
> API for the following two cases:
> * Storing nested data types (support for nested data types is touted as a
> major merit of parquet, so this case should be included as an example).
> Ideally, an example of an {{arrow::StructArray}} nesting several
> primitive types, list types and other nested types would cover every case
> of a nested hierarchy of complex data representations.
> * Buffered or batched writes to a parquet file. Parquet is meant to be used
> for large amounts of data. The current example stores all of the data in
> arrow data structures before writing it to the parquet file, which has a
> huge memory footprint, proportional to the amount of data being stored. An
> example of writing directly to row groups and columns can nicely demonstrate
> how to store data with a smaller memory footprint. The current example
> creates an {{arrow::Table}}, which needs to be filled with
> {{arrow::Array}}(s) holding the entire data, whose size is bounded by the
> amount of RAM. Ideally, an example would generate data in several
> {{arrow::Array}}(s) and then store (append) each one as a new Row Group (or
> Column Chunk) through a {{parquet::arrow::FileWriter}}, using the
> {{NewRowGroup}} and {{WriteColumnChunk}} functions, thus demonstrating a
> lower memory footprint for writing a parquet file with huge amounts of data.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)