[ 
https://issues.apache.org/jira/browse/ARROW-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajeshwar Agrawal updated ARROW-3949:
-------------------------------------
    Description: 
It would be great to have examples of using the Parquet Arrow high-level API for 
the following 2 cases
 * Storing nested data types (storing nested data types is touted as a major 
merit of Parquet, so I think this case should be included as an example). 
Ideally, an example of how to use {{arrow::StructArray}} nested with several 
primitive types, list types and other nested types would cover every case of a 
nested hierarchy of complex data representations
 * Buffered or batched writes to a Parquet file. Parquet is meant to be used for 
large amounts of data. The current example stores all of the data in Arrow 
data structures before writing to the Parquet file, which has a huge memory 
footprint, proportional to the amount of data being stored. An example of 
writing directly to row groups and columns can nicely demonstrate how to store 
data with a smaller memory footprint. The current example creates an 
{{arrow::Table}}, which needs to be filled with {{arrow::Array}}(s) of the 
entire data, the size of which is bounded by the amount of RAM. Ideally, an 
example that generates some data in several {{arrow::Array}}(s) and then stores 
(appends) them as a new Row Group (or Column Chunk) via a 
{{parquet::arrow::FileWriter}}, using the {{NewRowGroup}} and 
{{WriteColumnChunk}} functions, would demonstrate a lower memory footprint for 
writing a Parquet file with huge amounts of data

  was:
It would be a great help to have examples of using the Parquet Arrow high-level API 
for the following 2 cases
 * Storing nested data types (storing nested data types is touted as a major 
merit of Parquet, so I think this case should be included as an example). 
Ideally, an example of how to use StructArray nested with several primitive 
types, list types and other struct types would cover every case of a nested 
hierarchy of complex data representations
 * Buffered or batched writes to a Parquet file. Parquet is meant to be used for 
large amounts of data. The current examples store all of the data in Arrow 
data structures before writing to the Parquet file. It would be great to include 
an example of batched writes, which is helpful in most use cases of Parquet. The 
current example creates an {{arrow::Table}}, which needs to be filled with 
{{arrow::Array}}(s) of the entire data. Ideally, an example that generates some 
data in several {{arrow::Array}}(s) and then stores (appends) them as a new 
Row Group (or Column Chunk) in an existing (new) parquet file (writer), using 
the {{NewRowGroup}} and {{WriteColumnChunk}} functions, would demonstrate a 
lower memory footprint for writing a Parquet file


> parquet cpp - improve examples
> ------------------------------
>
>                 Key: ARROW-3949
>                 URL: https://issues.apache.org/jira/browse/ARROW-3949
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Rajeshwar Agrawal
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
