[ https://issues.apache.org/jira/browse/ARROW-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacob Wujciak-Jens updated ARROW-14781:
---------------------------------------
    Summary: [Docs][Python] Improved Tooling/Documentation on Constructing Larger than Memory Parquet  (was: Improved Tooling/Documentation on Constructing Larger than Memory Parquet)

> [Docs][Python] Improved Tooling/Documentation on Constructing Larger than 
> Memory Parquet
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-14781
>                 URL: https://issues.apache.org/jira/browse/ARROW-14781
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Documentation, Python
>            Reporter: Damien Ready
>            Priority: Minor
>
> I have ~800 GB of CSVs distributed across ~1200 files and a mere 32 GB of 
> RAM. My objective is to incrementally build a Parquet dataset holding the 
> collection; I can only ever hold a small subset of the data in memory.
> Following the docs as best I could, I was able to hack together a workflow 
> that does what I need, but it seems overly complex. I hope my problem is not 
> out of scope, and I would love to see an effort to:
> 1) streamline the APIs to make this more straightforward,
> 2) improve the documentation on how to approach this problem, and
> 3) provide out-of-the-box CLI utilities that would do this without any 
> effort on my part.
> Expanding on 3), I was imagining something like `parquet-cat`, 
> `parquet-append`, `parquet-sample`, `parquet-metadata`, or similar utilities 
> that would allow interacting with these files from the terminal. As it is, 
> they are just blobs that require additional tooling to get even the barest 
> sense of what is within.
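> To illustrate the gap: the closest thing today seems to be dropping into 
> Python and poking at the files with `pyarrow.parquet`. A minimal sketch of 
> that (the file path below is just a placeholder):
> {code:python}
> # Rough, by-hand equivalent of a hypothetical `parquet-metadata` /
> # `parquet-cat`; "some_file.parquet" is a placeholder path.
> import pyarrow.parquet as pq
>
> pf = pq.ParquetFile("some_file.parquet")
> print(pf.metadata)       # num rows, row groups, created_by, ...
> print(pf.schema_arrow)   # the file's Arrow schema
>
> # peek at a handful of rows without loading the whole file
> first_batch = next(pf.iter_batches(batch_size=5))
> print(first_batch.to_pandas())
> {code}
> Having those few lines packaged as terminal commands is essentially what I 
> am asking for in 3).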
> Reproducible example below. Happy to hear what I missed that would have made 
> this more straightforward, or that would also generate the Parquet metadata 
> at the same time.
> EDIT: made the example generate random dataframes so it can be run directly. 
> The original was too close to my use case, where I was reading files from 
> disk.
> {code:python}
> import itertools
>
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.dataset as ds
>
>
> def gen_batches():
>     NUM_CSV_FILES = 15
>     NUM_ROWS = 25
>     for _ in range(NUM_CSV_FILES):
>         dataf = pd.DataFrame(
>             np.random.randint(0, 100, size=(NUM_ROWS, 5)), columns=list("abcde")
>         )
>         # the PyArrow dataset writer will only consume an iterable of batches
>         for batch in pa.Table.from_pandas(dataf).to_batches():
>             yield batch
>
>
> batches = gen_batches()
>
> # using the write_dataset method requires providing the schema, which is not
> # accessible from a batch?
> peek_batch = next(batches)
> # needed to build a table to get to the schema
> schema = pa.Table.from_batches([peek_batch]).schema
> # consumed the first entry of the generator, rebuild it here
> renew_gen_batches = itertools.chain([peek_batch], batches)
>
> ds.write_dataset(renew_gen_batches, base_dir="parquet_dst.parquet",
>                  format="parquet", schema=schema)
>
> # attempting write_dataset with an iterable of Tables threw:
> # pyarrow.lib.ArrowTypeError: Could not unwrap RecordBatch from Python object
> # of type 'pyarrow.lib.Table'
> {code}
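> For contrast, below is the kind of streamlined path I was hoping the docs 
> would spell out. A sketch only, assuming the CSV files sit under one 
> directory and share a schema (the `csv_dir` and `parquet_dst` paths are 
> placeholders), and assuming I understand `file_visitor` correctly for 
> collecting the `_metadata` sidecar in the same pass:
> {code:python}
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
>
> # let pyarrow.dataset stream the CSVs; reading and writing happen in
> # batches, so the full collection never has to fit in memory
> csv_dataset = ds.dataset("csv_dir", format="csv")   # placeholder directory
>
> # collect per-file Parquet metadata while writing, so a `_metadata`
> # sidecar can be produced in the same pass
> collected = []
> ds.write_dataset(
>     csv_dataset,
>     base_dir="parquet_dst",
>     format="parquet",
>     file_visitor=lambda written_file: collected.append(written_file.metadata),
> )
> pq.write_metadata(csv_dataset.schema, "parquet_dst/_metadata",
>                   metadata_collector=collected)
> {code}
> (Also, on the comment in my example above: it looks like `RecordBatch` does 
> expose a `.schema` attribute, so the `Table.from_batches` round-trip may be 
> unnecessary, even though the schema still has to be passed to `write_dataset` 
> explicitly when handing it a plain generator.)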



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
