[
https://issues.apache.org/jira/browse/ARROW-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jacob Wujciak-Jens updated ARROW-14781:
---------------------------------------
Summary: [Docs][Python] Improved Tooling/Documentation on Constructing
Larger than Memory Parquet (was: Improved Tooling/Documentation on
Constructing Larger than Memory Parquet)
> [Docs][Python] Improved Tooling/Documentation on Constructing Larger than
> Memory Parquet
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-14781
> URL: https://issues.apache.org/jira/browse/ARROW-14781
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Documentation, Python
> Reporter: Damien Ready
> Priority: Minor
>
> I have ~800 GB of CSVs distributed across ~1200 files and a mere 32 GB of RAM.
> My objective is to incrementally build a parquet dataset holding the
> collection. I can only hold a small subset of the data in memory.
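> (For the reading side, a rough sketch of how the CSV batches could presumably
> be streamed without materializing any single file, using pyarrow.csv's
> streaming reader; the glob pattern here is purely illustrative.)
> {code:python}
> import glob
> import pyarrow.csv as pv
>
> def stream_csv_batches(pattern="csvs/*.csv"):
>     # open_csv returns a streaming reader; read_next_batch() yields
>     # RecordBatches one at a time and raises StopIteration when exhausted
>     for path in glob.glob(pattern):
>         reader = pv.open_csv(path)
>         while True:
>             try:
>                 yield reader.read_next_batch()
>             except StopIteration:
>                 break
> {code}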
> Following the docs as best I could, I was able to hack together a workflow
> that does what I need, but it seems overly complex. I hope my problem is not
> out of scope, and I would love to see an effort to:
> 1) streamline the APIs to make this more straightforward,
> 2) provide better documentation on how to approach this problem, and
> 3) ship out-of-the-box CLI utilities that would do this without any effort on
> my part.
> Expanding on 3), I was imagining something like `parquet-cat`,
> `parquet-append`, `parquet-sample`, `parquet-metadata`, or similar tools that
> would allow interacting with these files from the terminal. As it is, parquet
> files are just blobs that require additional tooling to get even the barest
> sense of what is within.
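> To be clear about what I have in mind for 3): much of it already seems
> possible through pyarrow.parquet, so such tools could presumably be thin
> wrappers over the existing readers. A rough sketch (the helper names are made
> up, not existing commands):
> {code:python}
> import sys
> import pyarrow.parquet as pq
>
> def parquet_metadata(path):
>     # report row/row-group counts and the schema without reading any data pages
>     md = pq.read_metadata(path)
>     print(f"rows: {md.num_rows}, row groups: {md.num_row_groups}")
>     print(md.schema)
>
> def parquet_head(path, n=5):
>     # read only the first row group and show its first n rows
>     table = pq.ParquetFile(path).read_row_group(0)
>     print(table.slice(0, n).to_pandas())
>
> if __name__ == "__main__":
>     parquet_metadata(sys.argv[1])
>     parquet_head(sys.argv[1])
> {code}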
> Reproducible example below. Happy to hear what I missed that would have made
> this more straightforward, or that would also generate the parquet metadata
> at the same time.
> EDIT: made the example generate random dataframes so it can be run directly.
> The original was too close to my use case, where I was reading files from disk.
> {code:python}
> import itertools
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> def gen_batches():
>     NUM_CSV_FILES = 15
>     NUM_ROWS = 25
>     for _ in range(NUM_CSV_FILES):
>         dataf = pd.DataFrame(np.random.randint(0, 100, size=(NUM_ROWS, 5)),
>                              columns=list("abcde"))
>         # the PyArrow dataset writer only consumes an iterable of batches
>         for batch in pa.Table.from_pandas(dataf).to_batches():
>             yield batch
>
> batches = gen_batches()
>
> # using the write_dataset method requires providing the schema, which is not
> # accessible from a batch?
> peek_batch = next(batches)
> # needed to build a table to get at the schema
> schema = pa.Table.from_batches([peek_batch]).schema
> # the first entry of the generator was consumed above, so rebuild it here
> renew_gen_batches = itertools.chain([peek_batch], batches)
>
> ds.write_dataset(renew_gen_batches, base_dir="parquet_dst.parquet",
>                  format="parquet", schema=schema)
>
> # attempting write_dataset with an iterable of Tables threw:
> # pyarrow.lib.ArrowTypeError: Could not unwrap RecordBatch from Python object
> # of type 'pyarrow.lib.Table'
> {code}
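> For what it's worth, a somewhat simpler variant of the same write seems
> possible, since a RecordBatch exposes a schema attribute directly, so the
> Table round-trip should not be needed (the base_dir here is arbitrary):
> {code:python}
> import itertools
> import pyarrow.dataset as ds
>
> batches = gen_batches()  # generator from the example above
> first = next(batches)
> # the batch carries its own schema, so no intermediate Table is required
> ds.write_dataset(itertools.chain([first], batches),
>                  base_dir="parquet_dst2.parquet",
>                  format="parquet", schema=first.schema)
> {code}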
--
This message was sent by Atlassian Jira
(v8.20.1#820001)