Niklas B created ARROW-10052:
--------------------------------
Summary: [Python] Incrementally using ParquetWriter keeps data in
memory (eventually running out of RAM for large datasets)
Key: ARROW-10052
URL: https://issues.apache.org/jira/browse/ARROW-10052
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 1.0.1
Reporter: Niklas B
This ticket refers to the discussion between [~emkornfield] and me on the
mailing list: "Incrementally using ParquetWriter without keeping entire dataset
in memory (larger-than-memory parquet files)" (not yet available on the mail
archives)
Original post:
{quote}Hi,
I'm trying to write a large parquet file onto disk (larger than memory) using
PyArrow's ParquetWriter and write_table, but even though the file is written
incrementally to disk, it still appears to keep the entire dataset in memory
(eventually getting OOM killed). Basically, what I am trying to do is:
with pq.ParquetWriter(
    output_file,
    arrow_schema,
    compression='snappy',
    allow_truncated_timestamps=True,
    version='2.0',  # Highest available format version
    data_page_version='2.0',  # Highest available data page version
) as writer:
    for rows_dataframe in function_that_yields_data():
        writer.write_table(
            pa.Table.from_pydict(
                rows_dataframe,
                arrow_schema
            )
        )
That is, I have a function that yields data, which I then write in chunks
using write_table.
Is it possible to force the ParquetWriter to not keep the entire dataset in
memory, or is it simply not possible for good reasons?
I'm streaming data from a database and writing it to Parquet. The end consumer
has plenty of RAM, but the machine that does the conversion doesn't.
Regards,
Niklas
{quote}
Minimal example (I can't attach it as a file for some reason):
[https://gist.github.com/bivald/2ddbc853ce8da9a9a064d8b56a93fc95]
Now that I've put together a minimal example, I notice something I didn't
see/realize before: while the memory usage is increasing, it doesn't appear to
grow linearly with the amount of data written. This suggests (I guess) that it
isn't actually the written dataset being retained, but something else.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)