Niklas B created ARROW-10052:
--------------------------------
Summary: [Python] Incrementally using ParquetWriter keeps data in
memory (eventually running out of RAM for large datasets)
Key: ARROW-10052
URL: https://issues.apache.org/jira/browse/ARROW-10052
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 1.0.1
Reporter: Niklas B
This ticket refers to the discussion between [~emkornfield] and me on the
mailing list: "Incrementally using ParquetWriter without keeping entire dataset
in memory (larger-than-memory parquet files)" (not yet available on the mail
archives)
Original post:
{quote}Hi,
I'm trying to write a large parquet file onto disk (larger than memory) using
PyArrow's ParquetWriter and write_table, but even though the file is written
incrementally to disk, it still appears to keep the entire dataset in memory
(eventually getting OOM killed). Basically, what I am trying to do is:
with pq.ParquetWriter(
    output_file,
    arrow_schema,
    compression='snappy',
    allow_truncated_timestamps=True,
    version='2.0',  # Highest available format version
    data_page_version='2.0',  # Highest available data page version
) as writer:
    for rows_dataframe in function_that_yields_data():
        writer.write_table(
            pa.Table.from_pydict(
                rows_dataframe,
                arrow_schema
            )
        )
That is, I have a function that yields data, which I then write in chunks
using write_table.
Is it possible to force the ParquetWriter to not keep the entire dataset in
memory, or is it simply not possible for good reasons?
I'm streaming data from a database and writing it to Parquet. The end consumer
has plenty of RAM, but the machine that does the conversion doesn't.
Regards,
Niklas
{quote}
Minimal example (I can't attach it as a file for some reason):
[https://gist.github.com/bivald/2ddbc853ce8da9a9a064d8b56a93fc95]
Now that I've put together a minimal example, I notice something I didn't
see/realize before: while the memory usage is increasing, it doesn't appear to
grow linearly with the amount of data written. This suggests (I guess) that it
isn't actually the written dataset being retained, but something else.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)