First of all: Thank you so much for all the hard work on Arrow; it's an awesome
project.
Hi,
I'm trying to write a large Parquet file to disk (larger than memory) using
PyArrow's ParquetWriter and write_table, but even though the file is written
incrementally to disk, it still appears to keep the entire dataset in memory
(eventually getting OOM killed). Basically, what I am trying to do is:
import pyarrow as pa
import pyarrow.parquet as pq

with pq.ParquetWriter(
    output_file,
    arrow_schema,
    compression='snappy',
    allow_truncated_timestamps=True,
    version='2.0',            # highest available format version
    data_page_version='2.0',  # data page format V2
) as writer:
    for rows_dataframe in function_that_yields_data():
        writer.write_table(
            pa.Table.from_pydict(
                rows_dataframe,
                schema=arrow_schema,
            )
        )
Here I have a function that yields data, and I write it out in chunks using
write_table.
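For anyone who wants to reproduce the growth: a minimal sketch (not my exact
code; output_file, arrow_schema and function_that_yields_data are the same
placeholders as above) that polls PyArrow's default memory pool between chunks
looks like this:

import pyarrow as pa
import pyarrow.parquet as pq

with pq.ParquetWriter(output_file, arrow_schema, compression='snappy') as writer:
    for i, rows in enumerate(function_that_yields_data()):
        writer.write_table(pa.Table.from_pydict(rows, schema=arrow_schema))
        # Bytes currently held by PyArrow's default memory pool.
        # I would expect this to stay roughly flat per chunk,
        # but it keeps climbing until the process is OOM killed.
        print(i, pa.total_allocated_bytes())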
Is it possible to force the ParquetWriter not to keep the entire dataset in
memory, or is that simply not possible for good reasons?
I'm streaming data from a database and writing it to Parquet. The end consumer
has plenty of RAM, but the machine that does the conversion doesn't.
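For context, the function that yields data is essentially a generator over a
database cursor that hands back one column -> values mapping per batch. A rough
sketch of the idea (sqlite3, the table and column names, and the batch size are
just stand-ins for my real setup, not the actual code):

import sqlite3

def function_that_yields_data(batch_size=10_000):
    # sqlite3 stands in for the real database; any DB-API cursor works the same way
    conn = sqlite3.connect('source.db')
    cursor = conn.execute('SELECT id, ts, value FROM big_table')
    columns = [d[0] for d in cursor.description]
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        # turn the row tuples into a column -> list-of-values dict for Table.from_pydict
        yield {col: [row[i] for row in batch] for i, col in enumerate(columns)}
    conn.close()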
Regards,
Niklas
PS: I've also created a Stack Overflow question, which I will update with any
answer I might get from the mailing list:
https://stackoverflow.com/questions/63891231/pyarrow-incrementally-using-parquetwriter-without-keeping-entire-dataset-in-mem