First of all: Thank you so much for all the hard work on Arrow; it's an awesome
project.
Hi,
I'm trying to write a large Parquet file to disk (larger than memory) using
PyArrow's ParquetWriter and write_table, but even though the file is written
incrementally to disk, it still appears to keep the entire dataset in memory
(eventually getting OOM killed). Basically, what I am trying to do is:
import pyarrow as pa
import pyarrow.parquet as pq

with pq.ParquetWriter(
    output_file,
    arrow_schema,
    compression='snappy',
    allow_truncated_timestamps=True,
    version='2.0',            # highest available format version
    data_page_version='2.0',  # data page format V2
) as writer:
    for rows_dataframe in function_that_yields_data():
        writer.write_table(
            pa.Table.from_pydict(
                rows_dataframe,
                schema=arrow_schema,
            )
        )
Here I have a function that yields data, and I write it out in chunks using
write_table.
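For anyone who wants to reproduce the growth: a minimal sketch (not my exact
code; output_file, arrow_schema and function_that_yields_data are the same
placeholders as above) that polls PyArrow's default memory pool between chunks
looks like this:

import pyarrow as pa
import pyarrow.parquet as pq

with pq.ParquetWriter(output_file, arrow_schema, compression='snappy') as writer:
    for i, rows in enumerate(function_that_yields_data()):
        writer.write_table(pa.Table.from_pydict(rows, schema=arrow_schema))
        # Bytes currently held by PyArrow's default memory pool.
        # I would expect this to stay roughly flat per chunk,
        # but it keeps climbing until the process is OOM killed.
        print(i, pa.total_allocated_bytes())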
Is it possible to force the ParquetWriter not to keep the entire dataset in
memory, or is that simply not possible for good reasons?
I'm streaming data from a database and writing it to Parquet. The end consumer
has plenty of RAM, but the machine that does the conversion doesn't.
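For context, the function that yields data is essentially a generator over a
database cursor that hands back one column -> values mapping per batch. A rough
sketch of the idea (sqlite3, the table and column names, and the batch size are
just stand-ins for my real setup, not the actual code):

import sqlite3

def function_that_yields_data(batch_size=10_000):
    # sqlite3 stands in for the real database; any DB-API cursor works the same way
    conn = sqlite3.connect('source.db')
    cursor = conn.execute('SELECT id, ts, value FROM big_table')
    columns = [d[0] for d in cursor.description]
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        # turn the row tuples into a column -> list-of-values dict for Table.from_pydict
        yield {col: [row[i] for row in batch] for i, col in enumerate(columns)}
    conn.close()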
Regards,
Niklas
PS: I've also created a Stack Overflow question, which I will update with any
answer I might get from the mailing list:
https://stackoverflow.com/questions/63891231/pyarrow-incrementally-using-parquetwriter-without-keeping-entire-dataset-in-mem