[
https://issues.apache.org/jira/browse/ARROW-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218975#comment-17218975
]
Niklas B commented on ARROW-10052:
----------------------------------
We can close it soon, is it okay if I have it open for a few more days while
debugging? I'm running with row_group_size set to 10000 and I'm seeing the
writer use about 3GB of memory (for a 3GB Parquet file, 9000 columns x 300 000
rows).
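For context, the write loop I'm measuring has roughly the following shape. This
is a sketch rather than the exact code from the gist: make_chunk() and the
psutil-based RSS logging are stand-ins added here for illustration.

# Sketch of the instrumented write loop (make_chunk() and the column/row
# counts are placeholders; the real job has ~9000 columns x 300 000 rows).
import psutil
import pyarrow as pa
import pyarrow.parquet as pq

proc = psutil.Process()

def make_chunk(n_rows=10_000, n_cols=100):
    # Stand-in for the real data source.
    return pa.table({f"c{i}": list(range(n_rows)) for i in range(n_cols)})

schema = make_chunk(n_rows=1).schema

with pq.ParquetWriter("out.parquet", schema, compression="snappy") as writer:
    for step in range(30):
        # row_group_size caps the number of rows per row group written from
        # this table; here each chunk becomes exactly one 10000-row group.
        writer.write_table(make_chunk(), row_group_size=10_000)
        rss_mib = proc.memory_info().rss / 2**20
        print(f"step={step} rss={rss_mib:.0f} MiB")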
> [Python] Incrementally using ParquetWriter keeps data in memory (eventually
> running out of RAM for large datasets)
> ------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-10052
> URL: https://issues.apache.org/jira/browse/ARROW-10052
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 1.0.1
> Reporter: Niklas B
> Priority: Minor
>
> This ticket refers to the discussion between me and [~emkornfield] on the
> mailing list: "Incrementally using ParquetWriter without keeping entire
> dataset in memory (larger than memory parquet files)" (not yet available on
> the mail archives)
> Original post:
> {quote}Hi,
> I'm trying to write a large Parquet file to disk (larger than memory)
> using PyArrow's ParquetWriter and write_table, but even though the file is
> written incrementally to disk, it still appears to keep the entire dataset
> in memory (eventually getting OOM killed). Basically what I am trying to do
> is:
import pyarrow as pa
import pyarrow.parquet as pq

with pq.ParquetWriter(
    output_file,
    arrow_schema,
    compression='snappy',
    allow_truncated_timestamps=True,
    version='2.0',            # Highest available format version
    data_page_version='2.0',  # Highest available data page version
) as writer:
    for rows_dataframe in function_that_yields_data():
        writer.write_table(
            pa.Table.from_pydict(
                rows_dataframe,
                arrow_schema
            )
        )
> where I have a function that yields data, which I then write in chunks using
> write_table.
> Is it possible to force the ParquetWriter to not keep the entire dataset in
> memory, or is it simply not possible for good reasons?
> I’m streaming data from a database and writing it to Parquet. The
> end-consumer has plenty of RAM, but the machine that does the conversion
> doesn’t.
> Regards,
> Niklas
> {quote}
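To make the quoted pattern concrete: function_that_yields_data isn't shown
above, so here is a hypothetical sketch of such a generator, assuming a DB-API
style source (the sqlite3 connection, table name and batch size are stand-ins,
not the real database):

# Hypothetical sketch of function_that_yields_data: stream rows from a
# database cursor in fixed-size batches and yield them as column dicts that
# pa.Table.from_pydict() accepts.
import sqlite3

def function_that_yields_data(db_path="data.db", batch_rows=10_000):
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute("SELECT id, value FROM measurements")
        columns = [d[0] for d in cur.description]
        while True:
            rows = cur.fetchmany(batch_rows)
            if not rows:
                break
            # Transpose row tuples into {column_name: [values]} for from_pydict.
            yield dict(zip(columns, map(list, zip(*rows))))
    finally:
        conn.close()

Each yielded dict becomes one write_table call, so batch_rows effectively
controls how much data is materialized per write.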
> Minimal example (I can't attach it as a file for some reason):
> [https://gist.github.com/bivald/2ddbc853ce8da9a9a064d8b56a93fc95]
> Looking at it now that I've made a minimal example, I see something I didn't
> see/realize before: while the memory usage is increasing, it doesn't appear
> to be linear with the size of the file written. This suggests (I guess) that
> it isn't actually storing the written dataset, but something else.
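One way to narrow down what that "something else" might be (a sketch of what I
plan to try next, not code from the gist): after each write_table call, log
process RSS next to Arrow's memory-pool counters and the bytes already on disk.
If RSS climbs while the pool stays flat, the growth is outside Arrow's tracked
allocations; if both climb together, it's memory Arrow itself is holding on to.

# Sketch: helper to call after each writer.write_table(...) to compare
# process RSS, Arrow's memory-pool counters and the current file size on disk.
import os
import psutil
import pyarrow as pa

_proc = psutil.Process()

def log_memory(step, output_file):
    pool = pa.default_memory_pool()
    rss = _proc.memory_info().rss
    on_disk = os.path.getsize(output_file) if os.path.exists(output_file) else 0
    print(
        f"step={step} "
        f"rss={rss / 2**20:.0f} MiB "
        f"pool={pool.bytes_allocated() / 2**20:.0f} MiB "
        f"pool_peak={pool.max_memory() / 2**20:.0f} MiB "
        # On-disk size lags a bit because of OS/file buffering.
        f"file={on_disk / 2**20:.0f} MiB"
    )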
--
This message was sent by Atlassian Jira
(v8.3.4#803005)