[
https://issues.apache.org/jira/browse/ARROW-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201440#comment-17201440
]
Antoine Pitrou commented on ARROW-10052:
----------------------------------------
I'm not sure there's anything surprising. Running this thing a bit (in debug
mode!), I see that RSS usage grows by 500-1000 bytes per column chunk
(that is, per column in each row group).
This seems to be simply the Parquet file metadata accumulating before it can be
written at the end (when the ParquetWriter is closed).
{{format::FileMetadata}} has a vector of {{format::RowGroup}} (one per row
group). {{format::RowGroup}} has a vector of {{format::Column}} (one per
column). Each {{format::Column}} holds non-trivial information: the file name
and the column metadata (itself potentially large).
So, basically, you should write only large row groups to Parquet files. Writing
100 rows at a time makes the Parquet format completely inadequate; write at
least 10000 or 100000 rows at a time, IMHO.
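The advice above can be sketched as a small buffering helper: accumulate the
small chunks and only emit a row group once enough rows are collected, so the
file metadata grows with the number of row groups rather than the number of
small writes. This is a generic sketch, not PyArrow API: the helper name, the
{{write_row_group}} callback, and the 100000-row threshold are all illustrative.
In practice {{write_row_group}} would concatenate the buffered chunks into one
{{pa.Table}} and pass it to {{ParquetWriter.write_table}}.

```python
def write_in_large_row_groups(chunks, write_row_group, min_rows=100_000):
    """Accumulate small chunks (lists of rows) and flush them as large batches.

    Each call to ``write_row_group`` corresponds to one Parquet row group,
    so fewer, larger calls mean less accumulated file metadata.
    """
    buffer = []
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) >= min_rows:
            write_row_group(buffer)
            buffer = []
    if buffer:
        # Flush whatever remains as a final (possibly smaller) row group.
        write_row_group(buffer)
```

For example, feeding 25 chunks of 100 rows each with a 1000-row threshold
produces three row groups (1000, 1000, 500 rows) instead of 25.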
> [Python] Incrementally using ParquetWriter keeps data in memory (eventually
> running out of RAM for large datasets)
> ------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-10052
> URL: https://issues.apache.org/jira/browse/ARROW-10052
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 1.0.1
> Reporter: Niklas B
> Priority: Minor
>
> This ticket refers to the discussion between me and [~emkornfield] on the
> MailingList: "Incrementally using ParquetWriter without keeping entire
> dataset in memory (large than memory parquet files)" (not yet available on
> the mail archives)
> Original post:
> {quote}Hi,
> I'm trying to write a large parquet file onto disk (larger than memory)
> using PyArrow's ParquetWriter and write_table, but even though the file is
> written incrementally to disk it still appears to keep the entire dataset in
> memory (eventually getting OOM killed). Basically what I am trying to do is:
> with pq.ParquetWriter(
>     output_file,
>     arrow_schema,
>     compression='snappy',
>     allow_truncated_timestamps=True,
>     version='2.0',  # Highest available format version
>     data_page_version='2.0',  # Highest available data page version
> ) as writer:
>     for rows_dataframe in function_that_yields_data():
>         writer.write_table(
>             pa.Table.from_pydict(
>                 rows_dataframe,
>                 arrow_schema
>             )
>         )
> Where I have a function that yields data and then writes it in chunks using
> write_table.
> Is it possible to force the ParquetWriter to not keep the entire dataset in
> memory, or is it simply not possible for good reasons?
> I’m streaming data from a database and writing it to Parquet. The
> end-consumer has plenty of RAM, but the machine that does the conversion
> doesn’t.
> Regards,
> Niklas
> {quote}
> Minimum example (I can't attach as a file for some reason)
> [https://gist.github.com/bivald/2ddbc853ce8da9a9a064d8b56a93fc95]
> Looking at it now that I've made a minimal example, I see something I didn't
> see/realize before: while the memory usage is increasing, it doesn't appear
> to be linear in the amount of data written. This indicates (I guess) that it
> isn't actually storing the written dataset, but something else.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)