[
https://issues.apache.org/jira/browse/ARROW-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201440#comment-17201440
]
Antoine Pitrou commented on ARROW-10052:
----------------------------------------
I'm not sure there's anything surprising. Running this thing a bit (in debug
mode!), I see that RSS usage grows by 500-1000 bytes per column chunk
(that is, per column in each row group).
This seems to be simply the Parquet file metadata accumulating before it can be
written at the end (when the ParquetWriter is closed).
{{format::FileMetadata}} has a vector of {{format::RowGroup}} (one per row
group). {{format::RowGroup}} has a vector of {{format::Column}} (one per
column). Each {{format::Column}} holds non-trivial information: the file name
and the column metadata (itself potentially large).
So, basically, you should write only large row groups to Parquet files. Writing
100 rows at a time makes the Parquet format completely inadequate; write at
least 10000 or 100000 rows at a time, IMHO.
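The advice above can be sketched as a small buffering helper: accumulate the
small chunks and only emit a row group once enough rows are collected, so the
file metadata grows with the number of row groups rather than the number of
small writes. This is a generic sketch, not PyArrow API: the helper name, the
{{write_row_group}} callback, and the 100000-row threshold are all illustrative.
In practice {{write_row_group}} would concatenate the buffered chunks into one
{{pa.Table}} and pass it to {{ParquetWriter.write_table}}.

```python
def write_in_large_row_groups(chunks, write_row_group, min_rows=100_000):
    """Accumulate small chunks (lists of rows) and flush them as large batches.

    Each call to ``write_row_group`` corresponds to one Parquet row group,
    so fewer, larger calls mean less accumulated file metadata.
    """
    buffer = []
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) >= min_rows:
            write_row_group(buffer)
            buffer = []
    if buffer:
        # Flush whatever remains as a final (possibly smaller) row group.
        write_row_group(buffer)
```

For example, feeding 25 chunks of 100 rows each with a 1000-row threshold
produces three row groups (1000, 1000, 500 rows) instead of 25.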
> [Python] Incrementally using ParquetWriter keeps data in memory (eventually
> running out of RAM for large datasets)
> ------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-10052
> URL: https://issues.apache.org/jira/browse/ARROW-10052
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 1.0.1
> Reporter: Niklas B
> Priority: Minor
>
> This ticket refers to the discussion between me and [~emkornfield] on the
> MailingList: "Incrementally using ParquetWriter without keeping entire
> dataset in memory (large than memory parquet files)" (not yet available on
> the mail archives)
> Original post:
> {quote}Hi,
> I'm trying to write a large parquet file onto disk (larger than memory)
> using PyArrow's ParquetWriter and write_table, but even though the file is
> written incrementally to disk it still appears to keep the entire dataset in
> memory (eventually getting OOM killed). Basically what I am trying to do is:
> with pq.ParquetWriter(
>     output_file,
>     arrow_schema,
>     compression='snappy',
>     allow_truncated_timestamps=True,
>     version='2.0',  # Highest available format version
>     data_page_version='2.0',  # Highest available data page version
> ) as writer:
>     for rows_dataframe in function_that_yields_data():
>         writer.write_table(
>             pa.Table.from_pydict(
>                 rows_dataframe,
>                 arrow_schema
>             )
>         )
> Where I have a function that yields data and then writes it in chunks using
> write_table.
> Is it possible to force the ParquetWriter to not keep the entire dataset in
> memory, or is it simply not possible for good reasons?
> I’m streaming data from a database and writing it to Parquet. The
> end-consumer has plenty of RAM, but the machine that does the conversion
> doesn’t.
> Regards,
> Niklas
> {quote}
> Minimum example (I can't attach as a file for some reason)
> [https://gist.github.com/bivald/2ddbc853ce8da9a9a064d8b56a93fc95]
> Looking at it now that I've made a minimal example, I see something I didn't
> see/realize before: while the memory usage is increasing, it doesn't appear
> to be linear in the amount of data written. This indicates (I guess) that it
> isn't actually storing the written dataset, but something else.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)