[
https://issues.apache.org/jira/browse/ARROW-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218975#comment-17218975
]
Niklas B commented on ARROW-10052:
----------------------------------
We can close it soon, is it okay if I have it open for a few more days while
debugging? I'm running with row_group_size set to 10000 and I'm seeing the
writer use about 3GB of memory (for a 3GB Parquet file, 9000 columns x 300 000
rows).
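For context, the write loop I'm measuring has roughly the following shape. This
is a sketch rather than the exact code from the gist: make_chunk() and the
psutil-based RSS logging are stand-ins added here for illustration.

# Sketch of the instrumented write loop (make_chunk() and the column/row
# counts are placeholders; the real job has ~9000 columns x 300 000 rows).
import psutil
import pyarrow as pa
import pyarrow.parquet as pq

proc = psutil.Process()

def make_chunk(n_rows=10_000, n_cols=100):
    # Stand-in for the real data source.
    return pa.table({f"c{i}": list(range(n_rows)) for i in range(n_cols)})

schema = make_chunk(n_rows=1).schema

with pq.ParquetWriter("out.parquet", schema, compression="snappy") as writer:
    for step in range(30):
        # row_group_size caps the number of rows per row group written from
        # this table; here each chunk becomes exactly one 10000-row group.
        writer.write_table(make_chunk(), row_group_size=10_000)
        rss_mib = proc.memory_info().rss / 2**20
        print(f"step={step} rss={rss_mib:.0f} MiB")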
> [Python] Incrementally using ParquetWriter keeps data in memory (eventually
> running out of RAM for large datasets)
> ------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-10052
> URL: https://issues.apache.org/jira/browse/ARROW-10052
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 1.0.1
> Reporter: Niklas B
> Priority: Minor
>
> This ticket refers to the discussion between me and [~emkornfield] on the
> mailing list: "Incrementally using ParquetWriter without keeping entire
> dataset in memory (larger than memory parquet files)" (not yet available on
> the mail archives)
> Original post:
> {quote}Hi,
> I'm trying to write a large Parquet file to disk (larger than memory)
> using PyArrow's ParquetWriter and write_table, but even though the file is
> written incrementally to disk, it still appears to keep the entire dataset
> in memory (eventually getting OOM killed). Basically what I am trying to do
> is:
import pyarrow as pa
import pyarrow.parquet as pq

with pq.ParquetWriter(
    output_file,
    arrow_schema,
    compression='snappy',
    allow_truncated_timestamps=True,
    version='2.0',            # Highest available format version
    data_page_version='2.0',  # Highest available data page version
) as writer:
    for rows_dataframe in function_that_yields_data():
        writer.write_table(
            pa.Table.from_pydict(
                rows_dataframe,
                arrow_schema
            )
        )
> where I have a function that yields data, which I then write in chunks using
> write_table.
> Is it possible to force the ParquetWriter to not keep the entire dataset in
> memory, or is it simply not possible for good reasons?
> I’m streaming data from a database and writing it to Parquet. The
> end-consumer has plenty of RAM, but the machine that does the conversion
> doesn’t.
> Regards,
> Niklas
> {quote}
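To make the quoted pattern concrete: function_that_yields_data isn't shown
above, so here is a hypothetical sketch of such a generator, assuming a DB-API
style source (the sqlite3 connection, table name and batch size are stand-ins,
not the real database):

# Hypothetical sketch of function_that_yields_data: stream rows from a
# database cursor in fixed-size batches and yield them as column dicts that
# pa.Table.from_pydict() accepts.
import sqlite3

def function_that_yields_data(db_path="data.db", batch_rows=10_000):
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute("SELECT id, value FROM measurements")
        columns = [d[0] for d in cur.description]
        while True:
            rows = cur.fetchmany(batch_rows)
            if not rows:
                break
            # Transpose row tuples into {column_name: [values]} for from_pydict.
            yield dict(zip(columns, map(list, zip(*rows))))
    finally:
        conn.close()

Each yielded dict becomes one write_table call, so batch_rows effectively
controls how much data is materialized per write.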
> Minimal example (I can't attach it as a file for some reason):
> [https://gist.github.com/bivald/2ddbc853ce8da9a9a064d8b56a93fc95]
> Looking at it now that I've made a minimal example, I see something I didn't
> see/realize before: while the memory usage is increasing, it doesn't appear
> to be linear with the size of the file written. This suggests (I guess) that
> it isn't actually storing the written dataset, but something else.
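One way to narrow down what that "something else" might be (a sketch of what I
plan to try next, not code from the gist): after each write_table call, log
process RSS next to Arrow's memory-pool counters and the bytes already on disk.
If RSS climbs while the pool stays flat, the growth is outside Arrow's tracked
allocations; if both climb together, it's memory Arrow itself is holding on to.

# Sketch: helper to call after each writer.write_table(...) to compare
# process RSS, Arrow's memory-pool counters and the current file size on disk.
import os
import psutil
import pyarrow as pa

_proc = psutil.Process()

def log_memory(step, output_file):
    pool = pa.default_memory_pool()
    rss = _proc.memory_info().rss
    on_disk = os.path.getsize(output_file) if os.path.exists(output_file) else 0
    print(
        f"step={step} "
        f"rss={rss / 2**20:.0f} MiB "
        f"pool={pool.bytes_allocated() / 2**20:.0f} MiB "
        f"pool_peak={pool.max_memory() / 2**20:.0f} MiB "
        # On-disk size lags a bit because of OS/file buffering.
        f"file={on_disk / 2**20:.0f} MiB"
    )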
--
This message was sent by Atlassian Jira
(v8.3.4#803005)