Hi,

I’ve tried both with little success. I made a JIRA: 
https://issues.apache.org/jira/browse/ARROW-10052 

Now that I've put together a minimal example, I see something I didn't
realize before: while memory usage keeps increasing, it doesn't appear to
grow linearly with the amount of data written to the file. This possibly
indicates (I guess) that it isn't actually holding the written dataset, but
something else.

I’ll keep digging; sorry this isn't as clear as I would have liked. In the
real world we see writing a 3 GB Parquet file incrementally exhaust 10 GB of
memory.
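
For reference, a stripped-down sketch of the kind of loop I'm testing (the
column names and sizes here are made up for illustration; the real schema
differs):

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

with pq.ParquetWriter("/tmp/out.parquet", schema, compression="snappy") as writer:
    for i in range(100):
        batch = {
            "id": list(range(i * 100_000, (i + 1) * 100_000)),
            "value": [float(x) for x in range(100_000)],
        }
        writer.write_table(pa.Table.from_pydict(batch, schema=schema))
        # Process memory keeps climbing here, though not in proportion to
        # the bytes written so far; Arrow's own pool usage can be checked too:
        print(i, pa.total_allocated_bytes())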

Regards,
Niklas

> On 20 Sep 2020, at 06:07, Micah Kornfield <emkornfi...@gmail.com> wrote:
> 
> Hi Niklas,
> Two suggestions:
> * Try adjusting row_group_size on write_table [1] to a smaller-than-default
> value.  If I read the code correctly the default is currently 64 million rows
> [2], which seems potentially too high (I'll open a JIRA about this).
> * If this is on Linux/macOS, try setting the jemalloc decay, which can return
> memory to the OS more quickly [3]. A rough sketch of both is below.
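> 
> Something like this might be a starting point (untested sketch; the
> row_group_size of 10_000 and the decay of 0 ms are only there to show the
> knobs, not tuned recommendations):
> 
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> # Return unused jemalloc memory to the OS as eagerly as possible.
> pa.jemalloc_set_decay_ms(0)
> 
> with pq.ParquetWriter(output_file, arrow_schema) as writer:
>     for rows_dataframe in function_that_yields_data():
>         table = pa.Table.from_pydict(rows_dataframe, arrow_schema)
>         # Smaller row groups are completed and flushed to disk sooner.
>         writer.write_table(table, row_group_size=10_000)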
> 
> Just to confirm: is it a local disk (not a blob store) that you are
> writing to?
> 
> If, after trying these two items, you can still produce a minimal example
> that seems to hold onto all the memory, please open a JIRA, as there could
> be a bug or some unexpected buffering happening.
> 
> Thanks,
> Micah
> 
> [1]
> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html#pyarrow.parquet.ParquetWriter.write_table
> [2]
> https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/memory.pxi#L156
> [3]
> https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/memory.pxi#L156
> 
> On Tue, Sep 15, 2020 at 8:46 AM Niklas B <niklas.biv...@enplore.com> wrote:
> 
>> First of all: thank you so much for all the hard work on Arrow; it's an
>> awesome project.
>> 
>> Hi,
>> 
>> I'm trying to write a large Parquet file (larger than memory) to disk
>> using PyArrow's ParquetWriter and write_table, but even though the file is
>> written to disk incrementally it still appears to keep the entire dataset
>> in memory (eventually getting OOM killed). Basically, what I am trying to
>> do is:
>> 
>> import pyarrow as pa
>> import pyarrow.parquet as pq
>> 
>> with pq.ParquetWriter(
>>     output_file,
>>     arrow_schema,
>>     compression='snappy',
>>     allow_truncated_timestamps=True,
>>     version='2.0',  # Highest available format version
>>     data_page_version='2.0',  # Highest available data page version
>> ) as writer:
>>     for rows_dataframe in function_that_yields_data():
>>         writer.write_table(
>>             pa.Table.from_pydict(rows_dataframe, arrow_schema)
>>         )
>> 
>> Here I have a function that yields data, and I then write it out in chunks
>> using write_table.
>> 
>> Is it possible to force the ParquetWriter to not keep the entire dataset
>> in memory, or is it simply not possible for good reasons?
>> 
>> I’m streaming data from a database and writing it to Parquet. The
>> end-consumer has plenty of RAM, but the machine that does the conversion
>> doesn’t.
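>> 
>> For completeness, function_that_yields_data is roughly of this shape
>> (hypothetical sketch; the real database, driver and query are different,
>> and sqlite3 here is only a stand-in):
>> 
>> import sqlite3  # stand-in for the real database driver
>> 
>> def function_that_yields_data(batch_size=50_000):
>>     conn = sqlite3.connect("source.db")  # placeholder source
>>     cursor = conn.execute("SELECT id, value FROM rows")
>>     while True:
>>         batch = cursor.fetchmany(batch_size)
>>         if not batch:
>>             break
>>         # Convert row tuples into a column dict matching arrow_schema.
>>         yield {
>>             "id": [r[0] for r in batch],
>>             "value": [r[1] for r in batch],
>>         }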
>> 
>> Regards,
>> Niklas
>> 
>> PS: I’ve also created a stack overflow question, which I will update with
>> any answer I might get from the mailing list
>> 
>> https://stackoverflow.com/questions/63891231/pyarrow-incrementally-using-parquetwriter-without-keeping-entire-dataset-in-mem
