[GitHub] [arrow] wesm commented on issue #2624: `pyarrow.parquet.DataDataset` memory leak when reading and exporting Pandas

GitHub Tue, 25 Sep 2018 07:15:50 -0700

Thanks. I will do the math to see whether there is legitimately a memory leak, 
but the way that we currently handle low-cardinality string columns is 
pretty unfavorable for you:


https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc#L408

There are some immediate remedies we should investigate:

* Pass values through a hash table to deduplicate -- we do this in pandas's CSV 
reader, and it is critical for avoiding the runaway memory use that comes from 
creating many copies of the same string

* Decode directly from Parquet to categorical (there are several JIRAs open 
about that already)
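
To make the two options above concrete, here is a minimal pure-Python sketch. It is not the Arrow C++ code path and not how pandas implements it; the function names `dedup_strings` and `dictionary_encode` are made up for illustration, and the sample data is synthetic. The first function shares one object per distinct string via a hash table; the second produces the dictionary-plus-integer-codes layout that a categorical column uses.

```python
def dedup_strings(values):
    """Return a list in which equal strings share a single Python object."""
    seen = {}  # hash table: string value -> first object seen with that value
    out = []
    for v in values:
        # setdefault stores v on first sight and returns the stored object afterwards
        out.append(seen.setdefault(v, v))
    return out


def dictionary_encode(values):
    """Split values into (unique values, integer codes), like a categorical."""
    index = {}       # value -> code
    dictionary = []  # code -> value, in order of first appearance
    codes = []
    for v in values:
        code = index.get(v)
        if code is None:
            code = len(dictionary)
            index[v] = code
            dictionary.append(v)
        codes.append(code)
    return dictionary, codes


# Synthetic low-cardinality column: decoding from bytes typically yields a
# fresh str object per row, which mimics a naive per-value conversion.
raw = [(b"category_%d" % (i % 3)).decode("utf-8") for i in range(9)]

deduped = dedup_strings(raw)
distinct_before = len({id(s) for s in raw})
distinct_after = len({id(s) for s in deduped})

dictionary, codes = dictionary_encode(raw)
```

With deduplication the column still holds one pointer per row, but only one string object per distinct value. The dictionary-encoded form goes further by replacing the per-row pointers with small integer codes.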

This is too much work to do for 0.11, but hopefully it can get done in the next 
month or so for the 0.12 release. We just merged the Arrow and Parquet C++ 
codebases together, in part to make work like this easier for us to do.

[ Full content available at: https://github.com/apache/arrow/issues/2624 ]
