Thanks. I will do the math to see if there is legitimately a memory leak, but the way we currently deal with low-cardinality string columns is pretty unfavorable for you:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc#L408

There are some immediate recourses we should investigate (rough sketches below):

* Pass values through a hash table to deduplicate -- we do this in pandas's CSV reader, and it is critical for avoiding runaway memory use from creating many copies of the same string
* Decode directly from Parquet to categorical (there are several JIRAs open about that already)

This is too much work to do for 0.11, but hopefully it can get done in the next month or so for the 0.12 release. We just merged the Arrow and Parquet C++ codebases together in part to make work like this easier for us to do.
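To illustrate the first idea: when a low-cardinality string column is converted to Python objects, routing each value through a hash table means every identical string maps to one shared object instead of each row getting its own copy. Here is a minimal Python sketch of that deduplication step (the helper and values below are made up for illustration; this is not the actual Arrow conversion code):

```python
# Minimal sketch of hash-table deduplication for a low-cardinality string
# column: identical values are mapped onto one shared Python object, so
# N rows with K distinct strings keep K string objects alive instead of N.
def deduplicate_strings(raw_values):
    interned = {}  # hash table: value -> first object seen with that value
    return [interned.setdefault(v, v) for v in raw_values]

# Example: a million rows, only three distinct values. Each f-string below
# produces a fresh object, mimicking strings decoded row by row from a file.
raw = [f"code_{i % 3}" for i in range(1_000_000)]
deduped = deduplicate_strings(raw)

assert deduped[0] is deduped[3]  # one shared object per distinct value
```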
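On the second idea, until decoding straight to categorical is implemented, a user-side workaround is to convert the offending column to a pandas Categorical right after reading, which collapses the repeated strings into a small dictionary plus integer codes. A hedged sketch using standard pandas APIs (the file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical file and column names; pd.read_parquet uses pyarrow here.
df = pd.read_parquet("events.parquet")

# Converting a low-cardinality string column to Categorical replaces the
# per-row Python string objects with integer codes plus a small dictionary
# of the distinct values.
df["country"] = df["country"].astype("category")
```

Note that this only reduces steady-state memory: the conversion still materializes the per-row string objects first, which is why decoding directly from Parquet to categorical is the real fix.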
