> * There seems to be some interaction between
> `parquet::internal::RecordReader` and `arrow::PoolBuffer` or
> `arrow::DefaultMemoryPool`. `RecordReader` requests an allocation to hold
> the entire column in memory without compression/encoding even though Arrow
> supports dictionary encoding (and the column is dictionary encoded).
This is quite tricky. The Parquet format allows for dictionary encoding as a
data compression strategy, but it's not the same thing as Arrow's dictionary
encoding, where a common dictionary is shared amongst one or more record
batches. In Parquet, the dictionary will likely change from row group to row
group. So, in general, the only reliably correct way to decode the Parquet
file is to decode the dictionary-encoded values into dense / materialized
form.

We have some JIRAs open about passing through dictionary indices to Arrow
without decoding (decoding to dense form can cause memory use problems when
you have a lot of strings). This is doable, but it's quite a lot of work
because we must account for the case where the dictionary changes when
reading from the next row group. We also cannot determine the in-memory C++
Arrow schema from the Parquet metadata alone (since we need to see the data
to determine the dictionary).

- Wes
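Until index pass-through lands, a practical workaround is to bound how much
gets materialized at once. Below is a minimal sketch combining Uwe's
`columns=` suggestion (quoted further down) with per-row-group reads; the
file path and column names are placeholders, and `pa.total_allocated_bytes()`
is used only to watch the default memory pool:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder path and column names -- substitute your own.
PATH = "data.parquet"
COLUMNS = ["id", "text"]

pf = pq.ParquetFile(PATH)

# Read a column subset one row group at a time instead of pq.read_table(PATH).
# The dense (decoded) buffers then only ever cover a single row group; each
# row group carries its own dictionary pages, which is why the reader
# materializes values instead of handing Arrow one shared dictionary.
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i, columns=COLUMNS)
    # ... process `chunk` here (e.g. chunk.to_pandas()) and let it go ...
    print(i, chunk.num_rows, pa.total_allocated_bytes())
```

This keeps the peak allocation proportional to one row group of the selected
columns rather than to the whole file.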
On Tue, May 29, 2018 at 9:01 AM, Bryant Menn <bryant.m...@gmail.com> wrote:
> Following up on what I have found with Uwe's advice and poking around the
> code base.
>
> * `columns=` helped, but it was because it forced me to realize I did not
> need all of the columns at once every time. No particular column was
> significantly worse in memory usage.
> * There seems to be some interaction between
> `parquet::internal::RecordReader` and `arrow::PoolBuffer` or
> `arrow::DefaultMemoryPool`. `RecordReader` requests an allocation to hold
> the entire column in memory without compression/encoding even though Arrow
> supports dictionary encoding (and the column is dictionary encoded).
>
> I imagine `RecordReader` requests enough memory to hold the data without
> encoding/compression for good reason (perhaps more robust assumptions about
> the underlying memory pool?), but is there a way to request only the memory
> required for dictionary encoding when it is an option?
>
> My (incomplete) understanding comes from the surrounding lines here:
> https://github.com/apache/parquet-cpp/blob/c405bf36506ec584e8009a6d53349277e600467d/src/parquet/arrow/record_reader.cc#L232
>
> On Wed, Apr 25, 2018 at 2:23 PM Bryant Menn <bryant.m...@gmail.com> wrote:
>
>> Uwe,
>>
>> I'll try pinpointing things further with `columns=` and try to reproduce
>> what I find with data I can share.
>>
>> Thanks for the pointer.
>>
>> -Bryant
>>
>> On Wed, Apr 25, 2018 at 2:10 PM Uwe L. Korn <uw...@xhochy.com> wrote:
>>
>>> No, there is no need to pass any options on reading. Sometimes they are
>>> beneficial depending on what you want to achieve, but the defaults are
>>> ok, too.
>>>
>>> I'm not sure if you're able to post an example, but it would be nice if
>>> you could post the resulting Arrow schema from the table. It might be
>>> related to a specific type. A quick way to debug this on your side would
>>> also be to specify only a subset of columns to read using the `columns=`
>>> attribute on read_table. Maybe you can already pinpoint the memory
>>> problems to a specific column. Having these hints would make it easier
>>> for us to diagnose what the underlying problem is.
>>>
>>> Uwe
>>>
>>> On Wed, Apr 25, 2018, at 8:06 PM, Bryant Menn wrote:
>>> > Uwe,
>>> >
>>> > I am not. Should I be? I forgot to mention earlier that the Parquet
>>> > file came from Spark/PySpark.
>>> >
>>> > On Wed, Apr 25, 2018 at 1:32 PM Uwe L. Korn <uw...@xhochy.com> wrote:
>>> >
>>> > > Hello Bryant,
>>> > >
>>> > > are you using any options on `pyarrow.parquet.read_table` or a
>>> > > possible `to_pandas` afterwards?
>>> > >
>>> > > Uwe
>>> > >
>>> > > On Wed, Apr 25, 2018, at 7:27 PM, Bryant Menn wrote:
>>> > > > I tried reading a Parquet file (<200MB, lots of text with snappy)
>>> > > > using read_table and saw the memory usage peak over 8GB before
>>> > > > settling back down to ~200MB. This surprised me as I was expecting
>>> > > > to be able to handle a Parquet file of this size with much less RAM
>>> > > > (doing some processing with smaller VMs).
>>> > > >
>>> > > > I am not sure if this is expected, but I thought I might check with
>>> > > > everyone here and learn something new. Poking around, it seems to
>>> > > > be related to ParquetReader.read_all?
>>> > > >
>>> > > > Thanks in advance,
>>> > > > Bryant
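To make the dense-versus-dictionary point above concrete, here is a toy
sketch (the value and row count are made up) of the size gap between a fully
materialized string column and Arrow's own dictionary-encoded form:

```python
import pyarrow as pa

# Toy data: a low-cardinality string column, similar in spirit to the
# "lots of text with snappy" column described above (values are made up).
values = ["some moderately long repeated text value"] * 1_000_000

dense = pa.array(values)             # every string fully materialized
encoded = dense.dictionary_encode()  # integer indices + one shared dictionary

print(dense.nbytes)    # ~ offsets plus every copy of the string data
print(encoded.nbytes)  # ~ indices plus a single dictionary entry
```

The dense form is roughly what the reader has to allocate today, which is
why the in-memory peak can dwarf the on-disk (dictionary-encoded,
snappy-compressed) file size.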