Uwe,

I'll try pinpointing things further with `columns=` and see whether I can reproduce what I find with data I can share.
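Roughly the sketch I have in mind (the file path below is just a placeholder for the real data, which I can't share yet, and `pa.total_allocated_bytes()` only tracks Arrow's own memory pool, not the process peak):

    import pyarrow as pa
    import pyarrow.parquet as pq

    path = 'data.parquet'  # placeholder; the real file came out of Spark

    pf = pq.ParquetFile(path)
    arrow_schema = pf.schema.to_arrow_schema()
    print(arrow_schema)  # the Arrow schema you asked about

    # Read one column at a time and check Arrow's allocations after each
    # read, to narrow the blow-up down to a specific column.
    for name in arrow_schema.names:
        table = pq.read_table(path, columns=[name])
        print(name, pa.total_allocated_bytes())
        del table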
Thanks for the pointer.

-Bryant

On Wed, Apr 25, 2018 at 2:10 PM Uwe L. Korn <uw...@xhochy.com> wrote:

> No, there is no need to pass any options on reading. Sometimes they are
> beneficial depending on what you want to achieve, but the defaults are OK,
> too.
>
> I'm not sure if you're able to post an example, but it would be nice if you
> could post the resulting Arrow schema from the table. It might be related
> to a specific type. A quick way to debug this on your side would also be to
> specify only a subset of columns to read using the `columns=` attribute on
> read_table. Maybe you can already pinpoint the memory problems to a
> specific column. Having these hints would make it easier for us to diagnose
> what the underlying problem is.
>
> Uwe
>
> On Wed, Apr 25, 2018, at 8:06 PM, Bryant Menn wrote:
> > Uwe,
> >
> > I am not. Should I be? I forgot to mention earlier that the Parquet file
> > came from Spark/PySpark.
> >
> > On Wed, Apr 25, 2018 at 1:32 PM Uwe L. Korn <uw...@xhochy.com> wrote:
> >
> > > Hello Bryant,
> > >
> > > Are you using any options on `pyarrow.parquet.read_table` or a possible
> > > `to_pandas` afterwards?
> > >
> > > Uwe
> > >
> > > On Wed, Apr 25, 2018, at 7:27 PM, Bryant Menn wrote:
> > > > I tried reading a Parquet file (<200MB, lots of text with snappy)
> > > > using read_table and saw the memory usage peak over 8GB before
> > > > settling back down to ~200MB. This surprised me, as I was expecting
> > > > to be able to handle a Parquet file of this size with much less RAM
> > > > (doing some processing with smaller VMs).
> > > >
> > > > I am not sure if this is expected, but I thought I might check with
> > > > everyone here and learn something new. Poking around, it seems to be
> > > > related to ParquetReader.read_all?
> > > >
> > > > Thanks in advance,
> > > > Bryant