Uwe,

I'll try pinpointing things further with `columns=` and see whether I can reproduce what I find with data I can share.
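Roughly the sketch I have in mind (the file path below is just a placeholder for the real data, which I can't share yet, and `pa.total_allocated_bytes()` only tracks Arrow's own memory pool, not the process peak):

    import pyarrow as pa
    import pyarrow.parquet as pq

    path = 'data.parquet'  # placeholder; the real file came out of Spark

    pf = pq.ParquetFile(path)
    arrow_schema = pf.schema.to_arrow_schema()
    print(arrow_schema)  # the Arrow schema you asked about

    # Read one column at a time and check Arrow's allocations after each
    # read, to narrow the blow-up down to a specific column.
    for name in arrow_schema.names:
        table = pq.read_table(path, columns=[name])
        print(name, pa.total_allocated_bytes())
        del table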
Thanks for the pointer.

-Bryant

On Wed, Apr 25, 2018 at 2:10 PM Uwe L. Korn <uw...@xhochy.com> wrote:

> No, there is no need to pass any options on reading. Sometimes they are
> beneficial depending on what you want to achieve, but the defaults are OK,
> too.
>
> I'm not sure if you're able to post an example, but it would be nice if you
> could post the resulting Arrow schema from the table. It might be related
> to a specific type. A quick way to debug this on your side would also be to
> specify only a subset of columns to read using the `columns=` attribute on
> read_table. Maybe you can already pinpoint the memory problems to a
> specific column. Having these hints would make it easier for us to diagnose
> what the underlying problem is.
>
> Uwe
>
> On Wed, Apr 25, 2018, at 8:06 PM, Bryant Menn wrote:
> > Uwe,
> >
> > I am not. Should I be? I forgot to mention earlier that the Parquet file
> > came from Spark/PySpark.
> >
> > On Wed, Apr 25, 2018 at 1:32 PM Uwe L. Korn <uw...@xhochy.com> wrote:
> >
> > > Hello Bryant,
> > >
> > > Are you using any options on `pyarrow.parquet.read_table` or a possible
> > > `to_pandas` afterwards?
> > >
> > > Uwe
> > >
> > > On Wed, Apr 25, 2018, at 7:27 PM, Bryant Menn wrote:
> > > > I tried reading a Parquet file (<200MB, lots of text with snappy)
> > > > using read_table and saw the memory usage peak over 8GB before
> > > > settling back down to ~200MB. This surprised me, as I was expecting
> > > > to be able to handle a Parquet file of this size with much less RAM
> > > > (doing some processing with smaller VMs).
> > > >
> > > > I am not sure if this is expected, but I thought I might check with
> > > > everyone here and learn something new. Poking around, it seems to be
> > > > related to ParquetReader.read_all?
> > > >
> > > > Thanks in advance,
> > > > Bryant