Re: Using pyarrow.Table for long-term storage of pandas DataFrames

Wes McKinney Sun, 16 Jun 2019 05:41:33 -0700

hi Micah,

On Sun, Jun 16, 2019 at 12:16 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Bogdan,
> I'm not an expert here but answers based on my understanding are below:
>
> > 1) Is there something I'm missing in understanding the difference between
> > serializing a dataframe directly using PyArrow and serializing
> > `pyarrow.Table`? Table shines when dataframes mostly consist of
> > strings, which is frequent in our case.
>
> Since you have mixed-type columns, the underlying data is ultimately pickled
> when serializing the dataframe with your code snippet:
> https://github.com/apache/arrow/blob/27daba047533bf4e9e1cf4485cc9d4bc5c416ec9/python/pyarrow/pandas_compat.py#L515
> I think this explains the performance difference.
>
>
> > 2) Is `pyarrow.Table` a valid option for long-term storage of dataframes? It
> > seems to "just work", but most people just stick to Parquet or something
> > else.
>
> The Arrow format, in general, is NOT currently recommended for long term
> storage.
>


I think after the 1.0.0 protocol version is released, we can begin to
recommend Arrow for cold storage of data (as in "you'll be able to
read these files in a year or two"), but design-wise it isn't intended
as a data warehousing format like Parquet or ORC.

> > 3) Parquet/Feather are as good as pyarrow.Table in terms of memory /
> > storage size, but considerably slower on half-text dataframes (2-3x slower).
> > Could I be doing something wrong?
>
> Parquet might be trying to do some sort of encoding.  I'm not sure why
> Feather would be slower than pyarrow.Table (but I'm not an expert in Feather).
>
> > In case of mixed-type dataframes JSON still seems like an option according
> > to our benchmarks.
>
> If you wanted to use Arrow as a format, the right approach here would
> probably be to make a new Union column for mixed-type columns.  This would
> potentially slow down the write side, but make reading much quicker.
>
> > 4) Feather seems to be REALLY close and similar to pyarrow.Table in all
> > benchmarks. Is Feather using pyarrow.Table under the hood?
>
> My understanding is that the formats are nearly identical (mostly just a
> difference in metadata) so the performance similarity isn't surprising.
>
> On Wed, Jun 12, 2019 at 9:12 AM Bogdan Klichuk <klich...@gmail.com> wrote:
>
> > Trying to come up with a solution for quick Pandas dataframe serialization
> > and long-term storage. Dataframe content is tabular but provided by the
> > user and can be arbitrary, so it might contain both completely text columns
> > and completely numeric/boolean columns.
> >
> > ## Main goals are:
> >
> > * Serialize dataframe as quickly as possible in order to dump it on disk.
> >
> > * Use a format that I'll be able to load from disk back into a
> > dataframe later.
> >
> > * Keep the memory footprint of serialization low and the output file
> > compact.
> >
> > Have run benchmarks comparing different serialization methods, including:
> >
> > * Parquet: `df.to_parquet()`
> > * Feather: `df.to_feather()`
> > * JSON: `df.to_json()`
> > * CSV: `df.to_csv()`
> > * PyArrow: `pyarrow.default_serialization_context().serialize(df)`
> > * PyArrow.Table:
> >
> > `pyarrow.default_serialization_context().serialize(pyarrow.Table.from_pandas(df))`
> >
> > Speed of serialization and memory footprint during it are probably the
> > biggest factors (read: get rid of data, dump it to disk asap).
> >
> > Strangely, in our benchmarks serializing `pyarrow.Table` seems the most
> > balanced and quite fast.
> >
> > ## Questions:
> >
> > 1) Is there something I'm missing in understanding the difference between
> > serializing a dataframe directly using PyArrow and serializing
> > `pyarrow.Table`? Table shines when dataframes mostly consist of
> > strings, which is frequent in our case.
> >
> > 2) Is `pyarrow.Table` a valid option for long-term storage of dataframes? It
> > seems to "just work", but most people just stick to Parquet or something
> > else.
> >
> > 3) Parquet/Feather are as good as pyarrow.Table in terms of memory /
> > storage size, but considerably slower on half-text dataframes (2-3x slower).
> > Could I be doing something wrong?
> >
> > In case of mixed-type dataframes JSON still seems like an option according
> > to our benchmarks.
> >
> > 4) Feather seems to be REALLY close and similar to pyarrow.Table in all
> > benchmarks. Is Feather using pyarrow.Table under the hood?
> >
> > ----------------------------------------------------
> > ## Benchmarks:
> >
> > https://docs.google.com/spreadsheets/d/1O81AEZrfGMTJAB-ozZ4YZmVzriKTDrm34u-gENgyiWo/edit#gid=0
> >
> > Since we have mixed-type columns, for the following methods we do
> > astype(str) for all dtype=object columns before serialization:
> >   * pyarrow.Table
> >   * feather
> >   * parquet
> >
> > This is also expensive, but it needs to be done since mixed-type columns
> > are not supported for serialization in the specified formats. The time to
> > perform this IS INCLUDED in the benchmarks.
> >
> > --
> > Best wishes,
> > Bogdan Klichuk
> >
