Hi Bogdan,
I'm not an expert here but answers based on my understanding are below:

1) Is there something I'm missing in understanding the difference between
> serializing a dataframe directly using PyArrow and serializing a
> `pyarrow.Table`? Table shines when dataframes mostly consist of
> strings, which is frequent in our case.

Since you have mixed-type columns, the underlying data is ultimately pickled
when you serialize the dataframe directly with your code snippet:
https://github.com/apache/arrow/blob/27daba047533bf4e9e1cf4485cc9d4bc5c416ec9/python/pyarrow/pandas_compat.py#L515
I think this explains the performance difference.


2) Is `pyarrow.Table` a valid option for long-term storage of dataframes? It
> seems to "just work", but most people just stick to Parquet or something
> else.

The Arrow format, in general, is NOT currently recommended for long-term
storage.

3) Parquet/Feather are as good as pyarrow.Table in terms of memory /
> storage size, but noticeably slower (2-3x) on half-text dataframes.
> Could I be doing something wrong?

Parquet might be doing some sort of encoding on write.  I'm not sure why
Feather would be slower than pyarrow.Table (but I'm not an expert on Feather).

In the case of mixed-type dataframes, JSON still seems like an option according
> to our benchmarks.

If you want to use Arrow as the storage format, the right approach here would
probably be to use a Union column for the mixed-type columns.  This would
potentially slow down the write side, but make reads much quicker.

4) Feather seems REALLY close and similar to pyarrow.Table in all benchmarks.
> Is Feather using pyarrow.Table under the hood?

My understanding is that the formats are nearly identical (mostly just a
difference in metadata) so the performance similarity isn't surprising.

On Wed, Jun 12, 2019 at 9:12 AM Bogdan Klichuk <klich...@gmail.com> wrote:

> Trying to come up with a solution for quick Pandas dataframe serialization
> and long-term storage. The dataframe content is tabular but provided by the
> user, so it can be arbitrary: it might contain both completely text columns
> and completely numeric/boolean columns.
>
> ## Main goals are:
>
> * Serialize dataframe as quickly as possible in order to dump it on disk.
>
> * Use a format that I'll be able to load from disk later back into a
> dataframe.
>
> * The smallest possible memory footprint during serialization, and a compact
> output file.
>
> I have run benchmarks comparing different serialization methods, including:
>
> * Parquet: `df.to_parquet()`
> * Feather: `df.to_feather()`
> * JSON: `df.to_json()`
> * CSV: `df.to_csv()`
> * PyArrow: `pyarrow.default_serialization_context().serialize(df)`
> * PyArrow.Table:
>
> `pyarrow.default_serialization_context().serialize(pyarrow.Table.from_pandas(df))`
>
> Speed of serialization and memory footprint during that are probably
> biggest factors (read: get rid of data, dump it to disk asap).
>
> Strangely, in our benchmarks serializing `pyarrow.Table` seems the most
> balanced and quite fast.
>
> ## Questions:
>
> 1) Is there something I'm missing in understanding the difference between
> serializing a dataframe directly using PyArrow and serializing a
> `pyarrow.Table`? Table shines when dataframes mostly consist of
> strings, which is frequent in our case.
>
> 2) Is `pyarrow.Table` a valid option for long-term storage of dataframes? It
> seems to "just work", but most people just stick to Parquet or something
> else.
>
> 3) Parquet/Feather are as good as pyarrow.Table in terms of memory /
> storage size, but noticeably slower (2-3x) on half-text dataframes.
> Could I be doing something wrong?
>
> In the case of mixed-type dataframes, JSON still seems like an option
> according to our benchmarks.
>
> 4) Feather seems REALLY close and similar to pyarrow.Table in all benchmarks.
> Is Feather using pyarrow.Table under the hood?
>
> ----------------------------------------------------
> ## Benchmarks:
>
> https://docs.google.com/spreadsheets/d/1O81AEZrfGMTJAB-ozZ4YZmVzriKTDrm34u-gENgyiWo/edit#gid=0
>
> Since we have mixed-type columns, for the following methods we do
> astype(str) for all dtype=object columns before serialization:
>   * pyarrow.Table
>   * feather
>   * parquet
>
> It's also expensive but had to be done, since mixed-type columns are not
> supported for serialization in those formats. The time to perform this IS
> INCLUDED in the benchmarks.
>
> --
> Best wishes,
> Bogdan Klichuk
>
