Trying to come up with a solution for quick Pandas dataframe serialization
and long-term storage. The dataframe content is tabular, but it is provided
by users, so it can be arbitrary: some columns may be entirely text and
others entirely numeric/boolean.

## Main goals:

* Serialize the dataframe as quickly as possible in order to dump it to disk.

* Use a format that I'll be able to load from disk later, back into a dataframe.

* Keep the memory footprint of serialization low and the output file compact.

I have run benchmarks comparing different serialization methods, including:

* Parquet: `df.to_parquet()`
* Feather: `df.to_feather()`
* JSON: `df.to_json()`
* CSV: `df.to_csv()`
* PyArrow: `pyarrow.default_serialization_context().serialize(df)`
* PyArrow.Table:
`pyarrow.default_serialization_context().serialize(pyarrow.Table.from_pandas(df))`
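For context, the harness is roughly of this shape (a minimal sketch, not our actual benchmark code; the toy dataframe and the `bench` helper are illustrative):

```python
import io
import time

import pandas as pd

# Toy mixed-content frame standing in for user-provided data
df = pd.DataFrame({
    "text": ["alpha", "beta", "gamma"] * 10_000,
    "num": range(30_000),
    "flag": [True, False] * 15_000,
})

def bench(name, fn, buf_factory):
    """Time one serialization call and report the output size."""
    buf = buf_factory()
    start = time.perf_counter()
    fn(buf)
    elapsed = time.perf_counter() - start
    size = buf.tell()
    print(f"{name:8s} {elapsed:8.4f}s  {size:>10d} bytes")
    return elapsed, size

results = {
    "csv": bench("csv", df.to_csv, io.StringIO),
    "json": bench("json", df.to_json, io.StringIO),
}
try:  # Parquet/Feather require pyarrow (or fastparquet for Parquet)
    results["parquet"] = bench("parquet", df.to_parquet, io.BytesIO)
    results["feather"] = bench("feather", df.to_feather, io.BytesIO)
except ImportError:
    pass
```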

Serialization speed and memory footprint during serialization are probably
the biggest factors (read: get rid of the data, dump it to disk ASAP).

Strangely, in our benchmarks serializing a `pyarrow.Table` seems the most
balanced option, and it is quite fast.

## Questions:

1) Is there something I'm missing about the difference between serializing
a dataframe directly with PyArrow and serializing a `pyarrow.Table`? The
Table shines when the dataframe consists mostly of strings, which is
frequent in our case.

2) Is `pyarrow.Table` a valid option for long-term storage of dataframes?
It seems to "just work", but most people seem to stick with Parquet or
something else.

3) Parquet/Feather are as good as `pyarrow.Table` in terms of memory /
storage size, but they are quite a bit slower (2-3x) on half-text
dataframes. Could I be doing something wrong?

For mixed-type dataframes, JSON still seems like an option according to
our benchmarks.
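One reason JSON stays in the running: it keeps per-cell types inside a mixed-dtype column, so no `astype(str)` pre-pass is needed. A small illustrative round trip (toy data, not our benchmark):

```python
import io

import pandas as pd

# "mixed" is a dtype=object column with an int, a string, and a float
df = pd.DataFrame({"mixed": [1, "two", 3.5], "num": [10, 20, 30]})

# Serialize to JSON and read it back; the mixed column survives as-is
buf = io.StringIO(df.to_json())
back = pd.read_json(buf)
print(back["mixed"].tolist())
```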

4) Feather seems REALLY close to `pyarrow.Table` in all benchmarks. Does
Feather use `pyarrow.Table` under the hood?

----------------------------------------------------
## Benchmarks:
https://docs.google.com/spreadsheets/d/1O81AEZrfGMTJAB-ozZ4YZmVzriKTDrm34u-gENgyiWo/edit#gid=0

Since we have mixed-type columns, for the following methods we apply
`astype(str)` to all `dtype=object` columns before serialization:
  * pyarrow.Table
  * feather
  * parquet

This conversion is also expensive, but it has to be done, since mixed-type
columns are not supported for serialization by those formats. The time to
perform it IS INCLUDED in the benchmarks.
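Concretely, the pre-conversion looks something like this (a minimal sketch with a toy frame; one caveat worth noting is that `astype(str)` also stringifies missing values, turning `None` into the literal string `"None"`):

```python
import pandas as pd

# Toy frame: "mixed" is dtype=object with mixed per-cell types
df = pd.DataFrame({
    "mixed": [1, "two", 3.0, None],
    "num": [1, 2, 3, 4],
})

# Cast every object column to str so the Arrow-based writers accept it;
# the cost of this step is included in the benchmark timings.
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].astype(str)

print(df["mixed"].tolist())  # every cell is now a string
```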

-- 
Best wishes,
Bogdan Klichuk
