Hello. Thanks for the reply!
On Sun, Jun 16, 2019 at 8:40 AM Wes McKinney <wesmck...@gmail.com> wrote:
> hi Micah,
>
> On Sun, Jun 16, 2019 at 12:16 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > Hi Bogdan,
> > I'm not an expert here, but answers based on my understanding are below:
> >
> > > 1) Is there something I'm missing in understanding the difference
> > > between serializing a dataframe directly using PyArrow and serializing
> > > a `pyarrow.Table`? Table shines when dataframes mostly consist of
> > > strings, which is frequent in our case.
> >
> > Since you have mixed-type columns, the underlying data is ultimately
> > pickled when serializing the dataframe with your code snippet:
> > https://github.com/apache/arrow/blob/27daba047533bf4e9e1cf4485cc9d4bc5c416ec9/python/pyarrow/pandas_compat.py#L515
> > I think this explains the performance difference.

That totally explains it. I debugged it and, yes, it pickles the
dtype=object columns.

> > > 2) Is `pyarrow.Table` a valid option for long-term storage of
> > > dataframes? It seems to "just work", but mostly people stick to
> > > Parquet or something else.
> >
> > The Arrow format, in general, is NOT currently recommended for
> > long-term storage.
>
> I think after the 1.0.0 protocol version is released, we can begin to
> recommend Arrow for cold storage of data (as in "you'll be able to
> read these files in a year or two"), but design-wise it isn't intended
> as a data warehousing format like Parquet or ORC.
>
> > > 3) Parquet/Feather are as good as pyarrow.Table in terms of memory /
> > > storage size, but quite a bit slower on half-text dataframes (2-3x
> > > slower). Could I be doing something wrong?
> >
> > Parquet might be trying to do some sort of encoding. I'm not sure why
> > Feather would be slower than pyarrow.Table (but I'm not an expert in
> > Feather).
> >
> > > In case of mixed-type dataframes, JSON still seems like an option
> > > according to our benchmarks.
> >
> > If you wanted to use Arrow as a format, probably the right approach
> > here would be to make a new Union column for the mixed-type columns.
> > This would potentially slow down the write side, but make reading much
> > quicker.
> >
> > > 4) Feather seems REALLY close and similar to pyarrow.Table in all
> > > benchmarks. Is Feather using pyarrow.Table under the hood?
> >
> > My understanding is that the formats are nearly identical (mostly just
> > a difference in metadata), so the performance similarity isn't
> > surprising.

Alright, so speaking of pyarrow.Table vs Feather: if they are pretty much
the same, but Arrow alone shouldn't be used for long-term storage, does
the same apply to Feather, or can it be a valid option for my case?
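For reference, here is roughly what the two PyArrow code paths in my
benchmarks look like (a minimal sketch with toy data and made-up file
names; the real frames are user-provided):

```
import pandas as pd
import pyarrow as pa

# Toy frame with one mixed-type (dtype=object) column and one numeric column.
df = pd.DataFrame({"mixed": [1, "x", 3.5], "nums": [1.0, 2.0, 3.0]})

context = pa.default_serialization_context()

# 1) Serializing the dataframe directly: the dtype=object column
#    ends up being pickled internally.
buf_df = context.serialize(df).to_buffer()

# 2) Serializing via pyarrow.Table: cast dtype=object columns to str
#    first, since mixed-type columns aren't supported by the conversion.
#    This astype(str) time is included in the benchmarks.
for col in df.columns[df.dtypes == object]:
    df[col] = df[col].astype(str)
buf_table = context.serialize(pa.Table.from_pandas(df)).to_buffer()

# The file formats are benchmarked with the same astype(str) preprocessing.
df.to_parquet("frame.parquet")
df.to_feather("frame.feather")
```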
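If I understand the Union suggestion correctly, the idea is something like
the rough sketch below: one child array per underlying type, plus
type/offset arrays saying where each value lives. (The data here is made
up, and I haven't checked which storage formats can actually hold union
columns.)

```
import pyarrow as pa

# A mixed-type column [1, "x", 2] represented as a dense union:
type_ids = pa.array([0, 1, 0], type=pa.int8())   # which child holds each value
offsets = pa.array([0, 0, 1], type=pa.int32())   # position inside that child
ints = pa.array([1, 2], type=pa.int64())         # child 0: the integer values
strings = pa.array(["x"], type=pa.string())      # child 1: the string values

mixed = pa.UnionArray.from_dense(type_ids, offsets, [ints, strings])
print(mixed.type)
```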
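And to make the long-term-storage question concrete: by "storing a
pyarrow.Table" I mean writing the Arrow IPC file format to disk, roughly
like this sketch (file name made up):

```
import pandas as pd
import pyarrow as pa

table = pa.Table.from_pandas(pd.DataFrame({"a": ["x", "y"], "b": [1.0, 2.0]}))

# Write the table as an Arrow IPC file on disk.
with pa.OSFile("frame.arrow", "wb") as sink:
    writer = pa.RecordBatchFileWriter(sink, table.schema)
    writer.write_table(table)
    writer.close()

# Read it back via memory mapping.
with pa.memory_map("frame.arrow", "rb") as source:
    restored = pa.RecordBatchFileReader(source).read_all().to_pandas()
```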
> > On Wed, Jun 12, 2019 at 9:12 AM Bogdan Klichuk <klich...@gmail.com> wrote:
> > >
> > > I'm trying to come up with a solution for quick Pandas dataframe
> > > serialization and long-term storage. The dataframe content is tabular,
> > > but provided by the user, so it can be arbitrary: both completely text
> > > columns and completely numeric/boolean columns.
> > >
> > > ## Main goals:
> > >
> > > * Serialize the dataframe as quickly as possible in order to dump it
> > > to disk.
> > >
> > > * Use a format that I'll be able to load from disk back into a
> > > dataframe later.
> > >
> > > * The smallest possible memory footprint during serialization, and a
> > > compact output file.
> > >
> > > I have run benchmarks comparing different serialization methods,
> > > including:
> > >
> > > * Parquet: `df.to_parquet()`
> > > * Feather: `df.to_feather()`
> > > * JSON: `df.to_json()`
> > > * CSV: `df.to_csv()`
> > > * PyArrow: `pyarrow.default_serialization_context().serialize(df)`
> > > * PyArrow.Table:
> > > `pyarrow.default_serialization_context().serialize(pyarrow.Table.from_pandas(df))`
> > >
> > > Serialization speed and memory footprint during serialization are
> > > probably the biggest factors (read: get rid of the data, dump it to
> > > disk ASAP).
> > >
> > > Strangely, in our benchmarks serializing `pyarrow.Table` seems the
> > > most balanced and is quite fast.
> > >
> > > ## Questions:
> > >
> > > 1) Is there something I'm missing in understanding the difference
> > > between serializing a dataframe directly using PyArrow and serializing
> > > a `pyarrow.Table`? Table shines when dataframes mostly consist of
> > > strings, which is frequent in our case.
> > >
> > > 2) Is `pyarrow.Table` a valid option for long-term storage of
> > > dataframes? It seems to "just work", but mostly people stick to
> > > Parquet or something else.
> > >
> > > 3) Parquet/Feather are as good as pyarrow.Table in terms of memory /
> > > storage size, but quite a bit slower on half-text dataframes (2-3x
> > > slower). Could I be doing something wrong?
> > >
> > > In case of mixed-type dataframes, JSON still seems like an option
> > > according to our benchmarks.
> > >
> > > 4) Feather seems REALLY close and similar to pyarrow.Table in all
> > > benchmarks. Is Feather using pyarrow.Table under the hood?
> > >
> > > ----------------------------------------------------
> > > ## Benchmarks:
> > >
> > > https://docs.google.com/spreadsheets/d/1O81AEZrfGMTJAB-ozZ4YZmVzriKTDrm34u-gENgyiWo/edit#gid=0
> > >
> > > Since we have mixed-type columns, for the following methods we do
> > > astype(str) for all dtype=object columns before serialization:
> > > * pyarrow.Table
> > > * feather
> > > * parquet
> > >
> > > This is also expensive, but it has to be done since mixed-type columns
> > > are not supported for serialization in those formats. The time to
> > > perform this IS INCLUDED in the benchmarks.
> > >
> > > --
> > > Best wishes,
> > > Bogdan Klichuk

--
Best wishes,
Bogdan Klichuk