hi Raphael,

yes, to elaborate on my comment from SO -- Parquet files apply a
"layered" encoding and compression strategy: values are first encoded
(e.g. dictionary- or run-length-encoded), and the encoded pages are
then compressed with a general-purpose codec. This works really well
on datasets with a lot of repeated values and can yield substantially
better compression ratios than naive compression (simply compressing
raw bytes with a compressor like Snappy, ZLIB, or LZ4).
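
To make this concrete, here is a minimal sketch with pyarrow (the file
names are placeholders, and exact sizes will vary with your pyarrow
version and the default Snappy compression):

import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Heavy repetition: dictionary encoding plus page compression
# shrinks this column dramatically
repeated = pa.table({"col": np.random.choice(["a", "b", "c"], 1_000_000)})
pq.write_table(repeated, "repeated.parquet")

# Mostly-unique strings: little for the encoder to exploit
unique = pa.table({"col": ["value-%d" % i for i in range(1_000_000)]})
pq.write_table(unique, "unique.parquet")

print(os.path.getsize("repeated.parquet"))  # much smaller
print(os.path.getsize("unique.parquet"))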

Where Parquet may perform less well is in cases where the values are
mostly unique. In the Python case you showed, the problem is made
worse by the fact that the unique strings have to be converted into
Python string objects using the C API function
PyUnicode_FromStringAndSize (pickle has to do this, too, but there is
more decoding / decompression work that has to be done first when
reading the Parquet file).
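
As a rough sketch of that comparison (not a rigorous benchmark; this
assumes pyarrow is installed as the pandas parquet engine):

import time
import pandas as pd

df = pd.DataFrame({"s": ["unique-%d" % i for i in range(2_000_000)]})
df.to_pickle("data.pkl")
df.to_parquet("data.parquet")

t0 = time.time()
pd.read_pickle("data.pkl")
print("pickle: %.2f s" % (time.time() - t0))

t0 = time.time()
# decompress, decode, then build one Python string object per value
pd.read_parquet("data.parquet")
print("parquet: %.2f s" % (time.time() - t0))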

Parquet has a lot of benefits over Python pickle, not least that it
can be read by many different systems and can be processed in chunks
(versus an all-or-nothing file load).
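
For example, with pyarrow you can read one row group at a time instead
of materializing the whole file (a sketch, reusing the placeholder file
from above):

import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
for i in range(pf.num_row_groups):
    table = pf.read_row_group(i)  # a pyarrow.Table holding just that chunk
    # ... process `table`, e.g. table.to_pandas() ...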

Others may have additional comments. Hope this helps.

- Wes

On Tue, Jan 28, 2020 at 3:27 PM Attie, Raphael (GSFC-671.0)[GEORGE
MASON UNIVERSITY] <[email protected]> wrote:
>
> Dear Wes,
>
> I am responding to your offer to discuss my post on Stack Overflow at:
> https://stackoverflow.com/questions/59432045/pandas-dataframe-slower-to-read-from-parquet-than-from-pickle-file?noredirect=1#comment105050134_59432045
>
> You have explained in the comment section that the kind of dataset I was 
> manipulating is not ideal for Parquet. I would be happy to know more about 
> this.
>
> Here is more context:
> I am working at NASA Goddard Space Flight Center on data from the Solar 
> Dynamics Observatory. It sends 1.5 TB of data per day of observations of the 
> Sun. The dataset I am currently working, and described in my SO post are a 
> subset of the metadata associated with those observations.
>
> I got interested in Parquet after running into an error with HDF5: the
> full-size dataset simply raised an error, whereas a smaller version of
> it caused no problem. Parquet worked regardless of the size of the
> dataset. Also, I am going to use Dask, including Dask on GPUs (from
> NVIDIA RAPIDS), which I believe supports Parquet. This is what got me
> interested in this format.
>
> Thanks
>
> Raphael Attie
>
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Dr. Raphael Attié (GSFC-6710)
> NASA / Goddard Space Flight Center
> George Mason University
> Office (NASA GSFC, room 041): 301-286-0360
> Cell: 301-631-4954
> Email (1): [email protected]
> Email (2): [email protected]
>
