Hi Raphael,
One suggestion that might make your use case work better with Parquet is
to build some encoding/decoding logic into your application. If I
understand it correctly, you are storing strings in the format
"${ISO_8601 timestamp}.${string1}.${string2}".
If this is the case, you can split each string into its individual
components, convert the date to seconds or milliseconds since the Unix
epoch, and then store the three columns individually (1 integer, 2
strings) in a Parquet file, doing the reverse conversion when you read
the data. An alternative to the epoch conversion would be to split the
date string into two or more substrings (e.g. split date and time, or go
even further and split year, month, day, hour, minute, second), which
would also work much better than one large string.
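For example, something along these lines (just a rough sketch using
pandas with pyarrow under the hood; the column names, file name, and
sample values are placeholders, and it assumes the strings contain no
other "."):

import pandas as pd

raw = pd.DataFrame({"raw": [
    "2020-01-28T13:57:00.AIA_193.LEV1",   # placeholder values
    "2020-01-28T13:57:12.AIA_211.LEV1",
]})

def encode(df):
    # "timestamp.string1.string2" -> one integer column + two string columns
    parts = df["raw"].str.split(".", n=2, expand=True)
    ts = pd.to_datetime(parts[0])
    return pd.DataFrame({
        "ts": (ts - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s"),
        "s1": parts[1],
        "s2": parts[2],
    })

def decode(df):
    # reverse conversion when reading the data back
    ts = pd.to_datetime(df["ts"], unit="s").dt.strftime("%Y-%m-%dT%H:%M:%S")
    return ts + "." + df["s1"] + "." + df["s2"]

encode(raw).to_parquet("metadata.parquet")              # 3 narrow columns
restored = decode(pd.read_parquet("metadata.parquet"))  # original strings

The three narrow columns (especially the integer timestamps) should also
encode and compress much better than one long, mostly-unique string per
row.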
Hope this helps.
Micah
On Tue, Jan 28, 2020 at 1:57 PM Wes McKinney <[email protected]> wrote:
> hi Raphael,
>
> yes, to elaborate on my comment from SO -- Parquet files apply a
> "layered" encoding and compression strategy that works really well on
> datasets with a lot of repeated values. This can yield substantially
> better compression ratios than naive compression (simply compressing
> bytes using a compressor like Snappy, ZLIB, or LZ4).
>
> Where Parquet may perform less well is in cases where the values are
> mostly unique. In the Python case you showed, the problem is made
> worse by the fact that the unique strings have to be converted into
> PyString objects using the C API PyString_FromStringAndSize (pickle
> has to do this, too, but there's more decoding / decompression effort
> that has to be done first when reading the Parquet file).
>
> Parquet has a lot of benefits over Python pickle, not least that it can
> be read by many different systems and can be processed in chunks (versus
> an all-or-nothing file load).
>
> Others may have other comments. Hope this helps.
>
> - Wes
>
> On Tue, Jan 28, 2020 at 3:27 PM Attie, Raphael (GSFC-671.0)[GEORGE
> MASON UNIVERSITY] <[email protected]> wrote:
> >
> > Dear Wes,
> >
> > I am responding to your offer to discuss my post on Stack Overflow at:
> https://stackoverflow.com/questions/59432045/pandas-dataframe-slower-to-read-from-parquet-than-from-pickle-file?noredirect=1#comment105050134_59432045
> >
> > You have explained in the comment section that the kind of dataset I was
> manipulating is not ideal for Parquet. I would be happy to know more about
> this.
> >
> > Here is more context:
> > I am working at NASA Goddard Space Flight Center on data from the Solar
> Dynamics Observatory, which sends down 1.5 TB of data per day of
> observations of the Sun. The dataset I am currently working with, and
> which I described in my SO post, is a subset of the metadata associated
> with those observations.
> >
> > I got interested in Parquet after getting an error with HDF5: my larger
> dataset simply resulted in an error, whereas a smaller version of it had
> no problem, while Parquet worked regardless of the size of the dataset.
> Also, I am going to use Dask, and Dask with GPUs (from NVIDIA RAPIDS),
> which I believe support Parquet. This is what got me interested in this
> format.
> >
> > Thanks
> >
> > Raphael Attie
> >
> >
> > - - - - - - - - - -- - - - -- - - - -- - - - -- - - - -- -
> > Dr. Raphael Attié (GSFC-6710)
> > NASA / Goddard Space Flight Center
> > George Mason University
> > Office (NASA GSFC, room 041): 301-286-0360
> > Cell: 301-631-4954
> > Email (1): [email protected]
> > Email (2): [email protected]
> >
>