Re: [DISCUSS] Format additions for encoding/compression

Wes McKinney Thu, 23 Jan 2020 10:20:56 -0800

Parquet is most relevant in scenarios where filesystem IO is constrained
(spinning rust HDD, network FS, cloud storage / S3 / GCS). For those
use cases memory-mapped Arrow is not viable.


Against local NVMe (> 2000 MB/s read throughput) your mileage may vary.

On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques
<[email protected]> wrote:
>
> What's the point of having zero copy if the OS is doing the
> decompression in the kernel (which defeats the zero-copy argument)? You
> might as well just use Parquet without filesystem compression. I
> would rather have a compression algorithm that the columnar engine can
> benefit from [1] than marginally improve a filesystem- or OS-specific
> feature.
>
> François
>
> [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
>
>
>
>
> On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <[email protected]> wrote:
> >
> > This could also have utility in memory via things like zram/zswap, right?
> > Mac also has a memory compressor?
> >
> > I don't think Parquet is an option for me unless the integration with Arrow
> > is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
> > next to nothing about Parquet.
> >
> > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <[email protected]> wrote:
> > >
> > >
> > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > Perhaps related to this thread, are there any current or proposed tools to
> > > > transform columns for fixed-length data types according to a "shuffle?"
> > > > For precedent see the implementation of the shuffle filter in hdf5.
> > > >
> > > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > >
> > > > For example, the column (length 3) would store bytes 00 00 00 00 00 00 00
> > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00
> > > > 02 00 00 00 03 (I'm writing big-endian even if that is not actually the
> > > > case).
> > > >
> > > > Value(1) would return 00 00 00 02 by referring to some metadata flag that
> > > > the column is shuffled, stitching the bytes back together at call time.
> > > >
> > > > Thus if the column pages were backed by a memory map to something like
> > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
> > > > underlying disk usage due to better run lengths.
> > > >
> > > > It would enable a space/time tradeoff that could be useful?  The filesystem
> > > > itself cannot easily do this particular compression transform since it
> > > > benefits from knowing the shape of the data.
> > >
> > > For the record, there's a pull request adding this encoding to the
> > > Parquet C++ specification.
> > >
> > > Regards
> > >
> > > Antoine.
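
For readers following along, here is a minimal sketch of the byte shuffle
John describes in the quoted message, using his three 32-bit big-endian
values. The function names (shuffle, unshuffle, value_at) are hypothetical
and chosen for illustration only; this is not the HDF5 filter, the Parquet
encoding, or any Arrow API, just the transform as worded in the thread.

```python
import struct


def shuffle(data: bytes, width: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on."""
    if len(data) % width:
        raise ValueError("data length must be a multiple of width")
    return b"".join(data[i::width] for i in range(width))


def unshuffle(data: bytes, width: int) -> bytes:
    """Invert shuffle(): interleave the byte streams back into elements."""
    if len(data) % width:
        raise ValueError("data length must be a multiple of width")
    count = len(data) // width
    out = bytearray(len(data))
    for i in range(width):
        # Stream i holds byte i of every element, in element order.
        out[i::width] = data[i * count:(i + 1) * count]
    return bytes(out)


def value_at(shuffled: bytes, width: int, index: int) -> bytes:
    """Stitch one element back together without unshuffling the column."""
    count = len(shuffled) // width
    return bytes(shuffled[b * count + index] for b in range(width))


# The three 32-bit numbers from the thread, written big-endian.
values = struct.pack(">3I", 1, 2, 3)
shuffled = shuffle(values, 4)

print(shuffled.hex(" "))                 # 00 00 00 00 00 00 00 00 00 01 02 03
print(value_at(shuffled, 4, 1).hex(" "))  # 00 00 00 02
```

The per-element lookup in value_at is the "stitching the bytes back together
at call time" step: it costs one strided read per byte of the element, which
is the time side of the space/time tradeoff mentioned above.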
