To me, the most important aspect of this proposal is the addition of sparse
encodings, and I'm curious whether there are any further objections to that
specifically. So far, the only one I'm aware of is that it will make
computation libraries more complicated. That is absolutely true, but I
think it's worth the cost.

It's been suggested on this list and elsewhere [1] that sparse encodings
that can be operated on without fully decompressing should be added to the
Arrow format. The longer we continue to develop computation libraries
without considering those schemes, the harder it will be to add them.

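As a concrete illustration of "operated on without fully decompressing",
here is a minimal, hypothetical sketch (none of this is Arrow API; RleRun
and SumRle are made up for illustration) that sums a run-length-encoded
int64 column directly from its runs:

    // Hypothetical sketch: sum a run-length-encoded int64 column
    // without materializing the decoded values.
    #include <cstdint>
    #include <vector>

    struct RleRun {
      int64_t value;   // the repeated value
      int64_t length;  // how many times it repeats
    };

    // Walk the runs: O(number of runs) instead of O(number of values).
    int64_t SumRle(const std::vector<RleRun>& runs) {
      int64_t total = 0;
      for (const RleRun& run : runs) {
        total += run.value * run.length;
      }
      return total;
    }

The point for computation libraries is that the work scales with the
number of runs rather than the number of logical values, at the cost of
more code paths per kernel.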
[1]
https://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html


On Sat, Jul 13, 2019 at 9:35 AM Wes McKinney <wesmck...@gmail.com> wrote:

> On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou <solip...@pitrou.net>
> wrote:
> >
> > On Fri, 12 Jul 2019 20:37:15 -0700
> > Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >
> > > > If the latter, I wonder why Parquet cannot simply be used instead of
> > > > reinventing something similar but different.
> > >
> > > This is a reasonable point.  However, there is a continuum here between
> > > file size and read/write times.  Parquet will likely always be the
> > > smallest, with the largest times to convert to and from Arrow.  An
> > > uncompressed Feather/Arrow file will likely always take the most space
> > > but will have much faster conversion times.
> >
> > I'm curious whether the Parquet conversion times are inherent to the
> > Parquet format or due to inefficiencies in the implementation.
> >
>
> Parquet is fundamentally more complex to decode. Consider the several
> layers of logic that must be applied before values end up in the right
> place:
>
> * Data pages are usually compressed, and a column consists of many
> data pages each having a Thrift header that must be deserialized
> * Values are usually dictionary-encoded, and the dictionary indices
> are encoded using a hybrid bit-packed / RLE scheme
> * Null/not-null is encoded in definition levels
> * Only non-null values are stored, so when decoding to Arrow, values
> have to be "moved into place"
>
> The current C++ implementation could certainly be made faster. One
> consideration with Parquet is that the files are much smaller, so when
> you are reading them over the network, the effective end-to-end time,
> including IO and deserialization, will frequently win.
>
> > Regards
> >
> > Antoine.
> >
> >
>
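
To make Wes's last two bullets concrete, here is a simplified,
hypothetical sketch (the names are illustrative, not the parquet-cpp
API, and real Parquet definition levels also encode nesting) of
resolving dictionary indices and "moving values into place" according
to definition levels:

    // Simplified sketch of two decode layers: dictionary lookup plus
    // scattering dense non-null values into their logical slots.
    // Definition level 1 = value present, 0 = null (flat column only).
    #include <cstdint>
    #include <optional>
    #include <string>
    #include <vector>

    std::vector<std::optional<std::string>> DecodeColumn(
        const std::vector<std::string>& dictionary,  // unique values
        const std::vector<uint32_t>& indices,        // one per non-null value
        const std::vector<int16_t>& def_levels) {    // one per logical slot
      std::vector<std::optional<std::string>> out;
      out.reserve(def_levels.size());
      size_t value_idx = 0;
      for (int16_t level : def_levels) {
        if (level == 1) {
          // Non-null slot: consume the next dictionary index.
          out.push_back(dictionary[indices[value_idx++]]);
        } else {
          // Null slot: no value is stored for it.
          out.push_back(std::nullopt);
        }
      }
      return out;
    }

In the real decoder, the indices additionally arrive bit-packed/RLE
inside compressed data pages, which is exactly the layered work being
described above.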
