I think that lightweight encodings (like the FrameOfReference encoding Micah suggests) do make a lot of sense for Arrow. There are a few implementations of these in commercial systems. One related paper in the literature is http://www.cs.columbia.edu/~orestis/damon15.pdf
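For readers who haven't seen it, the idea behind frame-of-reference encoding can be sketched in a few lines. This is purely illustrative (the function names are mine, not the proposal's): store one base value plus small per-element offsets, so wide integers compress to narrow ones.

```python
# Illustrative frame-of-reference (FOR) sketch -- not the Arrow proposal's
# actual buffer layout.  The base ("frame of reference") is stored once;
# each element is stored as a small offset from it.

def for_encode(values):
    """Encode a list of ints as (base, offsets)."""
    base = min(values)
    offsets = [v - base for v in values]
    return base, offsets

def for_decode(base, offsets):
    """Reconstruct the original values from (base, offsets)."""
    return [base + o for o in offsets]

base, offsets = for_encode([1000003, 1000001, 1000007, 1000002])
# The offsets fit in a handful of bits each, while the raw values need ~20.
assert for_decode(base, offsets) == [1000003, 1000001, 1000007, 1000002]
```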
I would actually also look into some order-preserving dictionary encodings for strings that also allow vectorized processing (predicates, joins, ...) on encoded data, e.g. see https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf . Arrow can be used during shuffles for distributed joins/aggregations, and being able to operate on encoded data yields benefits (e.g. http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).

Thanks,
-Ippokratis.

On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > It's not just computation libraries, it's any library peeking inside
> > Arrow data. Currently, the Arrow data types are simple, which makes it
> > easy and non-intimidating to build data processing utilities around
> > them. If we start adding sophisticated encodings, we also raise the
> > cost of supporting Arrow for third-party libraries.
>
> This is another legitimate concern about complexity.
>
> To try to limit complexity, I simplified the proposal PR [1] to only one
> buffer encoding scheme (FrameOfReferenceIntEncoding) and one array
> encoding scheme (RLE), which I think will have the most benefit if
> exploited properly. Compression is removed.
>
> I'd like to get closure on the proposal one way or another. I think the
> question to be answered now is whether we are willing to introduce the
> additional complexity for the performance improvements these encodings
> can yield. Is there more data that people would like to see that would
> influence their decision?
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/pull/4815
>
> On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou <solip...@pitrou.net> wrote:
>
> > On Mon, 22 Jul 2019 08:40:08 -0700
> > Brian Hulette <hulet...@gmail.com> wrote:
> > > To me, the most important aspect of this proposal is the addition of
> > > sparse encodings, and I'm curious if there are any more objections to
> > > that specifically. So far I believe the only one is that it will make
> > > computation libraries more complicated. This is absolutely true, but I
> > > think it's worth that cost.
> >
> > It's not just computation libraries, it's any library peeking inside
> > Arrow data. Currently, the Arrow data types are simple, which makes it
> > easy and non-intimidating to build data processing utilities around
> > them. If we start adding sophisticated encodings, we also raise the
> > cost of supporting Arrow for third-party libraries.
> >
> > Regards
> >
> > Antoine.
> >
> >

-- 
-Ippokratis.
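[The other scheme Micah's simplified proposal retains, run-length encoding, can be sketched the same way. This is an illustrative Python sketch of the general RLE idea, not the actual layout in the proposal PR: repeated values collapse into (value, run-length) pairs, which is what makes sparse or highly repetitive columns cheap to store and scan.]

```python
# Illustrative run-length encoding (RLE) sketch -- not the Arrow proposal's
# actual array layout.  Consecutive repeats of a value are stored once,
# together with the length of the run.

def rle_encode(values):
    """Encode a list as a list of [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([v, 1])   # start a new run
    return runs

def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original list."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out

runs = rle_encode(["a", "a", "a", "b", "b", "a"])
assert runs == [["a", 3], ["b", 2], ["a", 1]]
assert rle_decode(runs) == ["a", "a", "a", "b", "b", "a"]
```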