I think that lightweight encodings (like the FrameOfReference encoding Micah suggests) do make a lot of sense for Arrow. There are a few implementations of these in commercial systems. One related paper in the literature is http://www.cs.columbia.edu/~orestis/damon15.pdf
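For readers who haven't seen it, the idea behind frame-of-reference encoding can be sketched in a few lines. This is purely illustrative (the function names are mine, not the proposal's): store one base value plus small per-element offsets, so wide integers compress to narrow ones.

```python
# Illustrative frame-of-reference (FOR) sketch -- not the Arrow proposal's
# actual buffer layout.  The base ("frame of reference") is stored once;
# each element is stored as a small offset from it.

def for_encode(values):
    """Encode a list of ints as (base, offsets)."""
    base = min(values)
    offsets = [v - base for v in values]
    return base, offsets

def for_decode(base, offsets):
    """Reconstruct the original values from (base, offsets)."""
    return [base + o for o in offsets]

base, offsets = for_encode([1000003, 1000001, 1000007, 1000002])
# The offsets fit in a handful of bits each, while the raw values need ~20.
assert for_decode(base, offsets) == [1000003, 1000001, 1000007, 1000002]
```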
I would actually also look into some order-preserving dictionary encodings for strings that also allow vectorized processing (predicates, joins, ...) on encoded data, e.g. see https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf . Arrow can be used during shuffles for distributed joins/aggregations, and being able to operate on encoded data yields benefits (e.g. http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).

Thanks,
-Ippokratis.

On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > It's not just computation libraries, it's any library peeking inside
> > Arrow data. Currently, the Arrow data types are simple, which makes it
> > easy and non-intimidating to build data processing utilities around
> > them. If we start adding sophisticated encodings, we also raise the
> > cost of supporting Arrow for third-party libraries.
>
> This is another legitimate concern about complexity.
>
> To try to limit complexity, I simplified the proposal PR [1] to only one
> buffer encoding scheme (FrameOfReferenceIntEncoding) and one array
> encoding scheme (RLE), which I think will have the most benefit if
> exploited properly. Compression is removed.
>
> I'd like to get closure on the proposal one way or another. I think the
> question to be answered now is whether we are willing to introduce the
> additional complexity for the performance improvements these encodings
> can yield. Is there more data that people would like to see that would
> influence their decision?
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/pull/4815
>
> On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou <solip...@pitrou.net> wrote:
>
> > On Mon, 22 Jul 2019 08:40:08 -0700
> > Brian Hulette <hulet...@gmail.com> wrote:
> > > To me, the most important aspect of this proposal is the addition of
> > > sparse encodings, and I'm curious if there are any more objections to
> > > that specifically. So far I believe the only one is that it will make
> > > computation libraries more complicated. This is absolutely true, but I
> > > think it's worth that cost.
> >
> > It's not just computation libraries, it's any library peeking inside
> > Arrow data. Currently, the Arrow data types are simple, which makes it
> > easy and non-intimidating to build data processing utilities around
> > them. If we start adding sophisticated encodings, we also raise the
> > cost of supporting Arrow for third-party libraries.
> >
> > Regards
> >
> > Antoine.
> >
> >

-- 
-Ippokratis.
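[The other scheme Micah's simplified proposal retains, run-length encoding, can be sketched the same way. This is an illustrative Python sketch of the general RLE idea, not the actual layout in the proposal PR: repeated values collapse into (value, run-length) pairs, which is what makes sparse or highly repetitive columns cheap to store and scan.]

```python
# Illustrative run-length encoding (RLE) sketch -- not the Arrow proposal's
# actual array layout.  Consecutive repeats of a value are stored once,
# together with the length of the run.

def rle_encode(values):
    """Encode a list as a list of [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([v, 1])   # start a new run
    return runs

def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original list."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out

runs = rle_encode(["a", "a", "a", "b", "b", "a"])
assert runs == [["a", 3], ["b", 2], ["a", 1]]
assert rle_decode(runs) == ["a", "a", "a", "b", "b", "a"]
```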