Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Micah Kornfield Fri, 05 Jul 2019 12:48:44 -0700

Strange, I've pasted the contents into a google document at [1]

[1]
https://docs.google.com/document/d/1uJzWh63Iqk7FRbElHPhHrsmlfe0NIJ6M8-0kejPmwIw/edit



On Fri, Jul 5, 2019 at 12:32 PM Jacques Nadeau <jacq...@apache.org> wrote:

> Hey Micah, you're formatting seems to be messed up on this mail. Some kind
> of copy/paste error?
>
> On Fri, Jul 5, 2019 at 11:54 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Hi Arrow-dev,
> >
> > I’d like to make a straw-man proposal to cover some features that I think
> > would be useful to Arrow, and that I would like to make a
> proof-of-concept
> > implementation for in Java and C++.  In particular, the proposal covers
> > allowing for smaller data sizes via compression and encoding [1][2][8],
> > data integrity [3] and avoiding unnecessary data transfer [4][5].
> >
> > I’ve put together a PR  [6] that has proposed changes to the flatbuffer
> > metadata to support the new features.  The PR introduces:
> >
> >    -
> >
> >    A new “SparseRecordBatch” that can support one of multiple possible
> >    encodings (both dense and sparse), compression and column elision.
> >    -
> >
> >    A “Digest” message type to support optional data integrity.
> >
> >
> > Going into more details on the specific features in the PR:
> >
> >    1.
> >
> >    Sparse encodings for arrays and buffers.  The guiding principles
> behind
> >    the suggested encodings are to support encodings that can be exploited
> > by
> >    compute engines for more efficient computation (I don’t think parquet
> > style
> >    bit-packing belongs in Arrow).  While the encodings don’t maintain
> O(1)
> >    data element access, they support sublinear, O(log(N)), element
> access.
> > The
> >    suggested encodings are:
> >    1.
> >
> >       Array encodings:
> >       1.
> >
> >          Add a run-length encoding scheme to efficiently represent
> repeated
> >          values (the actual scheme encodes run ends instead of length
> > to preserve
> >          sub-linear random access).
> >          2.
> >
> >          Add a “packed” sparse representation (null values don’t take up
> >          space in value buffers)
> >          2.
> >
> >       Buffer encodings:
> >       1.
> >
> >          Add frame of reference integer encoding [7] (this allows for
> lower
> >          bit-width encoding of integer types by subtracting a
> > “reference” value from
> >          all values in the buffer).
> >          2.
> >
> >          Add a sparse integer set encoding.  This encoding allows more
> >          efficient encoding of validity bit-masks for cases when all
> > values are
> >          either null or not null.
> >          2.
> >
> >    Data compression.  Similar to encodings but compression is solely for
> >    reduction of data at rest/on the wire.  The proposal is to allow
> >    compression of individual buffers. Right now zstd is proposed, but I
> > don’t
> >    feel strongly on the specific technologies here.
> >    3.
> >
> >    Column Elision.  For some use-cases, like structured logging, the
> >    overhead of including array metadata for columns with no data present
> >    represents non-negligible overhead.   The proposal provides a
> mechanism
> > for
> >    omitting meta-data for such arrays.
> >    4.
> >
> >    Data Integrity.  While the arrow file format isn’t meant for archiving
> >    data, I think it is important to allow for optional native data
> > integrity
> >    checks in the format.  To this end, I proposed a new “Digest” message
> > type
> >    that can be added after other messages to record a digest/hash of the
> >    preceding data. I suggested xxhash, but I don’t have a strong opinion
> > here,
> >    as long as there is some minimal support that can potentially be
> > expanded
> >    later.
> >
> >
> > In the proposal I chose to use Tables and Unions everywhere for
> flexibility
> > but in all likelihood some could be replaced by enums.
> >
> > My initial plan would be to solely focus on an IPC mechanism that can
> send
> > a SparseRecordBatch and immediately translate it to a normal RecordBatch
> in
> > both Java and C++.
> >
> > As a practical matter the proposal represents a lot of work to get an MVP
> > working in time for 1.0.0 release (provided they are accepted by the
> > community), so I'd greatly appreciate if anyone wants to collaborate on
> > this.
> >
> > If it is easier I’m happy to start a separate thread for feature if
> people
> > feel like it would make the conversation easier.  I can also create a
> > Google Doc for direct comments if that is preferred.
> >
> > Thanks,
> >
> > Micah
> >
> >
> >
> > P.S. In the interest of full disclosure, these ideas evolved in
> > collaboration with Brian Hulette and other colleagues at Google who are
> > interested in making use of Arrow in both internal and external projects.
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-300
> >
> > [2]  https://issues.apache.org/jira/browse/ARROW-5224
> >
> > [3]
> >
> >
> https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
> >
> > [4]
> >
> >
> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
> >
> > [5]
> >
> >
> https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
> >
> > [6] https://github.com/apache/arrow/pull/4815
> >
> > [7]
> >
> >
> https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
> >
> > [8] https://issues.apache.org/jira/browse/ARROW-5821
> >
>

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Reply via email to