Hi Jacques,
Thanks for the quick response.

> I don't think most of this should be targeted for 1.0. It is a lot of
> change/enhancement and seems like it would likely substantially delay 1.0.


I agree it shouldn't block 1.0.  I think time-based releases are working
well for the community.  But if the features are implemented in Java and
C++, with integration tests between the two, in time for 1.0, should we
explicitly rule them out?  If not for 1.0, would the subsequent release
make sense?

> You suggested that this be done on the buffer level but it seems like that
> may be too narrow depending on batch size? What is the thinking here about
> tradeoffs around message versus batch?


Two reasons for this proposal:
- I'm not sure there is much value-add at the batch level vs. simply
compressing the whole transport channel.  It could be that for small batch
sizes compression mostly goes unused.  But if it is seen as valuable we
could certainly incorporate a batch-level aspect as well.
- At the buffer level you can potentially use more specialized compression
techniques that don't require larger amounts of data to be effective.  For
example, there is a JIRA open to consider using PFOR [1], which, if I
understand correctly, starts being effective once you have ~128 integers.
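
To make the buffer-level point concrete, here is a minimal
frame-of-reference sketch in C++ (the names are illustrative only, not
Arrow APIs; real PFOR additionally bit-packs the deltas and handles
outliers):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Frame-of-reference: subtract a reference value (here the minimum)
    // so the residuals fit in a narrower integer type.  Assumes a
    // non-empty input whose value range fits in 32 bits.
    struct ForEncoded {
      int64_t reference;
      std::vector<uint32_t> deltas;
    };

    ForEncoded ForEncode(const std::vector<int64_t>& values) {
      ForEncoded out;
      out.reference = *std::min_element(values.begin(), values.end());
      out.deltas.reserve(values.size());
      for (int64_t v : values) {
        out.deltas.push_back(static_cast<uint32_t>(v - out.reference));
      }
      return out;
    }

    // Element access stays O(1): each value decodes independently.
    int64_t ForDecodeAt(const ForEncoded& enc, size_t i) {
      return enc.reference + enc.deltas[i];
    }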

Thanks,
Micah

[1] https://github.com/lemire/FastPFor

On Fri, Jul 5, 2019 at 12:38 PM Jacques Nadeau <jacq...@apache.org> wrote:

> Initial thought: I don't think most of this should be targeted for 1.0. It
> is a lot of change/enhancement and seems like it would likely substantially
> delay 1.0. The one piece that seems least disruptive would be basic
> on-the-wire compression. You suggested that this be done on the buffer
> level but it seems like that may be too narrow depending on batch size?
> What is the thinking here about tradeoffs around message versus batch?
> When pipelining, we target relatively small batches, typically 256KB-1MB.
> Sometimes we might go up to 10MB but that is a pretty rare use case.
>
> On Fri, Jul 5, 2019 at 12:32 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
>> Hey Micah, your formatting seems to be messed up in this mail. Some
>> kind of copy/paste error?
>>
>> On Fri, Jul 5, 2019 at 11:54 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> Hi Arrow-dev,
>>>
>>> I’d like to make a straw-man proposal to cover some features that I
>>> think would be useful to Arrow, and that I would like to build a
>>> proof-of-concept implementation of in Java and C++.  In particular, the
>>> proposal covers allowing for smaller data sizes via compression and
>>> encoding [1][2][8], data integrity [3], and avoiding unnecessary data
>>> transfer [4][5].
>>>
>>> I’ve put together a PR [6] that has proposed changes to the flatbuffer
>>> metadata to support the new features.  The PR introduces:
>>>
>>>    - A new “SparseRecordBatch” that can support one of multiple
>>>    possible encodings (both dense and sparse), compression, and column
>>>    elision.
>>>    - A “Digest” message type to support optional data integrity.
>>>
>>>
>>> Going into more details on the specific features in the PR:
>>>
>>>    1. Sparse encodings for arrays and buffers.  The guiding principle
>>>    behind the suggested encodings is to support encodings that can be
>>>    exploited by compute engines for more efficient computation (I don’t
>>>    think parquet-style bit-packing belongs in Arrow).  While the
>>>    encodings don’t maintain O(1) data element access, they support
>>>    sublinear, O(log(N)), element access.  The suggested encodings are:
>>>       a. Array encodings:
>>>          i. Add a run-length encoding scheme to efficiently represent
>>>             repeated values (the actual scheme encodes run ends instead
>>>             of lengths to preserve sub-linear random access; see the
>>>             first sketch after this list).
>>>          ii. Add a “packed” sparse representation (null values don’t
>>>             take up space in value buffers).
>>>       b. Buffer encodings:
>>>          i. Add frame-of-reference integer encoding [7] (this allows
>>>             for lower bit-width encoding of integer types by
>>>             subtracting a “reference” value from all values in the
>>>             buffer).
>>>          ii. Add a sparse integer set encoding.  This encoding allows
>>>             more efficient encoding of validity bit-masks for cases
>>>             when all values are either null or not null.
>>>    2. Data compression.  Similar to encodings, but compression is
>>>    solely for reduction of data at rest/on the wire.  The proposal is
>>>    to allow compression of individual buffers (see the sketch after
>>>    the paragraph on IPC below).  Right now zstd is proposed, but I
>>>    don’t feel strongly about the specific technologies here.
>>>    3. Column elision.  For some use cases, like structured logging,
>>>    the array metadata for columns with no data present represents
>>>    non-negligible overhead.  The proposal provides a mechanism for
>>>    omitting metadata for such arrays.
>>>    4. Data integrity.  While the arrow file format isn’t meant for
>>>    archiving data, I think it is important to allow for optional
>>>    native data integrity checks in the format.  To this end, I
>>>    proposed a new “Digest” message type that can be added after other
>>>    messages to record a digest/hash of the preceding data (see the
>>>    second sketch after this list).  I suggested xxhash, but I don’t
>>>    have a strong opinion here, as long as there is some minimal
>>>    support that can potentially be expanded later.
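>>>
>>> To illustrate 1.a.i, here is a minimal run-end sketch in C++ (the
>>> struct and names are illustrative, not the proposed flatbuffers
>>> layout).  Storing cumulative run ends rather than run lengths is what
>>> lets a lookup binary-search for O(log N) random access:
>>>
>>>     #include <algorithm>
>>>     #include <cstdint>
>>>     #include <vector>
>>>
>>>     // Illustrative run-end-encoded array: run_ends = {3, 5, 9} with
>>>     // values = {7, 8, 7} represents 7 7 7 8 8 7 7 7 7.
>>>     struct RunEndEncoded {
>>>       std::vector<int32_t> run_ends;  // cumulative end of each run
>>>       std::vector<int64_t> values;    // one value per run
>>>     };
>>>
>>>     // Assumes 0 <= i < logical length of the array.
>>>     int64_t ValueAt(const RunEndEncoded& arr, int32_t i) {
>>>       // The run containing logical index i is the first whose end
>>>       // exceeds i.
>>>       auto it = std::upper_bound(arr.run_ends.begin(),
>>>                                  arr.run_ends.end(), i);
>>>       return arr.values[it - arr.run_ends.begin()];
>>>     }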
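>>>
>>> And to illustrate feature 4, a minimal sketch of producing and
>>> checking a trailing digest, assuming xxhash is the chosen hash (the
>>> DigestMessage struct here is hypothetical; only the idea of hashing
>>> the preceding bytes comes from the proposal):
>>>
>>>     #include <cstddef>
>>>     #include <cstdint>
>>>     #include <xxhash.h>  // https://github.com/Cyan4973/xxHash
>>>
>>>     // Hypothetical trailing digest over all bytes written since the
>>>     // previous digest message.
>>>     struct DigestMessage {
>>>       uint64_t hash;
>>>     };
>>>
>>>     DigestMessage MakeDigest(const uint8_t* bytes, size_t len) {
>>>       return DigestMessage{XXH64(bytes, len, /*seed=*/0)};
>>>     }
>>>
>>>     bool Verify(const DigestMessage& d, const uint8_t* bytes,
>>>                 size_t len) {
>>>       return d.hash == XXH64(bytes, len, /*seed=*/0);
>>>     }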
>>>
>>>
>>> In the proposal I chose to use Tables and Unions everywhere for
>>> flexibility
>>> but in all likelihood some could be replaced by enums.
>>>
>>> My initial plan would be to solely focus on an IPC mechanism that can
>>> send
>>> a SparseRecordBatch and immediately translate it to a normal RecordBatch
>>> in
>>> both Java and C++.
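>>>
>>> That translation step would involve decompressing each compressed
>>> buffer, so as a concrete sketch of feature 2, here is per-buffer
>>> compression using the zstd C API (the CompressedBuffer struct is
>>> illustrative and assumes the uncompressed length travels in the
>>> metadata):
>>>
>>>     #include <cstdint>
>>>     #include <stdexcept>
>>>     #include <vector>
>>>     #include <zstd.h>  // https://github.com/facebook/zstd
>>>
>>>     struct CompressedBuffer {
>>>       int64_t uncompressed_length;  // carried in buffer metadata
>>>       std::vector<uint8_t> data;    // a zstd frame
>>>     };
>>>
>>>     CompressedBuffer Compress(const uint8_t* src, size_t len) {
>>>       CompressedBuffer out{static_cast<int64_t>(len), {}};
>>>       out.data.resize(ZSTD_compressBound(len));
>>>       size_t n = ZSTD_compress(out.data.data(), out.data.size(), src,
>>>                                len, /*compressionLevel=*/1);
>>>       if (ZSTD_isError(n))
>>>         throw std::runtime_error(ZSTD_getErrorName(n));
>>>       out.data.resize(n);  // shrink to actual compressed size
>>>       return out;
>>>     }
>>>
>>>     std::vector<uint8_t> Decompress(const CompressedBuffer& in) {
>>>       std::vector<uint8_t> out(in.uncompressed_length);
>>>       size_t n = ZSTD_decompress(out.data(), out.size(),
>>>                                  in.data.data(), in.data.size());
>>>       if (ZSTD_isError(n))
>>>         throw std::runtime_error(ZSTD_getErrorName(n));
>>>       return out;
>>>     }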
>>>
>>> As a practical matter the proposal represents a lot of work to get an
>>> MVP working in time for the 1.0.0 release (provided the features are
>>> accepted by the community), so I'd greatly appreciate it if anyone
>>> wants to collaborate on this.
>>>
>>> I'm happy to start a separate thread per feature if people feel like
>>> that would make the conversation easier.  I can also create a
>>> Google Doc for direct comments if that is preferred.
>>>
>>> Thanks,
>>>
>>> Micah
>>>
>>>
>>>
>>> P.S. In the interest of full disclosure, these ideas evolved in
>>> collaboration with Brian Hulette and other colleagues at Google who are
>>> interested in making use of Arrow in both internal and external projects.
>>>
>>> [1] https://issues.apache.org/jira/browse/ARROW-300
>>>
>>> [2] https://issues.apache.org/jira/browse/ARROW-5224
>>>
>>> [3]
>>>
>>> https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
>>>
>>> [4]
>>>
>>> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
>>>
>>> [5]
>>>
>>> https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
>>>
>>> [6] https://github.com/apache/arrow/pull/4815
>>>
>>> [7]
>>>
>>> https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
>>>
>>> [8] https://issues.apache.org/jira/browse/ARROW-5821
>>>
>>
