Strange, I've pasted the contents into a google document at [1] [1] https://docs.google.com/document/d/1uJzWh63Iqk7FRbElHPhHrsmlfe0NIJ6M8-0kejPmwIw/edit
On Fri, Jul 5, 2019 at 12:32 PM Jacques Nadeau <jacq...@apache.org> wrote: > Hey Micah, you're formatting seems to be messed up on this mail. Some kind > of copy/paste error? > > On Fri, Jul 5, 2019 at 11:54 AM Micah Kornfield <emkornfi...@gmail.com> > wrote: > > > Hi Arrow-dev, > > > > I’d like to make a straw-man proposal to cover some features that I think > > would be useful to Arrow, and that I would like to make a > proof-of-concept > > implementation for in Java and C++. In particular, the proposal covers > > allowing for smaller data sizes via compression and encoding [1][2][8], > > data integrity [3] and avoiding unnecessary data transfer [4][5]. > > > > I’ve put together a PR [6] that has proposed changes to the flatbuffer > > metadata to support the new features. The PR introduces: > > > > - > > > > A new “SparseRecordBatch” that can support one of multiple possible > > encodings (both dense and sparse), compression and column elision. > > - > > > > A “Digest” message type to support optional data integrity. > > > > > > Going into more details on the specific features in the PR: > > > > 1. > > > > Sparse encodings for arrays and buffers. The guiding principles > behind > > the suggested encodings are to support encodings that can be exploited > > by > > compute engines for more efficient computation (I don’t think parquet > > style > > bit-packing belongs in Arrow). While the encodings don’t maintain > O(1) > > data element access, they support sublinear, O(log(N)), element > access. > > The > > suggested encodings are: > > 1. > > > > Array encodings: > > 1. > > > > Add a run-length encoding scheme to efficiently represent > repeated > > values (the actual scheme encodes run ends instead of length > > to preserve > > sub-linear random access). > > 2. > > > > Add a “packed” sparse representation (null values don’t take up > > space in value buffers) > > 2. > > > > Buffer encodings: > > 1. > > > > Add frame of reference integer encoding [7] (this allows for > lower > > bit-width encoding of integer types by subtracting a > > “reference” value from > > all values in the buffer). > > 2. > > > > Add a sparse integer set encoding. This encoding allows more > > efficient encoding of validity bit-masks for cases when all > > values are > > either null or not null. > > 2. > > > > Data compression. Similar to encodings but compression is solely for > > reduction of data at rest/on the wire. The proposal is to allow > > compression of individual buffers. Right now zstd is proposed, but I > > don’t > > feel strongly on the specific technologies here. > > 3. > > > > Column Elision. For some use-cases, like structured logging, the > > overhead of including array metadata for columns with no data present > > represents non-negligible overhead. The proposal provides a > mechanism > > for > > omitting meta-data for such arrays. > > 4. > > > > Data Integrity. While the arrow file format isn’t meant for archiving > > data, I think it is important to allow for optional native data > > integrity > > checks in the format. To this end, I proposed a new “Digest” message > > type > > that can be added after other messages to record a digest/hash of the > > preceding data. I suggested xxhash, but I don’t have a strong opinion > > here, > > as long as there is some minimal support that can potentially be > > expanded > > later. > > > > > > In the proposal I chose to use Tables and Unions everywhere for > flexibility > > but in all likelihood some could be replaced by enums. > > > > My initial plan would be to solely focus on an IPC mechanism that can > send > > a SparseRecordBatch and immediately translate it to a normal RecordBatch > in > > both Java and C++. > > > > As a practical matter the proposal represents a lot of work to get an MVP > > working in time for 1.0.0 release (provided they are accepted by the > > community), so I'd greatly appreciate if anyone wants to collaborate on > > this. > > > > If it is easier I’m happy to start a separate thread for feature if > people > > feel like it would make the conversation easier. I can also create a > > Google Doc for direct comments if that is preferred. > > > > Thanks, > > > > Micah > > > > > > > > P.S. In the interest of full disclosure, these ideas evolved in > > collaboration with Brian Hulette and other colleagues at Google who are > > interested in making use of Arrow in both internal and external projects. > > > > [1] https://issues.apache.org/jira/browse/ARROW-300 > > > > [2] https://issues.apache.org/jira/browse/ARROW-5224 > > > > [3] > > > > > https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E > > > > [4] > > > > > https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E > > > > [5] > > > > > https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812 > > > > [6] https://github.com/apache/arrow/pull/4815 > > > > [7] > > > > > https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/ > > > > [8] https://issues.apache.org/jira/browse/ARROW-5821 > > >