[GitHub] [arrow] emkornfield commented on pull request #4815: [DISCUSS] Add strawman proposal for sparseness and data integrity

GitBox Mon, 26 Apr 2021 08:34:28 -0700


emkornfield commented on pull request #4815:
URL: https://github.com/apache/arrow/pull/4815#issuecomment-826935381



   @alamb thanks for the comments.
   
   > You would encode that, like a dictionary array with one buffer of values 
([1,3]) and another buffer of run lengths ([2, 4]).
   
   > This proposal seems to add another dimension (as a new type of 
SparseRecordBatch at a lower level. )
   
   Hi Andrew, thanks.  The encoding here is slightly different from encoding 
the run lengths directly: it encodes [cumulative run 
lengths](https://github.com/apache/arrow/pull/4815/files#r308298514).  The main 
reason for proposing cumulative values is that they still allow sublinear 
(O(log n) by binary search, though not O(1)) access to elements in a batch.  
So instead of the run lengths `[2, 4]` you would have `[2, 6]`.
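
   As a rough illustration of why the cumulative form helps, here is a minimal 
Python sketch.  It is not Arrow code; the helper names `to_cumulative` and 
`value_at` are made up for this example.  It converts plain run lengths to 
cumulative run ends and then finds the value at a logical index with a binary 
search over those ends:

```python
from bisect import bisect_right
from itertools import accumulate


def to_cumulative(run_lengths):
    """Turn plain run lengths into cumulative run ends, e.g. [2, 4] -> [2, 6]."""
    return list(accumulate(run_lengths))


def value_at(values, run_ends, i):
    """Return the logical element at index i using binary search on run_ends.

    run_ends[k] is the exclusive end index of run k, so the run containing i
    is the first one whose end is strictly greater than i.
    """
    total = run_ends[-1] if run_ends else 0
    if not 0 <= i < total:
        raise IndexError(i)
    return values[bisect_right(run_ends, i)]


# The example from this thread: values [1, 3] with run lengths [2, 4].
values = [1, 3]
run_ends = to_cumulative([2, 4])
decoded = [value_at(values, run_ends, i) for i in range(run_ends[-1])]
```

   With plain run lengths the same lookup would need a linear scan that sums 
lengths until it passes the index; with cumulative ends each lookup is a 
single binary search.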
   
   The new message type SparseRecordBatch is motivated by two related concerns:
   1.  Previously, the overhead that extra metadata adds to RecordBatch was 
found to be too high for some scenarios.  Adding a new RLE type 
probably wouldn't be too bad in terms of metadata cost, but I think having a 
more general, extensible framework would serve us better in the long run. 
   2.  Adding a new message type makes it less likely to break existing IPC 
code paths and can be bolted on (there are pros and cons to this).
   
   So if the sole goal is supporting RLE, I could see us modifying the 
existing RecordBatch message instead.


