felipecrv opened a new issue, #34237:
URL: https://github.com/apache/arrow/issues/34237

   ### Describe the enhancement requested
   
   ## Hypothesis
   
   Limiting the length of runs (not the length of logical arrays) is going to 
prevent multi-language integration pains at a very low storage/memory cost: 
compressing ~2B elements into a single run already yields a great compression 
factor. If we need to produce longer runs, we can simply append multiple 
`< INT_MAX`-sized runs.
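
As a rough illustration of the idea (not Arrow's implementation), the sketch below turns `(value, length)` pairs into run-end encoded `run_ends`/`values` lists and splits any run that exceeds a cap. The function name `encode_capped_runs`, the `max_run_length` parameter, and the choice of an inclusive cap of `2**31 - 1` are assumptions made for this example.

```python
INT32_MAX = 2**31 - 1  # assumed cap; stands in for INT_MAX in the text


def encode_capped_runs(runs, max_run_length=INT32_MAX):
    """Hypothetical helper: build (run_ends, values) from (value, length)
    pairs, splitting every run longer than max_run_length into pieces."""
    run_ends = []    # cumulative logical end index of each physical run
    values = []      # value repeated by each physical run
    logical_end = 0  # logical length encoded so far
    for value, length in runs:
        remaining = length
        while remaining > 0:
            piece = min(remaining, max_run_length)
            logical_end += piece
            run_ends.append(logical_end)
            values.append(value)
            remaining -= piece
    return run_ends, values


run_ends, values = encode_capped_runs([("a", 5_000_000_000), ("b", 3)])
```

Here a logical run of 5 billion `"a"` values costs only two extra physical entries, which is the "very low storage/memory cost" the hypothesis refers to.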
   
   ## The spec on array lengths
   
   https://arrow.apache.org/docs/dev/format/Columnar.html#array-lengths
   
   > Array lengths are represented in the Arrow metadata as a 64-bit signed 
integer. An implementation of Arrow is considered valid even if it only 
supports lengths up to the maximum 32-bit signed integer, though. If using 
Arrow in a multi-language environment, we recommend limiting lengths to 
2^31 - 1 elements or less. Larger data sets can be represented using multiple array 
chunks.
   
   The solution proposed by the spec for languages that don't support 64-bit 
integers is to use multiple array chunks. Chunking the physical arrays of a 
run-end encoded logical array is much easier when we don't have to split in the 
middle of a run. So limiting the runs at `INT_MAX` means we only have to worry 
about the regular split of logical length.
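
A minimal sketch of why capped runs make chunking simple, assuming run ends are stored as cumulative logical offsets: if no run is longer than the chunk limit, a greedy pass can always cut between runs and never inside one. The name `chunk_at_run_boundaries` and its parameters are hypothetical, and this is not how Arrow C++ necessarily does it.

```python
INT32_MAX = 2**31 - 1  # assumed chunk-length limit


def chunk_at_run_boundaries(run_ends, values, max_chunk_length=INT32_MAX):
    """Hypothetical helper: split run-end encoded data into chunks whose
    logical length is at most max_chunk_length, cutting only between runs.
    Run ends in each chunk are rebased to start at that chunk's offset 0."""
    chunks = []
    chunk_ends, chunk_values = [], []
    chunk_start = 0  # logical offset where the current chunk begins
    prev_end = 0     # logical end of the previous run
    for end, value in zip(run_ends, values):
        if end - prev_end > max_chunk_length:
            # Only possible when runs were not capped beforehand.
            raise ValueError("run is longer than the chunk limit")
        if chunk_ends and end - chunk_start > max_chunk_length:
            # Current run does not fit: close the chunk at the run boundary.
            chunks.append((chunk_ends, chunk_values))
            chunk_ends, chunk_values = [], []
            chunk_start = prev_end
        chunk_ends.append(end - chunk_start)
        chunk_values.append(value)
        prev_end = end
    if chunk_ends:
        chunks.append((chunk_ends, chunk_values))
    return chunks


# Small limit so the behaviour is easy to follow by hand.
chunks = chunk_at_run_boundaries([3, 6, 7, 10], ["a", "a", "a", "b"],
                                 max_chunk_length=6)
```

Without the cap on run length, the `ValueError` branch would instead have to split a run in two, which is the extra case the proposal avoids.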
   
   ### Component(s)
   
   C++

