felipecrv opened a new issue, #34237: URL: https://github.com/apache/arrow/issues/34237
### Describe the enhancement requested

## Hypothesis

Limiting the length of runs (not the length of logical arrays) should prevent multi-language integration pains at a very low storage/memory cost: compressing ~2B elements into a single run already yields a great compression factor. If we need to represent longer runs, we can simply append multiple `<= INT_MAX`-sized runs.

## The spec on array lengths

https://arrow.apache.org/docs/dev/format/Columnar.html#array-lengths

> Array lengths are represented in the Arrow metadata as a 64-bit signed integer. An implementation of Arrow is considered valid even if it only supports lengths up to the maximum 32-bit signed integer, though. If using Arrow in a multi-language environment, we recommend limiting lengths to 2^31 - 1 elements or less. Larger data sets can be represented using multiple array chunks.

The solution the spec proposes for languages that don't support 64-bit integers is to use multiple array chunks. Chunking the physical arrays of a run-end encoded logical array is much easier when we don't have to split in the middle of a run. So if runs are limited to `INT_MAX` elements, we only have to worry about the regular split by logical length.

### Component(s)

C++
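
To make the "append multiple runs" idea concrete, here is a minimal sketch in plain Python. It is not Arrow's API: `split_long_runs`, `INT32_MAX`, and the list-based representation are hypothetical names chosen for illustration. It assumes runs are given as (length, value) pairs and that run ends are cumulative logical end offsets, as in the run-end encoded layout.

```python
# Hypothetical illustration only; none of these names exist in Arrow.
INT32_MAX = 2**31 - 1


def split_long_runs(run_lengths, values, max_run_length=INT32_MAX):
    """Build (run_ends, values) so that no single run exceeds max_run_length.

    run_lengths: positive lengths of the original runs.
    values:      one value per original run.
    A run longer than max_run_length is emitted as several consecutive
    runs that repeat the same value.
    """
    if len(run_lengths) != len(values):
        raise ValueError("run_lengths and values must have the same length")
    if max_run_length < 1:
        raise ValueError("max_run_length must be at least 1")

    run_ends = []
    out_values = []
    logical_end = 0  # cumulative logical offset of the last emitted run
    for length, value in zip(run_lengths, values):
        if length <= 0:
            raise ValueError("run lengths must be positive")
        remaining = length
        while remaining > 0:
            piece = min(remaining, max_run_length)
            logical_end += piece
            run_ends.append(logical_end)
            out_values.append(value)
            remaining -= piece
    return run_ends, out_values


# A run of 5 billion "a" followed by 3 "b" becomes four bounded runs.
ends, vals = split_long_runs([5_000_000_000, 3], ["a", "b"])
print(ends)  # [2147483647, 4294967294, 5000000000, 5000000003]
print(vals)  # ['a', 'a', 'a', 'b']
```

In this example the 5-billion-element run costs only two extra (run end, value) pairs, which is the "very low storage/memory cost" mentioned above. The run ends themselves still exceed the 32-bit range here because the logical length does; bounding run lengths only guarantees that a chunk boundary never has to fall inside a run longer than `INT_MAX`.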
