alamb opened a new issue, #7699: URL: https://github.com/apache/arrow-rs/issues/7699
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

The Variant spec uses different numbers of bytes for encoding / writing small and large arrays. For example, for an array, the encoding looks like this (note that `num_elements` is either 1 or 4 bytes): https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-data-for-array-basic_type3

> The size in bytes of `num_elements` is indicated by `is_large` in the value_header.

Likewise, the number of bytes used for each `field_offset` depends on the total number of elements in the array:

```
                    7                       0
                    +-----------------------+
array value_data    |                       |
                    :     num_elements      :  <-- unsigned little-endian, 1 or 4 bytes
                    |                       |
                    +-----------------------+
                    |                       |
                    :      field_offset     :  <-- unsigned little-endian, `field_offset_size` bytes
                    |                       |
                    +-----------------------+
                                :
                    +-----------------------+
                    |                       |
                    :      field_offset     :  <-- unsigned little-endian, `field_offset_size` bytes
                    |                       |      (`num_elements + 1` field_offsets)
                    +-----------------------+
                    |                       |
                    :         value         :
                    |                       |
                    +-----------------------+
                                :
                    +-----------------------+
                    |                       |
                    :         value         :  <-- (`num_elements` values)
                    |                       |
                    +-----------------------+
```

As described by @scovich on @PinkCrow007's PR: https://github.com/apache/arrow-rs/pull/7653#discussion_r2147173482

> The value offset and field id arrays require either knowing the number of elements/fields to be created in advance (and then worrying about what happens if the caller builds too many/few entries afterward), or building the arrays in separate storage and then moving an arbitrarily large number of buffered bytes to make room for them after the fact.

A similar issue exists for Objects. Hopefully, by designing a pattern for Arrays, we will also have a way to implement it for `Objects`. (Sketches of the encoding rules and of both construction strategies appear at the end of this issue.)

**Describe the solution you'd like**

I would like:

1. Examples of creating Arrays with more than 256 values (the number of offsets that can be encoded in a `u8`)
2. APIs that allow efficient construction of such Array values

**Describe alternatives you've considered**

Maybe the builder can leave room for the list length, append the values, and then go back and update the length when the list is finished. This would get tricky for building "large" lists, as the size of the length field may not be known up front. (A sketch of this approach appears below.)

## Specialized Functions

We could also potentially introduce a function like `new_large_object()` for callers to hint up front that their object has many fields; if they use `new_object` but push too many values, fall back to copying. (See the API sketch below.)

I think many clients would have knowledge of the number of fields and could then decide on the appropriate API.
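## Sketches

To make the size rules above concrete, here is a minimal sketch (not arrow-rs code; the helper name is hypothetical) of how a writer could pick the header widths once it knows the element count and the total encoded size of the values. Per the spec, `is_large` controls whether `num_elements` takes 1 or 4 bytes, and each `field_offset` must be wide enough to hold the largest offset, which is the final offset (the total size of the values):

```rust
/// Hypothetical helper, not an arrow-rs API: choose `is_large` and
/// `field_offset_size` for a Variant array value.
fn array_header_widths(num_elements: usize, value_bytes: usize) -> (bool, usize) {
    // `num_elements` is written as 1 byte unless it exceeds u8::MAX,
    // in which case `is_large` is set and 4 bytes are used.
    let is_large = num_elements > u8::MAX as usize;
    // Each field_offset must hold the largest offset, i.e. the final
    // offset, which equals the total size of the encoded values.
    let field_offset_size = match value_bytes {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        0x1_0000..=0xFF_FFFF => 3,
        _ => 4,
    };
    (is_large, field_offset_size)
}
```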
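The "separate storage" strategy from the quoted comment might look roughly like this (names hypothetical; reuses `array_header_widths` from the previous sketch): child values are encoded into a scratch buffer first, so every count and width is known before the header is emitted, at the cost of copying the buffered bytes at the end:

```rust
/// Hypothetical writer, not an arrow-rs API. `offsets` has
/// `num_elements + 1` entries; the last equals `scratch.len()`.
fn finish_array(out: &mut Vec<u8>, scratch: &[u8], offsets: &[usize]) {
    let num_elements = offsets.len() - 1;
    let (is_large, offset_size) = array_header_widths(num_elements, scratch.len());

    // (value_header byte with basic_type=3, is_large, and
    //  field_offset_size_minus_one elided here; see the spec link above)
    if is_large {
        out.extend_from_slice(&(num_elements as u32).to_le_bytes());
    } else {
        out.push(num_elements as u8);
    }
    for &off in offsets {
        // each offset as `offset_size` unsigned little-endian bytes
        out.extend_from_slice(&(off as u32).to_le_bytes()[..offset_size]);
    }
    // the "arbitrarily large number of buffered bytes" being moved
    out.extend_from_slice(scratch);
}
```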
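For the back-patching alternative, a sketch (again hypothetical) that always reserves the 4-byte "large" `num_elements` slot so the length can be patched in when the array is finished. Note this only solves the length field; the `field_offset`s precede the values, so their count and width are still the tricky part:

```rust
/// Hypothetical builder, not an arrow-rs API: reserve the large (4-byte)
/// num_elements slot up front and patch it in finish(). Avoids re-buffering
/// values, but always pays for the large layout, and field_offset handling
/// is elided entirely.
struct ReservedLenArrayBuilder {
    buf: Vec<u8>,
    len_pos: usize, // position of the 4-byte num_elements placeholder
    num_elements: u32,
}

impl ReservedLenArrayBuilder {
    fn new(mut buf: Vec<u8>) -> Self {
        // (value_header byte elided; see the spec link above)
        let len_pos = buf.len();
        buf.extend_from_slice(&[0u8; 4]); // placeholder for num_elements
        Self { buf, len_pos, num_elements: 0 }
    }

    fn append_encoded_value(&mut self, value: &[u8]) {
        self.buf.extend_from_slice(value);
        self.num_elements += 1;
    }

    fn finish(mut self) -> Vec<u8> {
        // back-patch the placeholder now that the count is known
        self.buf[self.len_pos..self.len_pos + 4]
            .copy_from_slice(&self.num_elements.to_le_bytes());
        self.buf
    }
}
```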
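And for the up-front hint idea, the API shape could look something like this (`new_large_object` is the name floated above; everything else is hypothetical, not the arrow-rs builder):

```rust
/// Hypothetical types sketching a capacity-hint API.
struct VariantBuilder { buf: Vec<u8> }
struct ObjectBuilder<'a> { parent: &'a mut VariantBuilder, large: bool }

impl VariantBuilder {
    /// Small layout by default (1-byte num_fields); could fall back to
    /// copying into the large layout if too many fields are pushed.
    fn new_object(&mut self) -> ObjectBuilder<'_> {
        ObjectBuilder { parent: self, large: false }
    }

    /// Caller hints that many fields are coming, so the large layout
    /// (4-byte num_fields, wider offsets) is used from the start.
    fn new_large_object(&mut self) -> ObjectBuilder<'_> {
        ObjectBuilder { parent: self, large: true }
    }
}
```

Clients that already know their field count would call the appropriate constructor directly and never pay the fallback copy.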
