alamb commented on code in PR #6136: URL: https://github.com/apache/arrow-rs/pull/6136#discussion_r1693939057
########## arrow-array/src/builder/generic_bytes_view_builder.rs: ########## @@ -585,4 +617,46 @@ mod tests { "Invalid argument error: No block found with index 5" ); } + + #[test] + fn test_string_view_with_block_size_growth() { Review Comment: ❤️ ########## arrow-array/src/builder/generic_bytes_view_builder.rs: ########## @@ -78,15 +100,25 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> { null_buffer_builder: NullBufferBuilder::new(capacity), completed: vec![], in_progress: vec![], - block_size: DEFAULT_BLOCK_SIZE, + block_size: BlockSizeGrowthStrategy::Exponential { + current_size: STARTING_BLOCK_SIZE, + }, string_tracker: None, phantom: Default::default(), } } - /// Override the size of buffers to allocate for holding string data + /// The block size is the size of the buffer used to store the string data. + /// A new buffer will be allocated when the current buffer is full. + /// By default the builder try to keep the buffer count low by growing the size exponentially from 8KB up to 2MB. + /// This method instead set a fixed value to the buffer size, useful for advanced users that want to control the memory usage and buffer count. + /// Check <https://github.com/apache/arrow-rs/issues/6094> for more details on the implications. Review Comment: I obsessed about the documentation a little. what do you think of this (I can make a follow on PR too) ```suggestion /// Set a fixed buffer size for variable length strings /// /// The block size is the size of the buffer used to store values greater /// than 12 bytes. The builder allocates new buffers when the current /// buffer is full. /// /// By default the builder balances buffer size and buffer count by /// growing buffer size exponentially from 8KB up to 2MB. The /// first buffer allocated is 8KB, then 16KB, then 32KB, etc up to 2MB. /// /// If this method is used, any new buffers allocated are /// exactly this size. This can be useful for advanced users /// that want to control the memory usage and buffer count. /// /// See <https://github.com/apache/arrow-rs/issues/6094> for more details on the implications. ``` ########## arrow-array/src/builder/generic_bytes_view_builder.rs: ########## @@ -78,15 +100,25 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> { null_buffer_builder: NullBufferBuilder::new(capacity), completed: vec![], in_progress: vec![], - block_size: DEFAULT_BLOCK_SIZE, + block_size: BlockSizeGrowthStrategy::Exponential { + current_size: STARTING_BLOCK_SIZE, + }, string_tracker: None, phantom: Default::default(), } } - /// Override the size of buffers to allocate for holding string data + /// The block size is the size of the buffer used to store the string data. + /// A new buffer will be allocated when the current buffer is full. + /// By default the builder try to keep the buffer count low by growing the size exponentially from 8KB up to 2MB. + /// This method instead set a fixed value to the buffer size, useful for advanced users that want to control the memory usage and buffer count. + /// Check <https://github.com/apache/arrow-rs/issues/6094> for more details on the implications. pub fn with_block_size(self, block_size: u32) -> Self { Review Comment: Since we are now free to make API changes, maybe we could re-name this function `with_fixed_block_size` 🤔 Or even nicer would be to mark `with_block_size` deprecated (and call `with_fixed_block_size`) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org