alamb commented on code in PR #6136:
URL: https://github.com/apache/arrow-rs/pull/6136#discussion_r1693939057


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -585,4 +617,46 @@ mod tests {
             "Invalid argument error: No block found with index 5"
         );
     }
+
+    #[test]
+    fn test_string_view_with_block_size_growth() {

Review Comment:
   ❤️ 



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -78,15 +100,25 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
             null_buffer_builder: NullBufferBuilder::new(capacity),
             completed: vec![],
             in_progress: vec![],
-            block_size: DEFAULT_BLOCK_SIZE,
+            block_size: BlockSizeGrowthStrategy::Exponential {
+                current_size: STARTING_BLOCK_SIZE,
+            },
             string_tracker: None,
             phantom: Default::default(),
         }
     }
 
-    /// Override the size of buffers to allocate for holding string data
+    /// The block size is the size of the buffer used to store the string data.
+    /// A new buffer will be allocated when the current buffer is full.
+    /// By default the builder try to keep the buffer count low by growing the 
size exponentially from 8KB up to 2MB.
+    /// This method instead set a fixed value to the buffer size, useful for 
advanced users that want to control the memory usage and buffer count.
+    /// Check <https://github.com/apache/arrow-rs/issues/6094> for more 
details on the implications.

Review Comment:
   I obsessed about the documentation a little. what do you think of this (I 
can make a follow on PR too)
   
   ```suggestion
       /// Set a fixed buffer size for variable length strings
       ///
       /// The block size is the size of the buffer used to store values greater
       /// than 12 bytes. The builder allocates new buffers when the current 
       /// buffer is full.
       ///
       /// By default the builder balances buffer size and buffer count by
       /// growing buffer size exponentially from 8KB up to 2MB. The 
       /// first buffer allocated is 8KB, then 16KB, then 32KB, etc up to 2MB. 
       ///
       /// If this method is used, any new buffers allocated are  
       /// exactly this size. This can be useful for advanced users 
       /// that want to control the memory usage and buffer count.
       ///
       /// See <https://github.com/apache/arrow-rs/issues/6094> for more 
details on the implications.
   ```



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -78,15 +100,25 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
             null_buffer_builder: NullBufferBuilder::new(capacity),
             completed: vec![],
             in_progress: vec![],
-            block_size: DEFAULT_BLOCK_SIZE,
+            block_size: BlockSizeGrowthStrategy::Exponential {
+                current_size: STARTING_BLOCK_SIZE,
+            },
             string_tracker: None,
             phantom: Default::default(),
         }
     }
 
-    /// Override the size of buffers to allocate for holding string data
+    /// The block size is the size of the buffer used to store the string data.
+    /// A new buffer will be allocated when the current buffer is full.
+    /// By default the builder try to keep the buffer count low by growing the 
size exponentially from 8KB up to 2MB.
+    /// This method instead set a fixed value to the buffer size, useful for 
advanced users that want to control the memory usage and buffer count.
+    /// Check <https://github.com/apache/arrow-rs/issues/6094> for more 
details on the implications.
     pub fn with_block_size(self, block_size: u32) -> Self {

Review Comment:
   Since we are now free to make API changes, maybe we could re-name this 
function `with_fixed_block_size` 🤔 
   
   Or even nicer would be to mark `with_block_size` deprecated (and call 
`with_fixed_block_size`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to