neilconway opened a new issue, #9435:
URL: https://github.com/apache/arrow-rs/issues/9435

   ## Describe the feature
   
   `GenericByteViewArray` has `total_buffer_bytes_used()`, which sums the byte 
lengths of strings stored out-of-line in data buffers (i.e., strings > 12 
bytes). However, there is no method that returns the total byte length of 
**all** strings, including those inlined in the views (≤ 12 bytes).
   
   For `GenericByteArray` (e.g., `StringArray`), the equivalent is trivially 
available via `values().len()`, which returns the total byte length of all 
string data in O(1).
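   
   As a rough illustration of the gap, the sketch below models a view as a
   `u128` whose low 32 bits hold the length, following the Arrow view layout,
   and compares the two sums. It uses only the standard library. `make_view`,
   `out_of_line_bytes`, and `all_bytes` are hypothetical helpers, not arrow-rs
   APIs, and the buffer index and offset fields are simply left at zero.
   
   ```rust
   /// Longest string that fits inline in a 16-byte view.
   const MAX_INLINE_LEN: usize = 12;

   /// Hypothetical helper: packs a string into a little-endian 16-byte view.
   /// Bytes 0..4 hold the length; short strings are stored inline after it,
   /// long strings keep only a 4-byte prefix (buffer index/offset stay zero
   /// in this sketch).
   fn make_view(s: &str) -> u128 {
       let bytes = s.as_bytes();
       let mut raw = [0u8; 16];
       raw[0..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
       if bytes.len() <= MAX_INLINE_LEN {
           raw[4..4 + bytes.len()].copy_from_slice(bytes);
       } else {
           raw[4..8].copy_from_slice(&bytes[..4]);
       }
       u128::from_le_bytes(raw)
   }

   /// Models the out-of-line sum: only strings longer than 12 bytes count.
   fn out_of_line_bytes(views: &[u128]) -> usize {
       views
           .iter()
           .map(|v| (*v as u32) as usize)
           .filter(|len| *len > MAX_INLINE_LEN)
           .sum()
   }

   /// Counts every string, inlined or not.
   fn all_bytes(views: &[u128]) -> usize {
       views.iter().map(|v| (*v as u32) as usize).sum()
   }
   ```
   
   For the inputs `"hi"` and `"a much longer string"`, `out_of_line_bytes`
   gives 20 while `all_bytes` gives 22. For an array of only short strings
   the first sum is 0.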
   
   ## Use Case
   
   In [Apache DataFusion](https://github.com/apache/datafusion), we use this 
kind of size estimate to pre-allocate output buffers when implementing string 
functions like `concat()` and `concat_ws()`. When the input is a 
`StringViewArray`, we currently use `total_buffer_bytes_used()` as a capacity 
hint, but this underestimates whenever the array contains short (inlined) 
strings, and returns 0 if all strings are inlined.
   
   See discussion: https://github.com/apache/datafusion/pull/20317
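   
   The effect on pre-allocation can be sketched without Arrow at all. The
   following is a minimal std-only model, not the DataFusion implementation:
   `out_of_line_hint`, `full_hint`, and `concat_with_hint` are hypothetical
   names, and the inputs are plain `&str` slices standing in for a
   `StringViewArray`.
   
   ```rust
   /// Capacity hint that only counts strings longer than 12 bytes,
   /// mirroring what an out-of-line-only sum would report.
   fn out_of_line_hint(inputs: &[&str]) -> usize {
       inputs.iter().map(|s| s.len()).filter(|len| *len > 12).sum()
   }

   /// Capacity hint that counts every string.
   fn full_hint(inputs: &[&str]) -> usize {
       inputs.iter().map(|s| s.len()).sum()
   }

   /// Concatenates all inputs into a buffer pre-allocated with
   /// `capacity_hint` bytes. The returned flag reports whether the
   /// buffer had to grow beyond its initial capacity.
   fn concat_with_hint(inputs: &[&str], capacity_hint: usize) -> (String, bool) {
       let mut out = String::with_capacity(capacity_hint);
       let initial_capacity = out.capacity();
       for s in inputs {
           out.push_str(s);
       }
       let grew = out.capacity() > initial_capacity;
       (out, grew)
   }
   ```
   
   With the inputs `"foo"`, `"bar"`, `"ba"`, `out_of_line_hint` is 0 and
   `full_hint` is 8. Concatenating with the first hint forces the buffer to
   grow, while the second hint avoids any reallocation.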
   
   ## Proposed API
   
   A method on `GenericByteViewArray` such as:
   
   ```rust
   /// Returns the total byte length of all strings in the array,
   /// including both inlined and out-of-line strings.
   pub fn total_bytes_len(&self) -> usize {
       self.views()
           .iter()
           // The low 32 bits of each `u128` view hold the string length.
           .map(|v| (*v as u32) as usize)
           .sum()
   }
   ```
   
   This would be the `GenericByteViewArray` equivalent of 
`GenericByteArray::values().len()`, although it is O(n) in the number of views 
rather than O(1).
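   
   For comparison, the reason the offset-based layout gets this total for
   free can be sketched with a small std-only model. `OffsetStrings` is a
   hypothetical type for illustration, not the arrow-rs `GenericByteArray`;
   it only assumes the usual layout of one contiguous values buffer plus
   n + 1 offsets.
   
   ```rust
   /// Hypothetical model of an offset-based string array.
   struct OffsetStrings {
       offsets: Vec<i32>,
       values: Vec<u8>,
   }

   impl OffsetStrings {
       /// Appends every string to one contiguous buffer and records
       /// where each one ends.
       fn from_strs(strs: &[&str]) -> Self {
           let mut offsets = vec![0i32];
           let mut values = Vec::new();
           for s in strs {
               values.extend_from_slice(s.as_bytes());
               offsets.push(values.len() as i32);
           }
           Self { offsets, values }
       }

       /// Number of strings: one fewer than the number of offsets.
       fn len(&self) -> usize {
           self.offsets.len() - 1
       }

       /// O(1): the values buffer already holds every byte back to back.
       fn total_bytes(&self) -> usize {
           self.values.len()
       }
   }
   ```
   
   A view array has no such single buffer covering inlined strings, so the
   total has to be accumulated from the per-view length fields.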

