This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git


The following commit(s) were added to refs/heads/main by this push:
     new 3e089d2c26 Perf: optimize actual_buffer_size to use only data buffer 
capacity for coalesce (#7967)
3e089d2c26 is described below

commit 3e089d2c26e036b42246351153e4724c71c46e2d
Author: Qi Zhu <[email protected]>
AuthorDate: Thu Jul 24 01:06:48 2025 +0800

    Perf: optimize actual_buffer_size to use only data buffer capacity for 
coalesce (#7967)
    
    # Which issue does this PR close?
    
    This is a very interesting idea: only count the data buffer size when
    deciding whether to gc, because gc only matters for the data buffers,
    not for the view or null buffers.
    
    Since gc compacts only the data buffers, the *2 sparseness check
    should compare against the data buffer size as well.
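    The resulting heuristic can be roughly sketched as follows. This is a
    simplified, hypothetical standalone version (the function name and
    signature are illustrative only; the real logic lives in
    arrow-select/src/coalesce/byte_view.rs, see the diff below):
    
    ```rust
    // Hypothetical sketch of the sparseness check: `ideal_buffer_size` is the
    // number of bytes the views actually reference, and `data_buffer_capacities`
    // are the capacities of the data buffers alone. View and null buffers are
    // deliberately excluded, since gc never compacts them.
    fn needs_gc(ideal_buffer_size: usize, data_buffer_capacities: &[usize]) -> bool {
        // Sum only the data buffer capacities, not get_buffer_memory_size().
        let actual_buffer_size: usize = data_buffer_capacities.iter().sum();
        // Copying strings is expensive, so only gc if the array is sparse
        // (uses at least 2x the memory it needs).
        actual_buffer_size > ideal_buffer_size * 2
    }
    
    fn main() {
        // Dense array: capacity is close to what the views reference, no gc.
        assert!(!needs_gc(1000, &[1024]));
        // Sparse array: buffers hold far more than the views reference, gc.
        assert!(needs_gc(1000, &[4096, 4096]));
    }
    ```
    
    Excluding the view and null buffers from the comparison is what avoids
    triggering gc for arrays whose memory is dominated by views rather than
    string data.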
    
    
    # Rationale for this change
    
    Optimize actual_buffer_size to use only the data buffer capacity when
    deciding whether to gc.
    
    # What changes are included in this PR?
    
    actual_buffer_size for coalesce now sums only the data buffer
    capacities instead of calling get_buffer_memory_size.
    
    # Are these changes tested?
    
    The performance improvement for some high-selectivity benchmarks with
    a low null ratio is substantial, about 2x faster:
    
    ```rust
    cargo bench --bench coalesce_kernels "single_utf8view"
       Compiling arrow-select v55.2.0 (/Users/zhuqi/arrow-rs/arrow-select)
       Compiling arrow-cast v55.2.0 (/Users/zhuqi/arrow-rs/arrow-cast)
       Compiling arrow-string v55.2.0 (/Users/zhuqi/arrow-rs/arrow-string)
       Compiling arrow-ord v55.2.0 (/Users/zhuqi/arrow-rs/arrow-ord)
       Compiling arrow-csv v55.2.0 (/Users/zhuqi/arrow-rs/arrow-csv)
       Compiling arrow-json v55.2.0 (/Users/zhuqi/arrow-rs/arrow-json)
       Compiling arrow v55.2.0 (/Users/zhuqi/arrow-rs/arrow)
        Finished `bench` profile [optimized] target(s) in 13.26s
         Running benches/coalesce_kernels.rs 
(target/release/deps/coalesce_kernels-bb9750abedb10ad6)
    filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001
                            time:   [30.946 ms 31.071 ms 31.193 ms]
                            change: [−1.7086% −1.1581% −0.6036%] (p = 0.00 < 
0.05)
                            Change within noise threshold.
    Found 5 outliers among 100 measurements (5.00%)
      4 (4.00%) low mild
      1 (1.00%) high mild
    
    filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01
                            time:   [3.8178 ms 3.8311 ms 3.8444 ms]
                            change: [−4.0521% −3.5467% −3.0345%] (p = 0.00 < 
0.05)
                            Performance has improved.
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) low mild
    
    Benchmarking filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1: 
Warming up for 3.0000 s
    Warning: Unable to complete 100 samples in 5.0s. You may wish to increase 
target time to 9.9s, enable flat sampling, or reduce sample count to 40.
    filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1
                            time:   [1.9337 ms 1.9406 ms 1.9478 ms]
                            change: [+0.3699% +0.9557% +1.5666%] (p = 0.00 < 
0.05)
                            Change within noise threshold.
    Found 5 outliers among 100 measurements (5.00%)
      2 (2.00%) low mild
      3 (3.00%) high severe
    
    filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8
                            time:   [797.60 µs 805.31 µs 813.85 µs]
                            change: [−59.177% −58.412% −57.639%] (p = 0.00 < 
0.05)
                            Performance has improved.
    Found 2 outliers among 100 measurements (2.00%)
      1 (1.00%) high mild
      1 (1.00%) high severe
    
    filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001
                            time:   [43.742 ms 43.924 ms 44.108 ms]
                            change: [−1.2146% −0.5778% +0.0828%] (p = 0.08 > 
0.05)
                            No change in performance detected.
    
    filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01
                            time:   [5.5736 ms 5.5987 ms 5.6247 ms]
                            change: [−0.2381% +0.4740% +1.1711%] (p = 0.18 > 
0.05)
                            No change in performance detected.
    
    filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1
                            time:   [2.2963 ms 2.3035 ms 2.3109 ms]
                            change: [−0.9314% −0.5125% −0.0931%] (p = 0.02 < 
0.05)
                            Change within noise threshold.
    
    Benchmarking filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8: 
Warming up for 3.0000 s
    Warning: Unable to complete 100 samples in 5.0s. You may wish to increase 
target time to 8.1s, enable flat sampling, or reduce sample count to 50.
    filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8
                            time:   [1.5482 ms 1.5697 ms 1.5903 ms]
                            change: [−45.794% −44.386% −43.000%] (p = 0.00 < 
0.05)
                            Performance has improved.
    
    ```
    
    
    
    ---------
    
    Co-authored-by: Andrew Lamb <[email protected]>
---
 arrow-select/src/coalesce.rs           | 42 ++++++++++++++++++----------------
 arrow-select/src/coalesce/byte_view.rs |  5 +++-
 2 files changed, 26 insertions(+), 21 deletions(-)

diff --git a/arrow-select/src/coalesce.rs b/arrow-select/src/coalesce.rs
index 37741de3bc..891d62fc3a 100644
--- a/arrow-select/src/coalesce.rs
+++ b/arrow-select/src/coalesce.rs
@@ -785,21 +785,27 @@ mod tests {
 
     #[test]
     fn test_string_view_many_small_compact() {
-        // The strings are 28 long, so each batch has 400 * 28 = 5600 bytes
+        // 200 rows alternating long (28) and short (≤12) strings.
+        // Only the 100 long strings go into data buffers: 100 × 28 = 2800.
         let batch = stringview_batch_repeated(
-            400,
+            200,
             [Some("This string is 28 bytes long"), Some("small string")],
         );
         let output_batches = Test::new()
             // First allocated buffer is 8kb.
-            // Appending five batches of 5600 bytes will use 5600 * 5 = 28kb 
(8kb, an 16kb and 32kbkb)
+            // Appending 10 batches of 2800 bytes will use 2800 * 10 = 28kb (8kb, 16kb and 32kb buffers)
+            .with_batch(batch.clone())
+            .with_batch(batch.clone())
+            .with_batch(batch.clone())
+            .with_batch(batch.clone())
+            .with_batch(batch.clone())
             .with_batch(batch.clone())
             .with_batch(batch.clone())
             .with_batch(batch.clone())
             .with_batch(batch.clone())
             .with_batch(batch.clone())
             .with_batch_size(8000)
-            .with_expected_output_sizes(vec![2000]) // only 2000 rows total
+            .with_expected_output_sizes(vec![2000]) // only 2000 rows total (10 batches x 200 rows)
             .run();
 
         // expect a nice even distribution of buffers
@@ -854,14 +860,14 @@ mod tests {
 
     #[test]
     fn test_string_view_large_small() {
-        // The strings are 37 bytes long, so each batch has 200 * 28 = 5600 
bytes
+        // The long strings are 28 bytes each, so each batch has 100 * 28 = 2800 bytes
         let mixed_batch = stringview_batch_repeated(
-            400,
+            200,
             [Some("This string is 28 bytes long"), Some("small string")],
         );
         // These strings aren't copied, this array has an 8k buffer
         let all_large = stringview_batch_repeated(
-            100,
+            50,
             [Some(
                 "This buffer has only large strings in it so there are no 
buffer copies",
             )],
@@ -869,7 +875,12 @@ mod tests {
 
         let output_batches = Test::new()
             // First allocated buffer is 8kb.
-            // Appending five batches of 5600 bytes will use 5600 * 5 = 28kb 
(8kb, an 16kb and 32kbkb)
+            // Appending ten mixed batches of 2800 bytes will use 2800 * 10 = 28kb (8kb, 16kb and 32kb buffers)
+            .with_batch(mixed_batch.clone())
+            .with_batch(mixed_batch.clone())
+            .with_batch(all_large.clone())
+            .with_batch(mixed_batch.clone())
+            .with_batch(all_large.clone())
             .with_batch(mixed_batch.clone())
             .with_batch(mixed_batch.clone())
             .with_batch(all_large.clone())
@@ -883,26 +894,17 @@ mod tests {
             col_as_string_view("c0", output_batches.first().unwrap()),
             vec![
                 ExpectedLayout {
-                    len: 8176,
+                    len: 8190,
                     capacity: 8192,
                 },
-                // this buffer was allocated but not used when the all_large 
batch was pushed
                 ExpectedLayout {
-                    len: 3024,
+                    len: 16366,
                     capacity: 16384,
                 },
                 ExpectedLayout {
-                    len: 7000,
-                    capacity: 8192,
-                },
-                ExpectedLayout {
-                    len: 5600,
+                    len: 6244,
                     capacity: 32768,
                 },
-                ExpectedLayout {
-                    len: 7000,
-                    capacity: 8192,
-                },
             ],
         );
     }
diff --git a/arrow-select/src/coalesce/byte_view.rs 
b/arrow-select/src/coalesce/byte_view.rs
index 00b2210cb8..6d3bcc8ae0 100644
--- a/arrow-select/src/coalesce/byte_view.rs
+++ b/arrow-select/src/coalesce/byte_view.rs
@@ -284,7 +284,10 @@ impl<B: ByteViewType> InProgressArray for 
InProgressByteViewArray<B> {
                 (false, 0)
             } else {
                 let ideal_buffer_size = s.total_buffer_bytes_used();
-                let actual_buffer_size = s.get_buffer_memory_size();
+                // We don't use get_buffer_memory_size here, because gc is for 
the contents of the
+                // data buffers, not views and nulls.
+                let actual_buffer_size =
+                    s.data_buffers().iter().map(|b| 
b.capacity()).sum::<usize>();
                 // copying strings is expensive, so only do it if the array is
                 // sparse (uses at least 2x the memory it needs)
                 let need_gc =
