teh-cmc commented on PR #6300:
URL: https://github.com/apache/arrow-rs/pull/6300#issuecomment-2323002688

   > I am concerned that this change will decrease performance by forcing an 
extra copy in all situations, though I may not understand the implications
   
   I wouldn't expect this to have a noticeable impact on performance, for two reasons:
   1. `MutableBuffer` already reallocates memory and copies data around as it 
grows in any case (that's the root of this issue to begin with, after all).
   2. I don't think the extra copy is a guarantee -- see my answer to the next 
question.
   
   > Does this force a copy in the case where the capacity is larger than the actual need?
   
   As far as I understand from reading the code, all of this machinery ultimately bottoms out in `realloc()`.
   Whether that actually reallocates memory and copies data around is therefore unspecified: it is left as an implementation detail of whichever `GlobalAlloc` implementation is currently in use.
   
   I'd expect any decent allocator to do whatever is most efficient depending 
on the exact situation at hand.
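   To make the point concrete, here is a minimal sketch (using a plain `Vec`, not `MutableBuffer`, so the buffer name and sizes are illustrative only) showing that shrinking a buffer goes through the global allocator, which is free to shrink in place rather than copy:

   ```rust
   fn main() {
       // A growable buffer that over-allocates, then shrinks -- the same
       // shape of operation as shrinking a `MutableBuffer` to fit.
       let mut buf: Vec<u8> = Vec::with_capacity(1024);
       buf.extend_from_slice(&[0u8; 100]);
       assert!(buf.capacity() >= 1024);

       let ptr_before = buf.as_ptr();
       // `shrink_to_fit` bottoms out in the global allocator's realloc path.
       buf.shrink_to_fit();
       assert!(buf.capacity() >= buf.len());

       // Whether the data physically moved is entirely up to the allocator:
       // a good one will shrink in place when it can.
       let moved = ptr_before != buf.as_ptr();
       println!("capacity after shrink: {}, moved: {}", buf.capacity(), moved);
   }
   ```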
   
   > Have you considered making the initial capacity calculations more accurate 
(I am not sure this is possible) so the delta between capacity and actual is 
lower
   
   That doesn't seem possible given the nature of `MutableBuffer` and how it is 
used across the different layers of the codebase.
   
   > This PR currently seems to have some CI failures
   
   I've updated the tests to reflect the new expected capacity values, so they 
should now pass.
   
   Beware, though, that I am changing these values with absolutely zero knowledge of the context in which these tests were written: it would be fair to say that I have effectively no idea what I'm doing.
   
   ---
   
   That's it for the theory... now for the practice.
   
   I have found two benchmark suites relevant to this change:
   * `--bench concatenate_kernel`
   * `--bench mutable_array`
   
   ### `--bench concatenate_kernel`
   
   Performance looks roughly on par with `master` for this benchmark suite. In fact, it even looks like a slight overall improvement, for some reason.
   
   ```
   taskset -c 7 cargo bench -p arrow --bench concatenate_kernel 
--features="test_utils"
   
   concat i32 1024         time:   [453.90 ns 454.31 ns 454.75 ns]
                           change: [-3.7808% -3.5907% -3.4059%] (p = 0.00 < 
0.05)
                           Performance has improved.
   
   concat i32 nulls 1024   time:   [652.64 ns 652.88 ns 653.18 ns]
                           change: [+0.0226% +0.2489% +0.4726%] (p = 0.03 < 
0.05)
                           Change within noise threshold.
   
   concat 1024 arrays i32 4
                           time:   [109.22 µs 109.38 µs 109.57 µs]
                           change: [-7.2900% -6.8379% -6.4195%] (p = 0.00 < 
0.05)
                           Performance has improved.
   
   concat str 1024         time:   [9.3471 µs 9.3609 µs 9.3764 µs]
                           change: [-0.1653% +0.4566% +1.0820%] (p = 0.16 > 
0.05)
                           No change in performance detected.
   
   concat str nulls 1024   time:   [5.7129 µs 5.7328 µs 5.7480 µs]
                           change: [-4.0256% -3.3280% -2.6276%] (p = 0.00 < 
0.05)
                           Performance has improved.
   
   concat str_dict 1024    time:   [2.4110 µs 2.4143 µs 2.4177 µs]
                           change: [+2.5143% +2.6708% +2.8211%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   
   concat str_dict_sparse 1024
                           time:   [4.7248 µs 4.7301 µs 4.7371 µs]
                           change: [+1.7168% +1.8403% +1.9509%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   
   concat str nulls 1024 #2
                           time:   [5.7316 µs 5.7364 µs 5.7415 µs]
                           change: [-3.8916% -3.0466% -2.2002%] (p = 0.00 < 
0.05)
                           Performance has improved.
   
   concat fixed size lists time:   [136.77 µs 136.97 µs 137.18 µs]
                           change: [-0.5076% -0.1102% +0.2847%] (p = 0.63 > 
0.05)
                           No change in performance detected.
   ```
   
   ### `--bench mutable_array` 
   
   This one is a different beast -- it is _much_ slower on this branch:
   ```
   mutable str 1024        time:   [31.213 ms 31.302 ms 31.420 ms]
                           change: [+73.378% +74.185% +74.973%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   
   mutable str nulls 1024  time:   [11.210 ms 11.273 ms 11.376 ms]
                           change: [-13.933% -13.406% -12.652%] (p = 0.00 < 
0.05)
                           Performance has improved.
   ```
   which means my first assumption above was wrong: you can certainly feel the extra copy if the right conditions are met, despite `MutableBuffer` already doing a bunch of copies of its own.
   
   That's with the default system allocator though, which is notoriously not a 
good idea... Running these benchmarks with `mimalloc` tells a very different 
story:
   ```diff
   diff --git a/arrow/benches/mutable_array.rs b/arrow/benches/mutable_array.rs
   index b04e5cd84..d9b388e68 100644
   --- a/arrow/benches/mutable_array.rs
   +++ b/arrow/benches/mutable_array.rs
   @@ -15,6 +15,11 @@
    // specific language governing permissions and limitations
    // under the License.
    
   +use mimalloc::MiMalloc;
   +
   +#[global_allocator]
   +static GLOBAL: MiMalloc = MiMalloc;
   +
    #[macro_use]
    extern crate criterion;
    use criterion::Criterion;
   ```
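   (For anyone reproducing this: the diff above also needs `mimalloc` declared as a dependency of the benchmarks. Something along these lines in `arrow/Cargo.toml` -- the exact version spec here is my guess, not something the PR pins:)

   ```toml
   [dev-dependencies]
   mimalloc = "0.1"
   ```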
   
   Results:
   ```
   mutable str 1024        time:   [11.755 ms 11.790 ms 11.825 ms]
                           change: [+0.1332% +0.5256% +0.9202%] (p = 0.01 < 
0.05)
                           Change within noise threshold.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   mutable str nulls 1024  time:   [5.8176 ms 5.8273 ms 5.8385 ms]
                           change: [-1.0864% -0.7649% -0.4585%] (p = 0.00 < 
0.05)
                           Change within noise threshold.
   ```
   which means my second assumption was correct: it is left to the allocator 
implementation to do the right thing.
   

