tustvold commented on PR #4819:
URL: https://github.com/apache/arrow-rs/pull/4819#issuecomment-1721514440
Now that I got the benchmarks working correctly they make a lot more sense
:facepalm:...
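With a low-cardinality dictionary consisting of 10 distinct values, the
method in this PR still yields a performance advantage with moderate-sized
strings of less than 30 bytes: roughly 7-11% for convert_columns and
convert_columns_prepared, and about 41-42% for convert_rows. This relationship
inverts once we get to strings of 100 bytes, where convert_columns and
convert_columns_prepared regress by roughly 39-40%, although convert_rows
still improves by about 11%.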
```
convert_columns 4096 string_dictionary_low_cardinality(10, 0)
time: [39.115 µs 39.128 µs 39.143 µs]
change: [-7.2399% -6.9400% -6.6487%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) high mild
6 (6.00%) high severe
convert_columns_prepared 4096 string_dictionary_low_cardinality(10, 0)
time: [37.562 µs 37.568 µs 37.574 µs]
change: [-8.4250% -8.1655% -7.8921%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
convert_rows 4096 string_dictionary_low_cardinality(10, 0)
time: [61.337 µs 61.348 µs 61.362 µs]
change: [-41.300% -41.123% -40.951%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) low mild
2 (2.00%) high mild
4 (4.00%) high severe
convert_columns 4096 string_dictionary_low_cardinality(30, 0)
time: [39.096 µs 39.104 µs 39.113 µs]
change: [-7.1129% -6.9915% -6.7990%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) high mild
6 (6.00%) high severe
convert_columns_prepared 4096 string_dictionary_low_cardinality(30, 0)
time: [37.734 µs 37.741 µs 37.749 µs]
change: [-11.792% -10.671% -9.5985%] (p = 0.00 < 0.05)
Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) low mild
5 (5.00%) high mild
5 (5.00%) high severe
convert_rows 4096 string_dictionary_low_cardinality(30, 0)
time: [60.281 µs 60.290 µs 60.300 µs]
change: [-42.061% -41.997% -41.879%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) high mild
4 (4.00%) high severe
convert_columns 4096 string_dictionary_low_cardinality(100, 0)
time: [58.589 µs 58.609 µs 58.636 µs]
change: [+38.632% +39.044% +39.486%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low mild
3 (3.00%) high mild
6 (6.00%) high severe
convert_columns_prepared 4096 string_dictionary_low_cardinality(100, 0)
time: [57.292 µs 57.302 µs 57.313 µs]
change: [+39.663% +39.880% +40.261%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) low mild
3 (3.00%) high severe
convert_rows 4096 string_dictionary_low_cardinality(100, 0)
time: [91.827 µs 91.914 µs 92.020 µs]
change: [-11.573% -11.484% -11.406%] (p = 0.00 < 0.05)
Performance has improved.
```
This is in line with my expectations: in the ideal case of a small dictionary
containing large strings, the interning logic does provide a benefit. I'm not
sure how common this case is.
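
To make the trade-off concrete, here is a minimal sketch of why interning pays off for a small dictionary of large strings. It is not the arrow-rs row format and not the code in this PR. It assumes the alternative to interning is copying each value's bytes into every row. `encode_direct` and `encode_interned` are hypothetical helpers, and the ids are assigned in insertion order, so unlike a real row encoding they do not preserve sort order.

```rust
use std::collections::HashMap;

/// Hypothetical "no interning" encoder: copy each row's value bytes into the output.
/// The cost grows with rows * value length.
fn encode_direct(keys: &[usize], values: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for &k in keys {
        out.extend_from_slice(values[k].as_bytes());
    }
    out
}

/// Hypothetical interning encoder: each distinct value is stored once and every
/// row carries only a fixed-width 4-byte id. Returns the encoded rows and the
/// number of value bytes held by the interner.
/// Assumption: ids follow insertion order, so they are NOT order-preserving.
fn encode_interned(keys: &[usize], values: &[&str]) -> (Vec<u8>, usize) {
    let mut ids: HashMap<&str, u32> = HashMap::new();
    let mut out = Vec::with_capacity(keys.len() * 4);
    for &k in keys {
        // The next free id is only used if this value has not been seen yet.
        let next = ids.len() as u32;
        let id = *ids.entry(values[k]).or_insert(next);
        out.extend_from_slice(&id.to_be_bytes());
    }
    let interned_bytes: usize = ids.keys().map(|v| v.len()).sum();
    (out, interned_bytes)
}
```

With 4096 rows drawn from 10 distinct 100-byte values, the direct path copies 409,600 bytes, while the interned path stores 1,000 bytes of values plus 16,384 bytes of ids. With short strings the per-row saving shrinks and the hashing overhead starts to dominate, which is consistent with the direction of the numbers above.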