tustvold commented on PR #3558:
URL: https://github.com/apache/arrow-rs/pull/3558#issuecomment-1398812686
I think this is now ready for review; benchmarks suggest re-encoding
dictionaries takes on the order of ~10µs for 1024 rows.
The difficulty is that the old method is very fast most of the time, on the
order of 1µs, and so always re-encoding would result in a significant
performance regression in the common case. To avoid this, I've added a fairly
pessimistic heuristic that will only re-encode if the concatenation of the
dictionary values would end up with more elements than the output has keys.
This helps to avoid the worst case most of the time, and I think it is an
acceptable trade-off, but I welcome other thoughts.
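For illustration, the heuristic boils down to a simple size comparison. This is a hedged sketch, not the actual arrow-rs implementation; `should_merge`, its parameters, and the call sites are all hypothetical names:

```rust
/// Sketch of the pessimistic heuristic described above (illustrative only):
/// merge (re-encode) the dictionaries only when simply concatenating their
/// values would yield more elements than the output has keys, i.e. when
/// re-encoding is guaranteed not to grow the result.
fn should_merge(dictionary_value_counts: &[usize], output_len: usize) -> bool {
    // Total number of values if the dictionaries were naively concatenated
    let total_values: usize = dictionary_value_counts.iter().sum();
    total_values > output_len
}

fn main() {
    // Two small dictionaries of 20 values each, interleaved into 1024 rows:
    // 40 <= 1024, so the fast path is taken and no re-encoding occurs.
    assert!(!should_merge(&[20, 20], 1024));

    // Two large, sparse dictionaries interleaved into only 100 rows:
    // 2000 > 100, so re-encoding can only shrink the output dictionary.
    assert!(should_merge(&[1000, 1000], 100));
}
```

Because the check only compares counts, it costs essentially nothing on the fast path, which is why the same-dictionary benchmarks below stay close to the old numbers.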
Interleave with the same dictionary (which won't merge)
```
interleave dict(20, 0.0) 100 [0..100, 100..230, 450..1000]
time: [1.6196 µs 1.6204 µs 1.6212 µs]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
interleave dict(20, 0.0) 400 [0..100, 100..230, 450..1000]
time: [4.4878 µs 4.4911 µs 4.4946 µs]
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000]
time: [10.931 µs 10.937 µs 10.944 µs]
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]
time: [10.675 µs 10.678 µs 10.681 µs]
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
interleave dict_sparse(20, 0.0) 100 [0..100, 100..230, 450..1000]
time: [1.6390 µs 1.6399 µs 1.6407 µs]
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) low mild
2 (2.00%) high mild
1 (1.00%) high severe
interleave dict_sparse(20, 0.0) 400 [0..100, 100..230, 450..1000]
time: [4.5065 µs 4.5080 µs 4.5096 µs]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000]
time: [11.008 µs 11.014 µs 11.020 µs]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]
time: [10.684 µs 10.688 µs 10.692 µs]
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
4 (4.00%) high mild
1 (1.00%) high severe
```
Interleave with different dictionaries where one is sparse (will merge)
```
interleave dict_distinct 100
time: [5.2581 µs 5.2605 µs 5.2631 µs]
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) high mild
4 (4.00%) high severe
interleave dict_distinct 1024
time: [5.2225 µs 5.2256 µs 5.2289 µs]
Found 8 outliers among 100 measurements (8.00%)
5 (5.00%) high mild
3 (3.00%) high severe
interleave dict_distinct 2048
time: [5.0365 µs 5.0419 µs 5.0478 µs]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
```
The concatenate kernel is typically much faster, and so merging does
represent a non-trivial performance regression. (Note this is concatenating
two arrays of 1024 rows each, whereas the interleave benchmarks above produce
an output of 1024 rows.)
```
concat str_dict 1024 time: [2.7670 µs 2.7692 µs 2.7711 µs]
concat str_dict_sparse 1024
time: [21.122 µs 21.136 µs 21.151 µs]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]