tustvold commented on PR #3558:
URL: https://github.com/apache/arrow-rs/pull/3558#issuecomment-1398812686

   I think this is now ready for review, benchmarks suggest re-encoding 
dictionaries is on the order of ~10µs for 1024 rows.
   
   The difficulty is that the old method is very fast most of the time, on the 
order of 1µs, so always re-encoding would result in a significant performance 
regression in the common case. To avoid this, I've added a fairly pessimistic 
heuristic that only re-encodes if the concatenation of the dictionary values 
would end up with more elements than the output has keys. 
   
   This avoids the worst case most of the time, which I think is an acceptable 
trade-off, but I welcome other thoughts.
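   To illustrate, a minimal sketch of the heuristic described above (the 
function and names here are illustrative, not the actual arrow-rs code):

   ```rust
   /// Hypothetical sketch: decide whether to re-encode (merge) dictionaries.
   ///
   /// Merging is only worthwhile when the naive path — concatenating the
   /// dictionary values of every input — would yield more values than the
   /// output has keys, i.e. the combined dictionary could not possibly be
   /// fully referenced and is likely sparse.
   fn should_merge_dictionaries(dictionary_value_counts: &[usize], output_len: usize) -> bool {
       let total_values: usize = dictionary_value_counts.iter().sum();
       total_values > output_len
   }

   fn main() {
       // Two small dictionaries of 20 values interleaved into 1024 rows:
       // 40 combined values <= 1024 output keys, so take the fast path.
       assert!(!should_merge_dictionaries(&[20, 20], 1024));

       // Two sparse dictionaries of 1000 values interleaved into 100 rows:
       // 2000 combined values > 100 output keys, so re-encoding pays off.
       assert!(should_merge_dictionaries(&[1000, 1000], 100));

       println!("heuristic sketch ok");
   }
   ```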
   
   Interleave with the same dictionary (which won't merge)
   ```
   interleave dict(20, 0.0) 100 [0..100, 100..230, 450..1000]
                           time:   [1.6196 µs 1.6204 µs 1.6212 µs]
   Found 3 outliers among 100 measurements (3.00%)
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   interleave dict(20, 0.0) 400 [0..100, 100..230, 450..1000]
                           time:   [4.4878 µs 4.4911 µs 4.4946 µs]
   Found 4 outliers among 100 measurements (4.00%)
     3 (3.00%) high mild
     1 (1.00%) high severe
   
   interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000]
                           time:   [10.931 µs 10.937 µs 10.944 µs]
   Found 6 outliers among 100 measurements (6.00%)
     3 (3.00%) high mild
     3 (3.00%) high severe
   
   interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]
                           time:   [10.675 µs 10.678 µs 10.681 µs]
   Found 6 outliers among 100 measurements (6.00%)
     4 (4.00%) high mild
     2 (2.00%) high severe
   
   interleave dict_sparse(20, 0.0) 100 [0..100, 100..230, 450..1000]
                           time:   [1.6390 µs 1.6399 µs 1.6407 µs]
   Found 6 outliers among 100 measurements (6.00%)
     3 (3.00%) low mild
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   interleave dict_sparse(20, 0.0) 400 [0..100, 100..230, 450..1000]
                           time:   [4.5065 µs 4.5080 µs 4.5096 µs]
   Found 3 outliers among 100 measurements (3.00%)
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000]
                           time:   [11.008 µs 11.014 µs 11.020 µs]
   Found 3 outliers among 100 measurements (3.00%)
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]
                           time:   [10.684 µs 10.688 µs 10.692 µs]
   Found 6 outliers among 100 measurements (6.00%)
     1 (1.00%) low mild
     4 (4.00%) high mild
     1 (1.00%) high severe
   ```
   
   Interleave with different dictionaries where one is sparse (will merge)
   ```
   interleave dict_distinct 100
                           time:   [5.2581 µs 5.2605 µs 5.2631 µs]
   Found 6 outliers among 100 measurements (6.00%)
     2 (2.00%) high mild
     4 (4.00%) high severe
   
   interleave dict_distinct 1024
                           time:   [5.2225 µs 5.2256 µs 5.2289 µs]
   Found 8 outliers among 100 measurements (8.00%)
     5 (5.00%) high mild
     3 (3.00%) high severe
   
   interleave dict_distinct 2048
                           time:   [5.0365 µs 5.0419 µs 5.0478 µs]
   Found 2 outliers among 100 measurements (2.00%)
     1 (1.00%) high mild
     1 (1.00%) high severe
   ```
   
   The concatenate kernel is typically much faster, so merging does represent 
a non-trivial performance regression. (Note this benchmark concatenates 2 
arrays of 1024 rows each, whereas the interleave benchmarks produce an output 
of 1024 rows.)
   
   ```
   concat str_dict 1024    time:   [2.7670 µs 2.7692 µs 2.7711 µs]
   
   concat str_dict_sparse 1024
                           time:   [21.122 µs 21.136 µs 21.151 µs]
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high severe
   ```

