rluvaton opened a new pull request, #19520:
URL: https://github.com/apache/datafusion/pull/19520

   ## Which issue does this PR close?
   
   N/A
   
   ## Rationale for this change
   
   Grouping on a large amount of data is very slow.
   
   ## What changes are included in this PR?
   
   Used my fork of hashbrown with prefetching support, and added prefetching during aggregation when the map is large.
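
For readers unfamiliar with the technique, here is a minimal sketch of software prefetching in a batched hash-table probe. It is not the code from this PR or from the hashbrown fork: the toy open-addressing table, `count_groups`, `prefetch`, and the `distance` parameter are all hypothetical names made up for illustration. The idea is only that, while probing key `i`, a hint is issued for the slot that key `i + distance` will touch, so its cache line is more likely to be resident when the loop gets there.

```rust
/// Best-effort prefetch hint for the cache line holding `*ptr` (x86_64 only).
#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch<T>(ptr: *const T) {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // A prefetch is only a hint: it does not dereference the pointer.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8) }
}

/// On other targets this sketch simply does nothing.
#[cfg(not(target_arch = "x86_64"))]
#[inline(always)]
fn prefetch<T>(_ptr: *const T) {}

/// Sentinel for an unused slot (assumption: never appears as a real key).
const EMPTY: u64 = u64::MAX;

/// Illustrative multiplicative hash, not the hasher DataFusion uses.
fn hash(key: u64) -> u64 {
    key.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Inserts `keys` into a toy linear-probing table and returns the number of
/// distinct keys. `capacity` must be a power of two and larger than the
/// number of distinct keys. `distance == 0` disables prefetching.
fn count_groups(keys: &[u64], capacity: usize, distance: usize) -> usize {
    assert!(capacity.is_power_of_two());
    let mask = capacity - 1;
    let mut slots = vec![EMPTY; capacity];

    // Hash the whole batch up front so future slot positions are known.
    let positions: Vec<usize> = keys
        .iter()
        .map(|&k| (hash(k) >> 32) as usize & mask)
        .collect();

    let mut groups = 0;
    for (i, &key) in keys.iter().enumerate() {
        // Hint the slot that a later key in the batch will probe.
        if distance > 0 {
            if let Some(&future) = positions.get(i + distance) {
                prefetch(&slots[future] as *const u64);
            }
        }

        // Ordinary linear probing for the current key.
        let mut pos = positions[i];
        loop {
            if slots[pos] == EMPTY {
                slots[pos] = key;
                groups += 1;
                break;
            }
            if slots[pos] == key {
                break;
            }
            pos = (pos + 1) & mask;
        }
    }
    groups
}
```

Prefetching changes only timing, never results, so `count_groups` returns the same count for any `distance`. Whether it helps depends on the table being much larger than the last-level cache, which is why the sketch makes it optional.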
    
   ## Are these changes tested?
   
   No
   
   ## Are there any user-facing changes?
   
   No
   
   -----
   
   Command:
   ```bash
   datafusion-cli --command "select value from range(0,100000000) group by value, value + 1;"
   ```
   
   with the following config:
   ```sql
   set datafusion.execution.coalesce_batches=false;
   set datafusion.optimizer.repartition_aggregations=false;
   set datafusion.optimizer.enable_round_robin_repartition=false;
   
   -- Options added by this PR
   set datafusion.execution.agg_prefetch_elements=0; -- 0 is disabled
   set datafusion.execution.agg_prefetch_locality=3; -- no effect if agg_prefetch_elements is 0
   set datafusion.execution.agg_prefetch_read=false; -- no effect if agg_prefetch_elements is 0
   ```
   
   Without prefetch: 16.002 seconds
   With prefetch: 10.176 seconds
   
    Metric          | Before | After  | Improvement     
   -----------------|--------|--------|-----------------
    LLC-load-misses | 91.9M  | 25.7M  | 72% reduction   
    LLC-loads       | 202M   | 101.7M | 50% reduction   
    Cycles          | 62.4B  | 39.8B  | 36% reduction   
    IPC             | 0.49   | 0.80   | 63% improvement 
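
The percentages in the table can be re-derived from the raw `perf stat` counters in the Results section. The helper names below (`pct_reduction`, `ipc`) are made up for this sketch. Computed from the unrounded counters, the IPC gain comes out at roughly 64%, slightly above the table's 63%, which appears to be derived from the rounded 0.49 and 0.80 values.

```rust
/// Percentage reduction going from `before` to `after`.
fn pct_reduction(before: f64, after: f64) -> f64 {
    (1.0 - after / before) * 100.0
}

/// Instructions per cycle.
fn ipc(instructions: f64, cycles: f64) -> f64 {
    instructions / cycles
}

/// Prints the derived metrics using the counters reported by `perf stat`.
fn print_summary() {
    // (name, before, after), copied from the perf output in this PR.
    let counters = [
        ("LLC-load-misses", 91_945_353.0, 25_681_158.0),
        ("LLC-loads", 201_998_979.0, 101_659_922.0),
        ("cycles", 62_399_336_707.0, 39_771_220_883.0),
    ];
    for (name, before, after) in counters {
        println!("{name}: {:.1}% reduction", pct_reduction(before, after));
    }

    let ipc_before = ipc(30_518_679_380.0, 62_399_336_707.0);
    let ipc_after = ipc(31_973_887_904.0, 39_771_220_883.0);
    println!(
        "IPC: {ipc_before:.2} -> {ipc_after:.2} ({:.1}% improvement)",
        (ipc_after / ipc_before - 1.0) * 100.0
    );
}
```

Note that the instruction count barely changes between the two runs, so almost all of the cycle reduction comes from fewer stalls rather than less work.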
   
   
   ## Env
   
   Machine: `c5.metal`
   ```
   $ ./neofetch
                `-/oydNNdyo:.`                ec2-user@
         `.:+shmMMMMMMMMMMMMMMmhs+:.`         ------------------------------------------------------
       -+hNNMMMMMMMMMMMMMMMMMMMMMMNNho-       OS: Amazon Linux 2023.9.20251208 x86_64
   .``      -/+shmNNMMMMMMNNmhs+/-      ``.   Host: Amazon EC2
   dNmhs+:.       `.:/oo/:.`       .:+shmNd   Kernel: 6.1.158-180.294.amzn2023.x86_64
   dMMMMMMMNdhs+:..        ..:+shdNMMMMMMMd   Uptime: 51 mins
   dMMMMMMMMMMMMMMNds    odNMMMMMMMMMMMMMMd   Packages: 800 (rpm)
   dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd   Shell: bash 5.2.15
   dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd   Terminal: /dev/pts/1
   dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd   CPU: Intel Xeon Platinum 8275CL (96) @ 3.900GHz
   dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd   Memory: 2514MiB / 193043MiB
   dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd
   dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd
   dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd
   dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd
   dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd
   .:+ydNMMMMMMMMMMMh    yMMMMMMMMMMMNdy+:.
        `.:+shNMMMMMh    yMMMMMNhs+:``
               `-+shy    shs+:`
   
   ```
   
   ```console
   $ ./cpufetch --verbose
   Name:                Intel Xeon Platinum 8275CL
   Microarchitecture:   Cascade Lake
   Technology:          14nm
   Max Frequency:       3.900 GHz
   Sockets:             2
   Cores:               24 cores (48 threads)
   Cores (Total):       48 cores (96 threads)
   AVX:                 AVX,AVX2,AVX512
   FMA:                 FMA3
   L1i Size:            32KB (1.5MB Total)
   L1d Size:            32KB (1.5MB Total)
   L2 Size:             1MB (48MB Total)
   L3 Size:             35.75MB (71.5MB Total)
   Peak Performance:    11.98 TFLOP/s
   ```
   
   <details>
   <summary>Results</summary>
   
   
   
   
   
   ## Without prefetch
   ```console
   $ perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses datafusion-cli --command "
   set datafusion.execution.coalesce_batches=false;
   set datafusion.optimizer.repartition_aggregations=false;
   set datafusion.optimizer.enable_round_robin_repartition=false;
   set datafusion.execution.agg_prefetch_elements=0;
   set datafusion.execution.agg_prefetch_locality=3;
   set datafusion.execution.agg_prefetch_read=false;
   select value from range(0,100000000) group by value, value + 1;
   "
   DataFusion CLI v51.0.0
   0 row(s) fetched.
   Elapsed 0.001 seconds.
   
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   +-------+
   | value |
   +-------+
   | 0     |
   | 1     |
   | 2     |
   | 3     |
   | 4     |
   | 5     |
   | 6     |
   | 7     |
   | 8     |
   | 9     |
   | 10    |
   | 11    |
   | 12    |
   | 13    |
   | 14    |
   | 15    |
   | 16    |
   | 17    |
   | 18    |
   | 19    |
   | 20    |
   | 21    |
   | 22    |
   | 23    |
   | 24    |
   | 25    |
   | 26    |
   | 27    |
   | 28    |
   | 29    |
   | 30    |
   | 31    |
   | 32    |
   | 33    |
   | 34    |
   | 35    |
   | 36    |
   | 37    |
   | 38    |
   | 39    |
   | .     |
   | .     |
   | .     |
   +-------+
   100000000 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
   Elapsed 16.002 seconds.
   
   
    Performance counter stats for 'datafusion-cli --command
   set datafusion.execution.coalesce_batches=false;
   set datafusion.optimizer.repartition_aggregations=false;
   set datafusion.optimizer.enable_round_robin_repartition=false;
   set datafusion.execution.agg_prefetch_elements=0;
   set datafusion.execution.agg_prefetch_locality=3;
   set datafusion.execution.agg_prefetch_read=false;
   select value from range(0,100000000) group by value, value + 1;
   ':
   
          62399336707      cycles                                                                (62.64%)
          30518679380      instructions                     #    0.49  insn per cycle           (75.11%)
            589924410      cache-references                                                      (75.11%)
            327996655      cache-misses                     #   55.600 % of all cache refs      (75.11%)
           6616784604      L1-dcache-loads                                                       (75.11%)
            880862425      L1-dcache-load-misses            #   13.31% of all L1-dcache accesses  (75.11%)
            201998979      LLC-loads                                                             (49.81%)
             91945353      LLC-load-misses                  #   45.52% of all LL-cache accesses  (50.00%)
   
         16.036173799 seconds time elapsed
   
         14.416051000 seconds user
          1.664161000 seconds sys
   ```
   
   ## With prefetch
   ```console
   $ perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses datafusion-cli --command "
   set datafusion.execution.coalesce_batches=false;
   set datafusion.optimizer.repartition_aggregations=false;
   set datafusion.optimizer.enable_round_robin_repartition=false;
   set datafusion.execution.agg_prefetch_elements=1;
   set datafusion.execution.agg_prefetch_locality=3;
   set datafusion.execution.agg_prefetch_read=false;
   select value from range(0,100000000) group by value, value + 1;
   "
   DataFusion CLI v51.0.0
   0 row(s) fetched.
   Elapsed 0.001 seconds.
   
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   +-------+
   | value |
   +-------+
   | 0     |
   | 1     |
   | 2     |
   | 3     |
   | 4     |
   | 5     |
   | 6     |
   | 7     |
   | 8     |
   | 9     |
   | 10    |
   | 11    |
   | 12    |
   | 13    |
   | 14    |
   | 15    |
   | 16    |
   | 17    |
   | 18    |
   | 19    |
   | 20    |
   | 21    |
   | 22    |
   | 23    |
   | 24    |
   | 25    |
   | 26    |
   | 27    |
   | 28    |
   | 29    |
   | 30    |
   | 31    |
   | 32    |
   | 33    |
   | 34    |
   | 35    |
   | 36    |
   | 37    |
   | 38    |
   | 39    |
   | .     |
   | .     |
   | .     |
   +-------+
   100000000 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
   Elapsed 10.176 seconds.
   
   
    Performance counter stats for 'datafusion-cli --command
   set datafusion.execution.coalesce_batches=false;
   set datafusion.optimizer.repartition_aggregations=false;
   set datafusion.optimizer.enable_round_robin_repartition=false;
   set datafusion.execution.agg_prefetch_elements=1;
   set datafusion.execution.agg_prefetch_locality=3;
   set datafusion.execution.agg_prefetch_read=false;
   select value from range(0,100000000) group by value, value + 1;
   ':
   
          39771220883      cycles                                                                (62.54%)
          31973887904      instructions                     #    0.80  insn per cycle           (75.06%)
            633535114      cache-references                                                      (75.18%)
            374760871      cache-misses                     #   59.154 % of all cache refs      (75.18%)
           6598313098      L1-dcache-loads                                                       (75.18%)
            920200907      L1-dcache-load-misses            #   13.95% of all L1-dcache accesses  (75.18%)
            101659922      LLC-loads                                                             (49.66%)
             25681158      LLC-load-misses                  #   25.26% of all LL-cache accesses  (49.84%)
   
         10.204965327 seconds time elapsed
   
          8.432703000 seconds user
          1.806306000 seconds sys
   
   ```
   
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

