rluvaton opened a new pull request, #19520:
URL: https://github.com/apache/datafusion/pull/19520
## Which issue does this PR close?
N/A
## Rationale for this change
Grouping on a large amount of data is very slow.
## What changes are included in this PR?
Used my fork of hashbrown, which adds prefetching, and enabled prefetching
when the map is large.
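To illustrate the general idea (this is not the code in this PR or in the hashbrown fork), here is a minimal sketch of lookahead prefetching in a toy open-addressing table. Everything in it is hypothetical and made up for the example: `GroupTable`, `group_batch`, the `distance` parameter, and the hash function. It assumes the hashes for a whole batch are known up front, so the slot for a key a few positions ahead can be requested from memory while the current key is being probed. On x86_64 it uses `_mm_prefetch` with `_MM_HINT_T0`; on other targets the hint is a no-op.

```rust
// Illustrative sketch only: NOT the implementation from this PR or hashbrown.

/// Hint the CPU to pull the cache line holding `ptr` into cache.
/// Assumption: only x86_64 is handled; other targets get a no-op.
#[inline(always)]
fn prefetch<T>(ptr: *const T) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8);
    }
    #[cfg(not(target_arch = "x86_64"))]
    let _ = ptr;
}

/// Multiplicative hash, chosen only to keep the example short.
fn hash(key: u64) -> u64 {
    key.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Toy linear-probing table that maps keys to dense group ids.
/// It never resizes, so capacity must exceed the number of distinct keys.
struct GroupTable {
    slots: Vec<Option<(u64, u32)>>, // (key, group id)
    mask: usize,
    groups: u32,
}

impl GroupTable {
    fn with_capacity_pow2(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        Self { slots: vec![None; capacity], mask: capacity - 1, groups: 0 }
    }

    fn slot_of(&self, h: u64) -> usize {
        // Use the high bits, which are better mixed by the multiplicative hash.
        ((h >> 32) as usize) & self.mask
    }

    /// Return the group id for `key`, inserting a new group if needed.
    fn intern(&mut self, key: u64, h: u64) -> u32 {
        // Guarantees at least one empty slot, so the probe loop terminates.
        assert!((self.groups as usize) < self.slots.len(), "toy table has no resize");
        let mut i = self.slot_of(h);
        loop {
            match self.slots[i] {
                Some((k, id)) if k == key => return id,
                Some(_) => i = (i + 1) & self.mask,
                None => {
                    let id = self.groups;
                    self.slots[i] = Some((key, id));
                    self.groups += 1;
                    return id;
                }
            }
        }
    }

    /// Assign a group id to every key in the batch.
    /// `distance` is how many keys ahead to prefetch; 0 disables prefetching.
    fn group_batch(&mut self, keys: &[u64], distance: usize) -> Vec<u32> {
        let hashes: Vec<u64> = keys.iter().map(|&k| hash(k)).collect();
        let mut out = Vec::with_capacity(keys.len());
        for i in 0..keys.len() {
            if distance > 0 {
                if let Some(&future) = hashes.get(i + distance) {
                    let slot: *const Option<(u64, u32)> = &self.slots[self.slot_of(future)];
                    prefetch(slot);
                }
            }
            out.push(self.intern(keys[i], hashes[i]));
        }
        out
    }
}
```

Prefetching is only a hint, so `group_batch(&keys, 0)` and `group_batch(&keys, 8)` return the same group ids; only the memory access timing differs. Any benefit would be expected once the table no longer fits in cache, which is why the sketch takes the distance as a parameter instead of hard-coding it.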
## Are these changes tested?
No
## Are there any user-facing changes?
No
-----
Command:
```bash
datafusion-cli --command "select value from range(0,100000000) group by value, value + 1;"
```
with the following config:
```sql
set datafusion.execution.coalesce_batches=false;
set datafusion.optimizer.repartition_aggregations=false;
set datafusion.optimizer.enable_round_robin_repartition=false;
-- Settings added by this PR
set datafusion.execution.agg_prefetch_elements=0; -- 0 means disabled
set datafusion.execution.agg_prefetch_locality=3; -- no effect if agg_prefetch_elements is 0
set datafusion.execution.agg_prefetch_read=false; -- no effect if agg_prefetch_elements is 0
```
Without prefetch: 16.002 seconds
With prefetch: 10.176 seconds
| Metric          | Before | After  | Improvement     |
|-----------------|--------|--------|-----------------|
| LLC-load-misses | 91.9M  | 25.7M  | 72% reduction   |
| LLC-loads       | 202M   | 101.7M | 50% reduction   |
| Cycles          | 62.4B  | 39.8B  | 36% reduction   |
| IPC             | 0.49   | 0.80   | 63% improvement |
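For anyone who wants to re-derive the table, the percentages can be recomputed from the raw `perf stat` counters under Results. The sketch below does that; the `Run` struct and the helper names are made up for this example. Note that perf multiplexed these events (the percentages in parentheses in its output), so the counts are scaled estimates. Computed from the unrounded counters, the IPC improvement comes out at about 64%, versus 63% from the rounded 0.49 and 0.80.

```rust
// Counters copied from the two `perf stat` runs in the Results section.
struct Run {
    cycles: f64,
    instructions: f64,
    llc_loads: f64,
    llc_load_misses: f64,
}

const BEFORE: Run = Run {
    cycles: 62_399_336_707.0,
    instructions: 30_518_679_380.0,
    llc_loads: 201_998_979.0,
    llc_load_misses: 91_945_353.0,
};

const AFTER: Run = Run {
    cycles: 39_771_220_883.0,
    instructions: 31_973_887_904.0,
    llc_loads: 101_659_922.0,
    llc_load_misses: 25_681_158.0,
};

impl Run {
    /// Instructions per cycle.
    fn ipc(&self) -> f64 {
        self.instructions / self.cycles
    }
}

/// Percentage by which `after` is lower than `before`.
fn reduction_pct(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Percentage by which `after` is higher than `before`.
fn increase_pct(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// One (metric, percent change) pair per row of the table.
fn summary() -> Vec<(&'static str, f64)> {
    vec![
        ("LLC-load-misses reduction", reduction_pct(BEFORE.llc_load_misses, AFTER.llc_load_misses)),
        ("LLC-loads reduction", reduction_pct(BEFORE.llc_loads, AFTER.llc_loads)),
        ("Cycles reduction", reduction_pct(BEFORE.cycles, AFTER.cycles)),
        ("IPC improvement", increase_pct(BEFORE.ipc(), AFTER.ipc())),
    ]
}
```

Printing each pair from `summary()` with `{:.1}` formatting gives the values to compare against the table. The same `reduction_pct` helper applied to the elapsed times (16.002 s and 10.176 s) gives the wall-clock reduction.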
## Env
Machine: `c5.metal`
```
$ ./neofetch
ec2-user@
------------------------------------------------------
OS: Amazon Linux 2023.9.20251208 x86_64
Host: Amazon EC2
Kernel: 6.1.158-180.294.amzn2023.x86_64
Uptime: 51 mins
Packages: 800 (rpm)
Shell: bash 5.2.15
Terminal: /dev/pts/1
CPU: Intel Xeon Platinum 8275CL (96) @ 3.900GHz
Memory: 2514MiB / 193043MiB
```
```console
$ ./cpufetch --verbose
Name: Intel Xeon Platinum 8275CL
Microarchitecture: Cascade Lake
Technology: 14nm
Max Frequency: 3.900 GHz
Sockets: 2
Cores: 24 cores (48 threads)
Cores (Total): 48 cores (96 threads)
AVX: AVX,AVX2,AVX512
FMA: FMA3
L1i Size: 32KB (1.5MB Total)
L1d Size: 32KB (1.5MB Total)
L2 Size: 1MB (48MB Total)
L3 Size: 35.75MB (71.5MB Total)
Peak Performance: 11.98 TFLOP/s
```
<details>
<summary>Results</summary>
## Without prefetch
```console
$ perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses datafusion-cli --command "
set datafusion.execution.coalesce_batches=false;
set datafusion.optimizer.repartition_aggregations=false;
set datafusion.optimizer.enable_round_robin_repartition=false;
set datafusion.execution.agg_prefetch_elements=0;
set datafusion.execution.agg_prefetch_locality=3;
set datafusion.execution.agg_prefetch_read=false;
select value from range(0,100000000) group by value, value + 1;
"
DataFusion CLI v51.0.0
0 row(s) fetched.
Elapsed 0.001 seconds.
0 row(s) fetched.
Elapsed 0.000 seconds.
0 row(s) fetched.
Elapsed 0.000 seconds.
0 row(s) fetched.
Elapsed 0.000 seconds.
0 row(s) fetched.
Elapsed 0.000 seconds.
0 row(s) fetched.
Elapsed 0.000 seconds.
+-------+
| value |
+-------+
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
| 11 |
| 12 |
| 13 |
| 14 |
| 15 |
| 16 |
| 17 |
| 18 |
| 19 |
| 20 |
| 21 |
| 22 |
| 23 |
| 24 |
| 25 |
| 26 |
| 27 |
| 28 |
| 29 |
| 30 |
| 31 |
| 32 |
| 33 |
| 34 |
| 35 |
| 36 |
| 37 |
| 38 |
| 39 |
| . |
| . |
| . |
+-------+
100000000 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 16.002 seconds.
Performance counter stats for 'datafusion-cli --command
set datafusion.execution.coalesce_batches=false;
set datafusion.optimizer.repartition_aggregations=false;
set datafusion.optimizer.enable_round_robin_repartition=false;
set datafusion.execution.agg_prefetch_elements=0;
set datafusion.execution.agg_prefetch_locality=3;
set datafusion.execution.agg_prefetch_read=false;
select value from range(0,100000000) group by value, value + 1;
':
    62399336707      cycles                                                             (62.64%)
    30518679380      instructions             #    0.49  insn per cycle                 (75.11%)
      589924410      cache-references                                                   (75.11%)
      327996655      cache-misses             #   55.600 % of all cache refs            (75.11%)
     6616784604      L1-dcache-loads                                                    (75.11%)
      880862425      L1-dcache-load-misses    #   13.31% of all L1-dcache accesses      (75.11%)
      201998979      LLC-loads                                                          (49.81%)
       91945353      LLC-load-misses          #   45.52% of all LL-cache accesses       (50.00%)
16.036173799 seconds time elapsed
14.416051000 seconds user
1.664161000 seconds sys
```
## With prefetch
```console
$ perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses datafusion-cli --command "
set datafusion.execution.coalesce_batches=false;
set datafusion.optimizer.repartition_aggregations=false;
set datafusion.optimizer.enable_round_robin_repartition=false;
set datafusion.execution.agg_prefetch_elements=1;
set datafusion.execution.agg_prefetch_locality=3;
set datafusion.execution.agg_prefetch_read=false;
select value from range(0,100000000) group by value, value + 1;
"
DataFusion CLI v51.0.0
0 row(s) fetched.
Elapsed 0.001 seconds.
0 row(s) fetched.
Elapsed 0.000 seconds.
0 row(s) fetched.
Elapsed 0.000 seconds.
0 row(s) fetched.
Elapsed 0.000 seconds.
0 row(s) fetched.
Elapsed 0.000 seconds.
0 row(s) fetched.
Elapsed 0.000 seconds.
+-------+
| value |
+-------+
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
| 11 |
| 12 |
| 13 |
| 14 |
| 15 |
| 16 |
| 17 |
| 18 |
| 19 |
| 20 |
| 21 |
| 22 |
| 23 |
| 24 |
| 25 |
| 26 |
| 27 |
| 28 |
| 29 |
| 30 |
| 31 |
| 32 |
| 33 |
| 34 |
| 35 |
| 36 |
| 37 |
| 38 |
| 39 |
| . |
| . |
| . |
+-------+
100000000 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 10.176 seconds.
Performance counter stats for 'datafusion-cli --command
set datafusion.execution.coalesce_batches=false;
set datafusion.optimizer.repartition_aggregations=false;
set datafusion.optimizer.enable_round_robin_repartition=false;
set datafusion.execution.agg_prefetch_elements=1;
set datafusion.execution.agg_prefetch_locality=3;
set datafusion.execution.agg_prefetch_read=false;
select value from range(0,100000000) group by value, value + 1;
':
    39771220883      cycles                                                             (62.54%)
    31973887904      instructions             #    0.80  insn per cycle                 (75.06%)
      633535114      cache-references                                                   (75.18%)
      374760871      cache-misses             #   59.154 % of all cache refs            (75.18%)
     6598313098      L1-dcache-loads                                                    (75.18%)
      920200907      L1-dcache-load-misses    #   13.95% of all L1-dcache accesses      (75.18%)
      101659922      LLC-loads                                                          (49.66%)
       25681158      LLC-load-misses          #   25.26% of all LL-cache accesses       (49.84%)
10.204965327 seconds time elapsed
8.432703000 seconds user
1.806306000 seconds sys
```
</details>