rich7420 commented on PR #972:
URL: https://github.com/apache/mahout/pull/972#issuecomment-3816575134
```
cd /home/rich-wsl/mahout/qdp/qdp-python/benchmark
export QDP_ENABLE_POOL_METRICS=1
export QDP_ENABLE_OVERLAP_TRACKING=1
export RUST_LOG=info
uv run python run_pipeline_baseline.py --qubits 16 --batch-size 64 \
  --prefetch 16 --batches 500 --trials 20
```
# baseline report
- **Date**: 2026-01-29
- **Git commit**: ef00f92eb236
- **GPU**: NVIDIA GeForce RTX 3080
- **Driver**: 560.94
- **CUDA**: 12.1
## Parameters
- qubits: 16
- batch_size: 64
- prefetch: 16
- batches: 200
- trials: 5
- encoding: amplitude
## Results
| Metric | Median | P95 |
|--------|--------|-----|
| Throughput (vectors/sec) | 1454.2 | 1742.3 |
| Latency (ms/vector) | 0.720 | 0.731 |
---
```
cd /home/rich-wsl/mahout/qdp/qdp-python && nsys profile --trace=cuda,nvtx \
  --output=../docs/optimization/results/baseline_before_uv uv run python \
  benchmark/benchmark_throughput.py --qubits 16 --batches 200 --batch-size 64 \
  --prefetch 16 --frameworks mahout
cd /home/rich-wsl/mahout/qdp && nsys stats \
  docs/optimization/results/baseline_before_uv.sqlite
```

CUDA API Summary (cuda_api_sum):

| Time (%) | Total Time (ns) | Num Calls | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
|----------|-----------------|-----------|----------|----------|----------|----------|-------------|------|
| 60.3 | 1154513612 | 200 | 5772568.1 | 5600405.0 | 4603514 | 8527926 | 818264.7 | cuMemcpyHtoDAsync_v2 |
| 14.1 | 269678972 | 800 | 337098.7 | 160114.5 | 1525 | 2679828 | 573425.8 | cuStreamSynchronize |
| 12.7 | 243282018 | 200 | 1216410.1 | 1035882.0 | 76003 | 2659132 | 499445.4 | cuMemcpyDtoHAsync_v2 |
| 4.8 | 91270891 | 800 | 114088.6 | 7550.5 | 465 | 7939629 | 329877.7 | cuMemAllocAsync |
| 3.3 | 63990841 | 1200 | 53325.7 | 23571.5 | 6390 | 17779574 | 652454.9 | cudaLaunchKernel |
| 3.0 | 56531757 | 400 | 141329.4 | 135325.0 | 26292 | 422046 | 65129.1 | cudaMemGetInfo |
| 0.4 | 8475510 | 400 | 21188.8 | 12548.0 | 7223 | 100009 | 16110.7 | cudaMemsetAsync |
| 0.4 | 6802300 | 200 | 34011.5 | 32870.5 | 17636 | 213820 | 17152.8 | cuLaunchKernel |
| 0.3 | 6583789 | 200 | 32918.9 | 24025.5 | 14132 | 144830 | 18166.2 | cuMemsetD8Async |
| 0.2 | 4736816 | 3002 | 1577.9 | 863.5 | 145 | 67826 | 2298.3 | cuCtxSetCurrent |
| 0.2 | 3457162 | 2 | 1728581.0 | 1728581.0 | 8156 | 3449006 | 2433048.4 | cudaDeviceSynchronize |
| 0.1 | 2554445 | 800 | 3193.1 | 2796.0 | 970 | 47088 | 3201.1 | cuMemFreeAsync |
| 0.1 | 1376489 | 4 | 344122.3 | 346245.0 | 301121 | 382878 | 37481.3 | cudaMalloc |
| 0.0 | 174142 | 1 | 174142.0 | 174142.0 | 174142 | 174142 | 0.0 | cuModuleLoadData |
| 0.0 | 42998 | 383 | 112.3 | 90.0 | 52 | 708 | 73.1 | cuGetProcAddress_v2 |
| 0.0 | 23813 | 7 | 3401.9 | 2823.0 | 338 | 11571 | 3797.4 | cudaStreamIsCapturing_v10000 |
| 0.0 | 1173 | 1 | 1173.0 | 1173.0 | 1173 | 1173 | 0.0 | cuEventCreate |
| 0.0 | 1064 | 1 | 1064.0 | 1064.0 | 1064 | 1064 | 0.0 | cuInit |
| 0.0 | 599 | 1 | 599.0 | 599.0 | 599 | 599 | 0.0 | cuEventDestroy_v2 |
| 0.0 | 91 | 1 | 91.0 | 91.0 | 91 | 91 | 0.0 | cuModuleGetLoadingMode |
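
As a rough cross-check of the summary above, the sketch below recomputes the time share and per-call average for the four largest entries. The row values are copied from the nsys output; `TOTAL_NS` is the hand-computed sum of the Total Time column over all 20 rows, and the helper name `summarize` is hypothetical (it is not part of the benchmark scripts). Note that these are host-side times spent in CUDA API calls, not GPU execution times.

```python
# Rows copied from the cuda_api_sum output: (name, total_time_ns, num_calls).
ROWS = [
    ("cuMemcpyHtoDAsync_v2", 1154513612, 200),
    ("cuStreamSynchronize", 269678972, 800),
    ("cuMemcpyDtoHAsync_v2", 243282018, 200),
    ("cuMemAllocAsync", 91270891, 800),
]

# Assumption: sum of "Total Time (ns)" over all 20 rows, added up by hand.
TOTAL_NS = 1913498482


def summarize(rows, total_ns):
    """Return share of total API time (%) and mean time per call (ms) per API."""
    result = {}
    for name, time_ns, calls in rows:
        result[name] = {
            "share_pct": 100.0 * time_ns / total_ns,
            "avg_ms": time_ns / calls / 1e6,
        }
    return result


summary = summarize(ROWS, TOTAL_NS)
for name, stats in summary.items():
    print(f"{name:24s} {stats['share_pct']:5.1f}%  {stats['avg_ms']:.3f} ms/call")

# Combined share of the two async memcpy calls (host-to-device plus device-to-host).
copy_share = (
    summary["cuMemcpyHtoDAsync_v2"]["share_pct"]
    + summary["cuMemcpyDtoHAsync_v2"]["share_pct"]
)
print(f"memcpy total: {copy_share:.1f}%")
```

The recomputed shares should agree with the Time (%) column to within rounding, which is a quick way to confirm the rows were read correctly.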