botszhuang opened a new pull request, #1380: URL: https://github.com/apache/mahout/pull/1380
### Related Issues

Relates to #1227

### Changes

- [ ] Bug fix
- [ ] New feature
- [x] Refactoring
- [ ] Documentation
- [ ] Test
- [ ] CI/CD pipeline
- [ ] Other

### Why

This PR aims to strengthen the CUDA kernel in `phase.cu` by optimizing the norm calculation. PTX analysis and empirical benchmarks showed that folding the norm multiplication into the existing loop improves performance more than using `pow()` or focusing solely on branch divergence.

### How

#### Optimization Details & Insights: Norm Calculation Optimization (The Real Winner 🚀)

- **Hypothesis**: Computing the normalization factor with `pow(M_SQRT1_2, num_qubits)` via the SFU would be faster.
- **Benchmark Reality**: `pow()` actually introduced overhead and slowed down execution.
- **The Better Solution**: Embedding the norm scaling (`norm *= M_SQRT1_2;`) directly into the loop alongside the phase accumulation achieved the best results. This allows the compiler to pipeline the instructions effectively.

#### Benchmark Results

- **Environment**: RunPod GPU instance (CUDA 12.8, NVVM 7.0.1)
- **Configuration**: Grid size: 2048, Block size: 512

| Implementation | Execution Time (ms) | Speedup |
| :--- | :--- | :--- |
| **Original** | 0.4434 | Baseline |
| **Inline Norm in Loop (This PR)** | **0.3995** | **~1.11x (about 10% less time)** |

## Checklist

- [ ] Added or updated unit tests for all changes
- [x] Added or updated documentation for all changes
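
To illustrate the norm optimization described under "How", here is a minimal sketch of the two variants. It is not the code from `phase.cu`, which is not reproduced in this message. The names `Amp`, `amp_pow`, `amp_fused`, `phases`, and `kInvSqrt2`, as well as the assumed amplitude form (a uniform product state whose phase is the sum of the per-qubit phases selected by the bits of the basis index), are hypothetical and chosen only for illustration. The `HD` macro lets the same functions compile as plain C++ on the host or as device functions under nvcc.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// 1/sqrt(2); same value as M_SQRT1_2, spelled out for portability.
constexpr double kInvSqrt2 = 0.70710678118654752440;

// Hypothetical complex amplitude type (illustration only).
struct Amp { double re; double im; };

// Variant A (the hypothesis): accumulate the phase in the loop,
// then compute the normalization with a separate pow() call.
HD inline Amp amp_pow(const double* phases, int num_qubits, std::size_t idx) {
    assert(num_qubits >= 0);
    double phase = 0.0;
    for (int q = 0; q < num_qubits; ++q) {
        // Add the phase of qubit q when bit q of idx is set.
        phase += ((idx >> q) & 1u) ? phases[q] : 0.0;
    }
    const double norm = std::pow(kInvSqrt2, static_cast<double>(num_qubits));
    return Amp{norm * std::cos(phase), norm * std::sin(phase)};
}

// Variant B (the approach this PR describes): scale the norm inside
// the same loop, one multiplication per qubit, with no pow() call.
HD inline Amp amp_fused(const double* phases, int num_qubits, std::size_t idx) {
    assert(num_qubits >= 0);
    double phase = 0.0;
    double norm = 1.0;
    for (int q = 0; q < num_qubits; ++q) {
        phase += ((idx >> q) & 1u) ? phases[q] : 0.0;
        norm *= kInvSqrt2;  // independent of the phase update above
    }
    return Amp{norm * std::cos(phase), norm * std::sin(phase)};
}
```

Both variants should produce the same amplitudes up to floating-point rounding; they differ only in how the factor `(1/sqrt(2))^num_qubits` is obtained. The sketch does not reproduce the benchmark figures above, which depend on the actual kernel, compiler, and GPU.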
