rich7420 opened a new pull request, #708:
URL: https://github.com/apache/mahout/pull/708
### Purpose of PR
Implements a streaming producer-consumer pipeline for large Parquet
datasets. IO and GPU work overlap asynchronously via threads and CUDA
streams, and pinned (page-locked) host memory is used for efficient
host-to-device (H2D) transfers. A fused CUDA kernel merges L2 normalization
with encoding, improving throughput and memory usage for batch amplitude
encoding. Apologies for the size of this PR.
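For reviewers who want the shape of the pipeline before reading the diff, here is a minimal sketch in pure Python. It is not the code in this PR: the names `fused_normalize_encode`, `producer`, and `run_pipeline` are hypothetical, in-memory lists stand in for Parquet batches, and a plain function stands in for the fused CUDA kernel. It only illustrates the two ideas described above: a bounded queue that lets the reader thread run ahead of the compute side, and normalization and encoding done in one pass per row.

```python
import math
import queue
import threading

_SENTINEL = None  # marks end of stream


def fused_normalize_encode(batch):
    """L2-normalize each row and emit it as an amplitude vector in one pass.

    Hypothetical stand-in for a fused kernel: the norm is computed and
    applied per row, without a separate normalized copy of the whole batch.
    """
    out = []
    for row in batch:
        norm = math.sqrt(sum(x * x for x in row))
        if norm == 0.0:
            raise ValueError("cannot amplitude-encode an all-zero row")
        out.append([x / norm for x in row])
    return out


def producer(batches, q):
    """IO side: push batches into a bounded queue (stands in for file reads)."""
    for batch in batches:
        q.put(batch)  # blocks when the queue is full, which gives backpressure
    q.put(_SENTINEL)


def run_pipeline(batches, depth=2):
    """Overlap the 'IO' thread with 'compute' on the calling thread."""
    q = queue.Queue(maxsize=depth)
    t = threading.Thread(target=producer, args=(batches, q), daemon=True)
    t.start()
    results = []
    while True:
        batch = q.get()
        if batch is _SENTINEL:
            break
        results.extend(fused_normalize_encode(batch))
    t.join()
    return results


if __name__ == "__main__":
    encoded = run_pipeline([[[3.0, 4.0]], [[1.0, 0.0], [0.0, 2.0]]])
    for row in encoded:
        print(row)
```

The queue depth bounds how many batches are held in host memory at once; in a GPU setting that would correspond to the number of pinned staging buffers in flight, which is an assumption of this sketch rather than a statement about the PR's configuration.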
Before
```
** CUDA API Summary (cuda_api_sum):
Time (%)  Total Time (ns)  Num Calls   Avg (ns)   Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
--------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------
    56.8          6233567          2  3116783.5  3116783.5     16689   6216878    4384195.7  cuMemAllocAsync
    23.0          2523663          1  2523663.0  2523663.0   2523663   2523663          0.0  cuMemcpyHtoDAsync_v2
    15.5          1700184          2   850092.0   850092.0    123903   1576281    1026986.3  cudaMemGetInfo
     3.0           331958          1   331958.0   331958.0    331958    331958          0.0  cudaLaunchKernel
     0.9           103171          2    51585.5    51585.5     38644     64527      18302.0  cuStreamSynchronize
     0.4            48043        412      116.6       88.0        53      4443        224.1  cuGetProcAddress_v2
     0.2            19526          9     2169.6     1614.0       193      7217       2304.0  cuCtxSetCurrent
     0.0             4507          2     2253.5     2253.5      1239      3268       1434.7  cuMemFreeAsync
     0.0             1956          1     1956.0     1956.0      1956      1956          0.0  cuInit
     0.0             1489          1     1489.0     1489.0      1489      1489          0.0  cuEventCreate
     0.0              788          1      788.0      788.0       788       788          0.0  cuEventDestroy_v2
```
After
```
** CUDA API Summary (cuda_api_sum):
Time (%)  Total Time (ns)  Num Calls   Avg (ns)   Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
--------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------
    78.7          5843139          2  2921569.5  2921569.5      2520   5840619    4128159.4  cuMemAllocAsync
    17.4          1292311          1  1292311.0  1292311.0   1292311   1292311          0.0  cuMemcpyHtoDAsync_v2
     2.8           205304          1   205304.0   205304.0    205304    205304          0.0  cudaLaunchKernel
     0.7            53109        412      128.9       95.0        59      4253        219.1  cuGetProcAddress_v2
     0.2            15512          9     1723.6     1059.0       218      5169       1795.5  cuCtxSetCurrent
     0.1            10387          2     5193.5     5193.5      2382      8005       3976.1  cuStreamSynchronize
     0.1             4185          2     2092.5     2092.5       967      3218       1591.7  cuMemFreeAsync
     0.0             1669          1     1669.0     1669.0      1669      1669          0.0  cuEventCreate
     0.0             1304          1     1304.0     1304.0      1304      1304          0.0  cuInit
     0.0              774          1      774.0      774.0       774       774          0.0  cuEventDestroy_v2
     0.0               85          1       85.0       85.0        85        85          0.0  cuModuleGetLoadingMode
```
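To make the before/after comparison easier to read, the snippet below computes the ratio of total time for a subset of the calls that appear in both summaries. The numbers are copied from the two tables above; the helper name `speedups` is hypothetical. Note that `cuda_api_sum` reports host-side time spent in CUDA API calls, not kernel execution time on the device, and that each summary comes from a single run, so the ratios are indicative only.

```python
# Total time (ns) per CUDA API call, copied from the nsys summaries above.
before = {
    "cuMemAllocAsync": 6233567,
    "cuMemcpyHtoDAsync_v2": 2523663,
    "cudaMemGetInfo": 1700184,
    "cudaLaunchKernel": 331958,
    "cuStreamSynchronize": 103171,
}
after = {
    "cuMemAllocAsync": 5843139,
    "cuMemcpyHtoDAsync_v2": 1292311,
    "cudaLaunchKernel": 205304,
    "cuStreamSynchronize": 10387,
}


def speedups(before, after):
    """Return before/after total-time ratios for calls present in both runs."""
    return {name: before[name] / after[name] for name in before if name in after}


for name, ratio in speedups(before, after).items():
    print(f"{name}: {ratio:.2f}x")
```

`cudaMemGetInfo` is skipped by the helper because it does not appear in the "After" summary at all.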
** NVTX Range Summary (nvtx_sum):
### Related Issues or PRs
Related to #699
### Changes Made
- [ ] Bug fix
- [x] New feature
- [x] Refactoring
- [ ] Documentation
- [ ] Test
- [ ] CI/CD pipeline
- [ ] Other
### Breaking Changes
- [x] Yes
- [ ] No
### Checklist
- [ ] Added or updated unit tests for all changes
- [ ] Added or updated documentation for all changes
- [x] Successfully built and ran all unit tests or manual tests locally
- [ ] PR title follows "MAHOUT-XXX: Brief Description" format (if related to
an issue)
- [ ] Code follows ASF guidelines
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]