[PR] [QDP] improve memory management [mahout]

via GitHub Wed, 10 Dec 2025 05:01:06 -0800


rich7420 opened a new pull request, #708:
URL: https://github.com/apache/mahout/pull/708


   ### Purpose of PR
   
   Implements a streaming producer-consumer pipeline for large Parquet
   datasets. IO and GPU work overlap asynchronously via threads and CUDA
   streams. Adds pinned memory for efficient host-to-device (H2D) transfers.
   Introduces a fused CUDA kernel that merges L2 normalization and encoding,
   improving throughput and memory usage for batch amplitude encoding.
   Apologies for the size of this PR.
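
A minimal CPU-only sketch of the pipeline shape described above, not the code in this PR. It assumes a bounded queue stands in for the pinned staging buffers, a Python thread stands in for the IO thread, and a plain function stands in for the fused CUDA kernel. The names `read_batches`, `encode_fused`, and `run_pipeline` are hypothetical.

```python
import math
import queue
import threading

SENTINEL = None  # marks end of stream


def read_batches(rows, batch_size):
    """Producer side: yield fixed-size batches (stand-in for Parquet reads)."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]


def encode_fused(sample, state_len):
    """Normalize and zero-pad in one output pass, with no intermediate
    normalized copy of the input (the idea behind fusing the two steps)."""
    if len(sample) > state_len:
        raise ValueError("sample does not fit in the state vector")
    norm = math.sqrt(sum(v * v for v in sample))
    if norm == 0.0:
        raise ValueError("cannot amplitude-encode a zero vector")
    return [sample[i] / norm if i < len(sample) else 0.0
            for i in range(state_len)]


def run_pipeline(rows, num_qubits, batch_size=2, depth=2):
    """Overlap reading (producer thread) with encoding (consumer loop)."""
    state_len = 1 << num_qubits
    # Bounded queue: the producer blocks once `depth` batches are waiting,
    # which caps memory use the way a fixed pool of staging buffers would.
    pending = queue.Queue(maxsize=depth)
    states = []

    def producer():
        for batch in read_batches(rows, batch_size):
            pending.put(batch)
        pending.put(SENTINEL)

    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        batch = pending.get()
        if batch is SENTINEL:
            break
        states.extend(encode_fused(sample, state_len) for sample in batch)
    worker.join()
    return states
```

Calling `run_pipeline([[3.0, 4.0], [1.0, 1.0, 1.0]], num_qubits=2)` returns one unit-norm vector of length 4 per input row. In the real pipeline the consumer step would be an asynchronous H2D copy plus a kernel launch on a CUDA stream rather than a Python loop.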
   
   Before
   ```
   ** CUDA API Summary (cuda_api_sum):
   
    Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)           Name
    --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------
        56.8          6233567          2  3116783.5  3116783.5     16689   6216878    4384195.7  cuMemAllocAsync
        23.0          2523663          1  2523663.0  2523663.0   2523663   2523663          0.0  cuMemcpyHtoDAsync_v2
        15.5          1700184          2   850092.0   850092.0    123903   1576281    1026986.3  cudaMemGetInfo
         3.0           331958          1   331958.0   331958.0    331958    331958          0.0  cudaLaunchKernel
         0.9           103171          2    51585.5    51585.5     38644     64527      18302.0  cuStreamSynchronize
         0.4            48043        412      116.6       88.0        53      4443        224.1  cuGetProcAddress_v2
         0.2            19526          9     2169.6     1614.0       193      7217       2304.0  cuCtxSetCurrent
         0.0             4507          2     2253.5     2253.5      1239      3268       1434.7  cuMemFreeAsync
         0.0             1956          1     1956.0     1956.0      1956      1956          0.0  cuInit
         0.0             1489          1     1489.0     1489.0      1489      1489          0.0  cuEventCreate
         0.0              788          1      788.0      788.0       788       788          0.0  cuEventDestroy_v2
   ```
   
   After
   ```
   ** CUDA API Summary (cuda_api_sum):
   
    Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)           Name
    --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------
        78.7          5843139          2  2921569.5  2921569.5      2520   5840619    4128159.4  cuMemAllocAsync
        17.4          1292311          1  1292311.0  1292311.0   1292311   1292311          0.0  cuMemcpyHtoDAsync_v2
         2.8           205304          1   205304.0   205304.0    205304    205304          0.0  cudaLaunchKernel
         0.7            53109        412      128.9       95.0        59      4253        219.1  cuGetProcAddress_v2
         0.2            15512          9     1723.6     1059.0       218      5169       1795.5  cuCtxSetCurrent
         0.1            10387          2     5193.5     5193.5      2382      8005       3976.1  cuStreamSynchronize
         0.1             4185          2     2092.5     2092.5       967      3218       1591.7  cuMemFreeAsync
         0.0             1669          1     1669.0     1669.0      1669      1669          0.0  cuEventCreate
         0.0             1304          1     1304.0     1304.0      1304      1304          0.0  cuInit
         0.0              774          1      774.0      774.0       774       774          0.0  cuEventDestroy_v2
         0.0               85          1       85.0       85.0        85        85          0.0  cuModuleGetLoadingMode
   ```
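
To make the two summaries easier to compare, the sketch below copies the "Total Time (ns)" column from each table and computes per-call ratios. The dictionaries and the `speedup` helper are illustrative only. Each table reflects a single profiled run, so the ratios are rough indicators rather than benchmark results.

```python
# Total Time (ns) per CUDA API call, copied from the "Before" table.
before = {
    "cuMemAllocAsync": 6233567,
    "cuMemcpyHtoDAsync_v2": 2523663,
    "cudaMemGetInfo": 1700184,
    "cudaLaunchKernel": 331958,
    "cuStreamSynchronize": 103171,
    "cuGetProcAddress_v2": 48043,
    "cuCtxSetCurrent": 19526,
    "cuMemFreeAsync": 4507,
    "cuInit": 1956,
    "cuEventCreate": 1489,
    "cuEventDestroy_v2": 788,
}

# Total Time (ns) per CUDA API call, copied from the "After" table.
after = {
    "cuMemAllocAsync": 5843139,
    "cuMemcpyHtoDAsync_v2": 1292311,
    "cudaLaunchKernel": 205304,
    "cuGetProcAddress_v2": 53109,
    "cuCtxSetCurrent": 15512,
    "cuStreamSynchronize": 10387,
    "cuMemFreeAsync": 4185,
    "cuEventCreate": 1669,
    "cuInit": 1304,
    "cuEventDestroy_v2": 774,
    "cuModuleGetLoadingMode": 85,
}


def speedup(name):
    """Ratio of before/after total time for one API call present in both runs."""
    return before[name] / after[name]


total_before = sum(before.values())
total_after = sum(after.values())

print(f"total API time: {total_before} ns -> {total_after} ns")
for name in ("cuMemcpyHtoDAsync_v2", "cudaLaunchKernel", "cuStreamSynchronize"):
    print(f"{name}: {speedup(name):.2f}x")
print("calls no longer present:", sorted(set(before) - set(after)))
```

By these numbers, total CUDA API time drops from about 10.97 ms to about 7.43 ms, the H2D copy takes roughly half as long, and `cudaMemGetInfo` no longer appears in the summary.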
   
   ** NVTX Range Summary (nvtx_sum):
   
   
   
   
   
   ### Related Issues or PRs
   Related to #699 
   
   ### Changes Made
   - [ ] Bug fix
   - [x] New feature
   - [x] Refactoring
   - [ ] Documentation
   - [ ] Test
   - [ ] CI/CD pipeline
   - [ ] Other
   
   ### Breaking Changes
   - [x] Yes
   - [ ] No
   
   ### Checklist
   
   - [ ] Added or updated unit tests for all changes
   - [ ] Added or updated documentation for all changes
   - [x] Successfully built and ran all unit tests or manual tests locally
   - [ ] PR title follows "MAHOUT-XXX: Brief Description" format (if related to an issue)
   - [ ] Code follows ASF guidelines
   


