ryankert01 commented on issue #714:
URL: https://github.com/apache/mahout/issues/714#issuecomment-3772856469

   # Fused L2 Norm + Amplitude Encoding Kernel Analysis
   
   ## Current Approach (2 Kernels)
   
   ```
   Kernel 1: L2 norm reduction    → Read input (N elements)
   D2H copy: Get inv_norm         → 8 bytes
   Kernel 2: Amplitude encode     → Read input + Write output (2N elements)
   
   Total memory traffic: 2N reads + N writes = 3N elements
   ```
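
   For orientation, here is a minimal sketch of this two-kernel pipeline. The kernel and helper names, the float accumulator, and the 256-thread block size are illustrative assumptions, not the actual Mahout kernels:
   
   ```cuda
   #include <cuda_runtime.h>
   #include <math.h>
   
   // Kernel 1: grid-stride sum of squares, reduced per block in shared memory,
   // then accumulated into a single device scalar with atomicAdd.
   __global__ void l2_norm_squared(const float* __restrict__ in, int n,
                                   float* norm_sq) {
       __shared__ float smem[256];               // assumes blockDim.x == 256
       float local = 0.0f;
       for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
            i += gridDim.x * blockDim.x) {
           local += in[i] * in[i];
       }
       smem[threadIdx.x] = local;
       __syncthreads();
       for (int s = blockDim.x / 2; s > 0; s >>= 1) {
           if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
           __syncthreads();
       }
       if (threadIdx.x == 0) atomicAdd(norm_sq, smem[0]);
   }
   
   // Kernel 2: scale every element by 1/||x||_2 once the norm is known.
   __global__ void amplitude_encode(const float* __restrict__ in,
                                    float* __restrict__ out, int n,
                                    float inv_norm) {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) out[i] = in[i] * inv_norm;
   }
   
   // Host path: reduce, copy the scalar back (the small D2H step), encode.
   void encode(const float* d_in, float* d_out, int n) {
       float* d_norm_sq;
       cudaMalloc(&d_norm_sq, sizeof(float));
       cudaMemset(d_norm_sq, 0, sizeof(float));
   
       int block = 256, grid = (n + block - 1) / block;
       l2_norm_squared<<<grid, block>>>(d_in, n, d_norm_sq);
   
       float norm_sq = 0.0f;
       cudaMemcpy(&norm_sq, d_norm_sq, sizeof(float),
                  cudaMemcpyDeviceToHost);       // implicit sync point
       // (host-side validation of norm_sq, e.g. a zero-norm check, goes here)
   
       amplitude_encode<<<grid, block>>>(d_in, d_out, n, 1.0f / sqrtf(norm_sq));
       cudaFree(d_norm_sq);
   }
   ```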
   
   ## Fused Approach (1 Kernel)
   
   ```
   Single kernel: Norm + Encode   → Read input + Write output (N + N elements)
   
   Total memory traffic: N reads + N writes = 2N elements (~33% less than the current 3N)
   ```
   
   ## The Challenge
   
   The L2 norm requires a **global reduction** (summing all x²) before encoding can 
start, and CUDA provides no implicit cross-block synchronization within a single 
kernel launch; a grid-wide barrier is only available through the cooperative-launch 
machinery discussed below.
   
   ## Implementation Options
   
   | Approach | Pros | Cons |
   |----------|------|------|
   | Cooperative Groups (sketch below) | True single kernel, clean | Limited grid size, requires `cudaLaunchCooperativeKernel` |
   | Two-phase in one kernel | Saves kernel launch overhead | Still needs a device-wide sync point |
   | Atomic reduction | Simple to implement | Slow for large N due to contention |
   | Persistent threads | Flexible | Complex, hard to maintain |
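
   To make the first row concrete, here is a minimal sketch of the cooperative-groups option. The kernel name and block size are hypothetical; a grid-wide sync requires launching via `cudaLaunchCooperativeKernel` (and typically compiling with `-rdc=true`), and the double-precision `atomicAdd` needs compute capability 6.0+:
   
   ```cuda
   #include <cooperative_groups.h>
   #include <math.h>
   namespace cg = cooperative_groups;
   
   // Global accumulator for sum(x^2); zeroed inside the kernel below.
   __device__ double g_norm_sq;
   
   __global__ void fused_l2_amplitude_encode(const float* __restrict__ in,
                                             float* __restrict__ out, int n) {
       cg::grid_group grid = cg::this_grid();
   
       // Reset the accumulator and make the reset visible to every block.
       if (grid.thread_rank() == 0) g_norm_sq = 0.0;
       grid.sync();
   
       // Phase 1: grid-stride partial sums, block reduction, global atomicAdd.
       double local = 0.0;
       for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
            i += gridDim.x * blockDim.x) {
           double v = in[i];
           local += v * v;
       }
       __shared__ double smem[256];              // assumes blockDim.x == 256
       smem[threadIdx.x] = local;
       __syncthreads();
       for (int s = blockDim.x / 2; s > 0; s >>= 1) {
           if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
           __syncthreads();
       }
       if (threadIdx.x == 0) atomicAdd(&g_norm_sq, smem[0]);
   
       // Device-wide barrier: the step a normal launch cannot express without
       // splitting the work into two kernels.
       grid.sync();
   
       // Phase 2: scale by 1/||x||_2. Note the input is re-read here; for data
       // that overflows the L2 cache the real read savings fall short of the
       // ideal single pass. (Real code should also guard against a zero norm,
       // as the host-side validation does today.)
       float inv_norm = (float)rsqrt(g_norm_sq);
       for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
            i += gridDim.x * blockDim.x) {
           out[i] = in[i] * inv_norm;
       }
   }
   ```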
   
   ## Cost-Benefit Analysis
   
   | Data Size | Current Overhead | Fused Benefit | Verdict |
   |-----------|------------------|---------------|---------|
   | Small (<1MB) | ~10µs kernel launch | Negligible | Not worth complexity |
   | Medium (1-10MB) | ~20µs total | Minor bandwidth savings | Marginal benefit |
   | Large (>10MB) | Memory bandwidth bound | ~33% less memory traffic | Potentially worth it |
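
   A rough back-of-the-envelope check of the large-input row (the bandwidth figure is a hypothetical round number, not a measurement of this code):
   
   ```
   64M float32 elements (256 MB), ~500 GB/s effective bandwidth:
     Current: 3N traffic = 768 MB  → ~1.5 ms
     Fused:   2N traffic = 512 MB  → ~1.0 ms   (saves ~0.5 ms)
   
   For a 1 MB input the same ratio saves only ~2 µs, on the order of a
   single kernel launch.
   ```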
   
   ## Current Implementation Benefits
   
   1. **Simplicity** - Two well-optimized, separate kernels are easier to 
maintain
   2. **Error handling** - Host-side validation between kernels provides clear 
error messages
   3. **Async pipeline** - Large data already uses overlapped 
transfers/computation
   4. **Debugging** - Easier to profile and optimize individual kernels
   
   ## When Fusion Would Make Sense
   
   - Ultra-low-latency requirements (every microsecond matters)
   - Very large tensors where memory bandwidth is the primary bottleneck
   - Profiling shows kernel launch overhead is a significant fraction of total 
time
   - Batch encoding where the same norm is applied to many outputs (not the 
case here)
   
   ## Recommendation
   
   **Not recommended for this codebase at this time.**
   
   Reasons:
   1. Current benchmarks already show 2-11x speedup with the zero-copy approach
   2. The complexity/benefit tradeoff doesn't favor fusion
   3. Cooperative groups have grid-size limitations that may not work for all 
input sizes (see the sketch after this list)
   4. The 8-byte D2H copy for validation is negligible
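
   On reason 3: a cooperative launch requires the whole grid to be resident on the device at once, so the block count is capped by occupancy rather than by problem size. A hedged host-side sketch of that check, using standard CUDA runtime calls; the wrapper name is hypothetical and the kernel is the fused sketch above:
   
   ```cuda
   #include <cuda_runtime.h>
   
   void launch_fused(const float* d_in, float* d_out, int n) {
       int dev = 0, coop = 0, sm_count = 0, blocks_per_sm = 0;
       cudaGetDevice(&dev);
       cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, dev);
       cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev);
       cudaOccupancyMaxActiveBlocksPerMultiprocessor(
           &blocks_per_sm, fused_l2_amplitude_encode,
           /*blockSize=*/256, /*dynamicSMemSize=*/0);
   
       // The whole grid must be co-resident, so gridDim.x is capped by
       // occupancy; the kernel's grid-stride loops absorb any larger n.
       int max_blocks = sm_count * blocks_per_sm;
       int needed = (n + 255) / 256;
       int blocks = needed < max_blocks ? needed : max_blocks;
   
       if (coop) {
           void* args[] = { (void*)&d_in, (void*)&d_out, (void*)&n };
           cudaLaunchCooperativeKernel((void*)fused_l2_amplitude_encode,
                                       dim3(blocks), dim3(256), args, 0, 0);
       }
       // else: fall back to the existing two-kernel path.
   }
   ```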
   
   ## Future Consideration
   
   If profiling reveals that:
   - Kernel launch overhead > 10% of total encode time, OR
   - Memory bandwidth is saturated on large inputs
   
   Then revisit fusion using cooperative groups for medium-sized inputs where 
the grid fits within device limits.
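
   One way to test the first criterion without a full profiler run: time the existing encode path with CUDA events and compare it against the cost of an empty launch as a crude per-launch estimate. This sketch reuses the hypothetical `encode()` helper from the earlier two-kernel example:
   
   ```cuda
   #include <cstdio>
   
   __global__ void noop_kernel() {}   // its launch time ≈ launch overhead
   
   void check_launch_overhead(const float* d_in, float* d_out, int n) {
       cudaEvent_t t0, t1;
       cudaEventCreate(&t0);
       cudaEventCreate(&t1);
   
       // Wall time of the current two-kernel encode path.
       cudaEventRecord(t0);
       encode(d_in, d_out, n);
       cudaEventRecord(t1);
       cudaEventSynchronize(t1);
       float encode_ms = 0.0f;
       cudaEventElapsedTime(&encode_ms, t0, t1);
   
       // Crude per-launch overhead: an empty kernel launch.
       cudaEventRecord(t0);
       noop_kernel<<<1, 1>>>();
       cudaEventRecord(t1);
       cudaEventSynchronize(t1);
       float launch_ms = 0.0f;
       cudaEventElapsedTime(&launch_ms, t0, t1);
   
       // The current path has two launches; flag when they dominate.
       if (2.0f * launch_ms > 0.10f * encode_ms)
           printf("launch overhead >10%% of encode time: revisit fusion\n");
   
       cudaEventDestroy(t0);
       cudaEventDestroy(t1);
   }
   ```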
   
   ## References
   
   - [CUDA Cooperative Groups](https://developer.nvidia.com/blog/cooperative-groups/)
   - [Parallel Reduction Optimization](https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf)
   

