ryankert01 commented on issue #714: URL: https://github.com/apache/mahout/issues/714#issuecomment-3772856469
# Fused L2 Norm + Amplitude Encoding Kernel Analysis

## Current Approach (2 Kernels)

```
Kernel 1: L2 norm reduction  → Read input (N elements)
D2H copy: Get inv_norm       → 8 bytes
Kernel 2: Amplitude encode   → Read input + Write output (2N elements)

Total memory traffic: 2N reads + N writes (3N)
```

## Fused Approach (1 Kernel)

```
Single kernel: Norm + Encode → Read input + Write output (N + N elements)

Total memory traffic: N reads + N writes (2N, ~33% less than the current path)
```

## The Challenge

The L2 norm requires a **global reduction** (sum of all x²) before encoding can start. A standard (non-cooperative) kernel launch has no device-wide barrier across thread blocks, so grid-wide synchronization within a single kernel needs a cooperative launch.

## Implementation Options

| Approach | Pros | Cons |
|----------|------|------|
| Cooperative groups | True single kernel, clean | Limited grid size, requires `cudaLaunchCooperativeKernel` |
| Two-phase in one kernel | Saves kernel launch overhead | Still needs a device-wide sync point |
| Atomic reduction | Simple to implement | Slow for large N due to contention |
| Persistent threads | Flexible | Complex, hard to maintain |

A rough sketch of the cooperative-groups variant, plus a host-side launch guard, is appended at the end of this comment.

## Cost-Benefit Analysis

| Data Size | Current Overhead | Fused Benefit | Verdict |
|-----------|------------------|---------------|---------|
| Small (<1 MB) | ~10 µs kernel launch | Negligible | Not worth the complexity |
| Medium (1-10 MB) | ~20 µs total | Minor bandwidth savings | Marginal benefit |
| Large (>10 MB) | Memory bandwidth bound | ~33% less memory traffic | Potentially worth it |

## Current Implementation Benefits

1. **Simplicity** - Two well-optimized, separate kernels are easier to maintain
2. **Error handling** - Host-side validation between kernels provides clear error messages
3. **Async pipeline** - Large data already uses overlapped transfers/computation
4. **Debugging** - It is easier to profile and optimize individual kernels

## When Fusion Would Make Sense

- Ultra-low-latency requirements (every microsecond matters)
- Very large tensors where memory bandwidth is the primary bottleneck
- Profiling shows kernel launch overhead is a significant fraction of total time
- Batch encoding where the same norm is applied to many outputs (not the case here)

## Recommendation

**Not recommended for this codebase at this time.** Reasons:

1. Current benchmarks already show a 2-11x speedup with the zero-copy approach
2. The complexity/benefit tradeoff doesn't favor fusion
3. Cooperative groups have grid size limitations that may not work for all input sizes
4. The 8-byte D2H copy for validation is negligible

## Future Consideration

If profiling reveals that:

- kernel launch overhead exceeds 10% of total encode time, OR
- memory bandwidth is saturated on large inputs,

then revisit fusion using cooperative groups for medium-sized inputs where the grid fits within device limits.

## References

- [CUDA Cooperative Groups](https://developer.nvidia.com/blog/cooperative-groups/)
- [Parallel Reduction Optimization](https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf)
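## Appendix: Cooperative-Groups Fusion Sketch

To make the cooperative-groups option concrete, here is a minimal sketch of what a fused L2-norm + amplitude-encode kernel could look like. This is not code from the Mahout codebase; the kernel name `fused_amplitude_encode`, the double-precision types, and the single global accumulator `norm_sq` are all illustrative assumptions. It requires compute capability 6.0+, compilation with `-rdc=true`, and a cooperative launch (see the host-side guard sketched below).

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical fused kernel. Phase 1 reduces sum(x^2) into *norm_sq,
// a grid-wide barrier makes the result visible to every block, and
// phase 2 scales the input by 1/||x||_2. Assumes *norm_sq is
// zero-initialized before launch, blockDim.x is a power of two, and
// the grid is small enough to be fully co-resident (a requirement of
// cooperative launch). Dynamic shared memory: blockDim.x doubles.
__global__ void fused_amplitude_encode(const double* __restrict__ in,
                                       double* __restrict__ out,
                                       double* norm_sq,
                                       size_t n) {
    cg::grid_group grid = cg::this_grid();
    extern __shared__ double sdata[];

    // Phase 1: grid-stride accumulation of x^2, then a block reduction.
    double local = 0.0;
    for (size_t i = grid.thread_rank(); i < n; i += grid.size())
        local += in[i] * in[i];
    sdata[threadIdx.x] = local;
    __syncthreads();
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(norm_sq, sdata[0]);  // one atomic per block

    // Device-wide barrier: after this, *norm_sq holds the full sum.
    grid.sync();

    // Phase 2: amplitude encoding, out[i] = in[i] / ||x||_2.
    double inv_norm = rsqrt(*norm_sq);
    for (size_t i = grid.thread_rank(); i < n; i += grid.size())
        out[i] = in[i] * inv_norm;
}
```

Note that this simple two-phase form still touches the input twice; realizing the single-read figure from the analysis above would require keeping each thread's elements in registers across the barrier (or relying on L2 hits), which further constrains the problem sizes where fusion pays off.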

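The "grid size limitations" caveat comes from cooperative launches requiring every block to be co-resident on the device. A host-side guard could look roughly like the following sketch; it assumes the hypothetical `fused_amplitude_encode` kernel above and lets the caller fall back to the existing two-kernel path when the fused launch is not viable.

```
#include <cuda_runtime.h>

// Hypothetical host-side wrapper: cap the grid at the occupancy limit so
// the cooperative launch can succeed, and report whether the fused path
// was usable at all (false -> caller uses the current two-kernel path).
bool launch_fused_encode(const double* d_in, double* d_out,
                         double* d_norm_sq, size_t n) {
    int dev = 0, sm_count = 0, blocks_per_sm = 0;
    const int block = 256;
    const size_t shmem = block * sizeof(double);  // dynamic shared memory

    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, fused_amplitude_encode, block, shmem);

    int max_blocks = sm_count * blocks_per_sm;
    int wanted = (int)((n + block - 1) / block);
    int grid = wanted < max_blocks ? wanted : max_blocks;
    if (grid <= 0) return false;

    cudaMemset(d_norm_sq, 0, sizeof(double));  // kernel accumulates into this
    void* args[] = { (void*)&d_in, (void*)&d_out, (void*)&d_norm_sq, (void*)&n };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void*)fused_amplitude_encode, dim3(grid), dim3(block), args, shmem, 0);
    return err == cudaSuccess;
}
```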