This is an automated email from the ASF dual-hosted git repository.
richhuang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/mahout.git
The following commit(s) were added to refs/heads/main by this push:
new d811c26b4 [QDP] Update NVTX workflow docs for new async pipeline capture (#939)
d811c26b4 is described below
commit d811c26b4091a633c12f7591a7e9c0c7b04ac9a5
Author: KUAN-HAO HUANG <[email protected]>
AuthorDate: Tue Jan 27 12:31:57 2026 +0800
[QDP] Update NVTX workflow docs for new async pipeline capture (#939)
* Update NVTX workflow docs for new async pipeline capture
* remove redundant part
---
qdp/docs/observability/NVTX_USAGE.md | 87 ++++++++++++++---------------------
qdp/qdp-core/examples/nvtx_profile.rs | 12 +++--
qdp/qdp-core/src/gpu/pipeline.rs | 15 ++++--
3 files changed, 55 insertions(+), 59 deletions(-)
diff --git a/qdp/docs/observability/NVTX_USAGE.md b/qdp/docs/observability/NVTX_USAGE.md
index a4fe92ee1..179f1ce09 100644
--- a/qdp/docs/observability/NVTX_USAGE.md
+++ b/qdp/docs/observability/NVTX_USAGE.md
@@ -4,19 +4,14 @@
NVTX (NVIDIA Tools Extension) provides performance markers visible in Nsight
Systems. This project uses zero-cost macros that compile to no-ops when the
`observability` feature is disabled.
-## Build with NVTX
+## Run the NVTX Example
-Default builds exclude NVTX for zero overhead. Enable profiling with:
+Default builds exclude NVTX for zero overhead. The example below uses the
+async pipeline workload (large input) to surface the new pipeline markers.
```bash
cd mahout/qdp
-cargo build -p qdp-core --example nvtx_profile --features observability --release
-```
-
-## Run Example
-
-```bash
-./target/release/examples/nvtx_profile
+cargo run -p qdp-core --example nvtx_profile --features observability --release
```
**Expected output:**
@@ -24,7 +19,7 @@ cargo build -p qdp-core --example nvtx_profile --features observability --releas
=== NVTX Profiling Example ===
✓ Engine initialized
-✓ Created test data: 1024 elements
+✓ Created test data: 262144 elements
Starting encoding (NVTX markers will appear in Nsight Systems)...
Expected NVTX markers:
@@ -32,9 +27,10 @@ Expected NVTX markers:
- CPU::L2Norm
- GPU::Alloc
- GPU::H2DCopy
- - GPU::KernelLaunch
- - GPU::Synchronize
- - DLPack::Wrap
+ - GPU::CopyEventRecord
+ - GPU::H2D_Stage
+ - GPU::Kernel
+ - GPU::ComputeSync
✓ Encoding succeeded
✓ DLPack pointer: 0x558114be6250
@@ -45,11 +41,15 @@ Expected NVTX markers:
## Profile with Nsight Systems
+Focus capture on the `Mahout::Encode` range (recommended):
+
```bash
-nsys profile --trace=cuda,nvtx -o report ./target/release/examples/nvtx_profile
+nsys profile --trace=cuda,nvtx --capture-range=nvtx \
+ --nvtx-capture=Mahout::Encode --force-overwrite=true -o nvtx-workflow \
+  cargo run -p qdp-core --example nvtx_profile --features observability --release
```
-This generates `report.nsys-rep` and `report.sqlite`.
+This generates `nvtx-workflow.nsys-rep` and `nvtx-workflow.sqlite`.
## Viewing Results
@@ -64,48 +64,18 @@ nsys-ui report.nsys-rep
In the GUI timeline view, you will see:
- Colored blocks for each NVTX marker
- CPU timeline showing `CPU::L2Norm`
-- GPU timeline showing `GPU::Alloc`, `GPU::H2DCopy`, `GPU::Kernel`
+- GPU timeline showing `GPU::Alloc`, `GPU::H2DCopy`, `GPU::CopyEventRecord`, `GPU::H2D_Stage`, `GPU::Kernel`, `GPU::ComputeSync`
- Overall workflow covered by `Mahout::Encode`
### Command Line Statistics
-View summary statistics:
+NVTX range summary:
```bash
-nsys stats report.nsys-rep
-```
-
-**Example NVTX Range Summary output:**
-```
- Time (%)  Total Time (ns)  Instances      Avg (ns)      Med (ns)    Min (ns)    Max (ns)  StdDev (ns)     Style           Range
- --------  ---------------  ---------  ------------  ------------  ----------  ----------  -----------  --------  --------------
-     50.0       11,207,505          1  11,207,505.0  11,207,505.0  11,207,505  11,207,505          0.0  StartEnd  Mahout::Encode
-     48.0       10,759,758          1  10,759,758.0  10,759,758.0  10,759,758  10,759,758          0.0  StartEnd  GPU::Alloc
-      1.8          413,753          1     413,753.0     413,753.0     413,753     413,753          0.0  StartEnd  CPU::L2Norm
-      0.1           15,873          1      15,873.0      15,873.0      15,873      15,873          0.0  StartEnd  GPU::H2DCopy
-      0.0              317          1         317.0         317.0         317         317          0.0  StartEnd  GPU::KernelLaunch
+nsys stats --report nvtx_sum nvtx-workflow.nsys-rep
```
-The output shows:
-- Time percentage for each operation
-- Total time in nanoseconds
-- Number of instances
-- Average, median, min, max execution times
-
-**CUDA API Summary** shows detailed CUDA call statistics:
-
- Time (%)  Total Time (ns)  Num Calls     Avg (ns)     Med (ns)  Min (ns)    Max (ns)  StdDev (ns)  Name
- --------  ---------------  ---------  -----------  -----------  --------  ----------  -----------  --------------------
-     99.2       11,760,277          2  5,880,138.5  5,880,138.5     2,913  11,757,364  8,311,652.0  cuMemAllocAsync
-      0.4           45,979          2     22,989.5     22,989.5     7,938      38,041     21,286.0  cuMemcpyHtoDAsync_v2
-      0.1           14,722          1     14,722.0     14,722.0    14,722      14,722          0.0  cuEventCreate
-      0.1           13,100          3      4,366.7      3,512.0       861       8,727      4,002.0  cuStreamSynchronize
-      0.1            9,468         11        860.7        250.0       114       4,671      1,453.3  cuCtxSetCurrent
-      0.1            6,479          1      6,479.0      6,479.0     6,479       6,479          0.0  cuEventDestroy_v2
-      0.0            4,599          2      2,299.5      2,299.5     1,773       2,826        744.6  cuMemFreeAsync
-- Memory allocation (`cuMemAllocAsync`)
-- Memory copies (`cuMemcpyHtoDAsync_v2`)
-- Stream synchronization (`cuStreamSynchronize`)
+Note: very short pipeline ranges may be easier to verify in the GUI timeline.
## NVTX Markers
@@ -115,9 +85,13 @@ The following markers are tracked:
- `CPU::L2Norm` - L2 normalization on CPU
- `GPU::Alloc` - GPU memory allocation
- `GPU::H2DCopy` - Host-to-device memory copy
-- `GPU::KernelLaunch` - CPU-side kernel launch
-- `GPU::Synchronize` - CPU waiting for GPU completion
-- `DLPack::Wrap` - Conversion to DLPack pointer
+- `GPU::Kernel` - Kernel execution
+
+The following pipeline ranges are also used where applicable:
+
+- `GPU::CopyEventRecord` - Record copy completion event
+- `GPU::H2D_Stage` - Host staging copy into pinned buffer
+- `GPU::ComputeSync` - Compute stream synchronization
## Using Profiling Macros
@@ -146,3 +120,12 @@ Source code: `qdp-core/examples/nvtx_profile.rs`
**nsys warnings:**
Warnings about CPU sampling are normal and can be ignored. They do not affect
NVTX marker recording.
+
+## Official Docs
+
+- CUDA Runtime Profiler Control:
+ https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__PROFILER.html
+- Nsight Systems User Guide (v2026.1):
+ https://docs.nvidia.com/nsight-systems/UserGuide/index.html
+- Nsight Compute Profiling Guide:
+ https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html
diff --git a/qdp/qdp-core/examples/nvtx_profile.rs b/qdp/qdp-core/examples/nvtx_profile.rs
index 3e5c0c050..87ceeff2b 100644
--- a/qdp/qdp-core/examples/nvtx_profile.rs
+++ b/qdp/qdp-core/examples/nvtx_profile.rs
@@ -35,8 +35,11 @@ fn main() {
}
};
- // Create test data
- let data: Vec<f64> = (0..1024).map(|i| (i as f64) / 1024.0).collect();
+ // Create test data (large enough to trigger async pipeline)
+ let data_len: usize = 262_144; // 2MB of f64, exceeds async threshold
+ let data: Vec<f64> = (0..data_len)
+ .map(|i| (i as f64) / (data_len as f64))
+ .collect();
println!("✓ Created test data: {} elements", data.len());
println!();
@@ -46,11 +49,14 @@ fn main() {
println!(" - CPU::L2Norm");
println!(" - GPU::Alloc");
println!(" - GPU::H2DCopy");
+ println!(" - GPU::CopyEventRecord");
+ println!(" - GPU::H2D_Stage");
println!(" - GPU::Kernel");
+ println!(" - GPU::ComputeSync");
println!();
// Perform encoding (this will trigger NVTX markers)
- match engine.encode(&data, 10, "amplitude") {
+ match engine.encode(&data, 18, "amplitude") {
Ok(ptr) => {
println!("✓ Encoding succeeded");
println!("✓ DLPack pointer: {:p}", ptr);
diff --git a/qdp/qdp-core/src/gpu/pipeline.rs b/qdp/qdp-core/src/gpu/pipeline.rs
index 484cbc7cf..26874ab3b 100644
--- a/qdp/qdp-core/src/gpu/pipeline.rs
+++ b/qdp/qdp-core/src/gpu/pipeline.rs
@@ -125,6 +125,7 @@ impl PipelineContext {
/// `slot` must refer to a live event created by this context, and the context must
/// remain alive until the event is no longer used by any stream.
pub unsafe fn record_copy_done(&self, slot: usize) -> Result<()> {
+ crate::profile_scope!("GPU::CopyEventRecord");
validate_event_slot(&self.events_copy_done, slot)?;
let ret = cudaEventRecord(
@@ -283,7 +284,10 @@ where
// Acquire pinned staging buffer and populate it with the current chunk
let mut pinned_buf = pinned_pool.acquire();
- pinned_buf.as_slice_mut()[..chunk.len()].copy_from_slice(chunk);
+ {
+ crate::profile_scope!("GPU::H2D_Stage");
+ pinned_buf.as_slice_mut()[..chunk.len()].copy_from_slice(chunk);
+ }
// Async copy: host to device (non-blocking, on specified stream)
// Uses CUDA Runtime API (cudaMemcpyAsync) for true async copy
@@ -335,9 +339,12 @@ where
unsafe {
ctx.sync_copy_stream()?;
}
- device
- .wait_for(&ctx.stream_compute)
- .map_err(|e| MahoutError::Cuda(format!("Compute stream sync failed: {:?}", e)))?;
+ {
+ crate::profile_scope!("GPU::ComputeSync");
+ device
+ .wait_for(&ctx.stream_compute)
+ .map_err(|e| MahoutError::Cuda(format!("Compute stream sync failed: {:?}", e)))?;
+ }
}
// Buffers are dropped here (after sync), freeing GPU memory