[I] [QDP] Ray/KubeRay Integration for multi-node multi-GPU encoding Roadmap [mahout]

via GitHub Thu, 05 Feb 2026 00:20:20 -0800


400Ping opened a new issue, #1016:
URL: https://github.com/apache/mahout/issues/1016


   ## Summary
   
   This roadmap introduces an **optional Ray/KubeRay integration layer** for 
Mahout QDP to enable **multi-node, multi-GPU (distributed data-parallel)** 
encoding. The integration supports both **Parquet-first** (Ray Data) and 
**Tensor-first** (NumPy/Torch batch) entry points, with **externalized 
outputs** (write sharded artifacts + manifest; return paths/metadata) to avoid 
moving huge tensors through Ray’s object store.
   
   ## Motivation
   
   - QDP is GPU-accelerated and already supports file-based inputs (e.g., 
Parquet) via Python bindings.
   - Users often have GPUs across multiple hosts; Ray provides proven 
scheduling, GPU resource management, and fault tolerance.
   - A real workload (QDP encoding) also creates upstream engineering 
opportunities in Ray Core/Data and KubeRay.
   
   ## Design principles
   
   - Ray integration is **optional** (extra dependency), not required for core 
QDP usage.
   - Use **actor-per-GPU** to reuse CUDA context/native extension 
initialization.
   - Prefer **external outputs** (write shards + manifest) over returning large 
tensors to the driver/object store.
   - Use **bounded in-flight** submission for backpressure and memory safety.
   
   ## Phase 0: Define the integration contract
   
   Deliverables:
   - Input shard contract (paths/metadata) and output shard contract.
   - A stable `manifest` schema capturing params, commit hash, shard outputs, 
and timings.
   
   ## Phase 1: Ray Core actor-per-GPU pipeline (single-node)
   
   Deliverables:
   - `QdpWorkerActor` (`num_gpus=1`) that initializes `QdpEngine(device_id=0)` 
once and encodes assigned shards.
   - Driver that uses bounded in-flight scheduling.
   - External output writing + manifest generation.
   
   ## Phase 2: Multi-node (2 hosts / 2 GPUs) scaling + fault tolerance
   
   Deliverables:
   - Run end-to-end on two hosts (each 1 GPU) and publish baseline throughput 
and scaling behavior.
   - Define and implement idempotent shard output (retry-safe) and a commit 
protocol.
   
   ## Phase 3: Ray Data integration (Parquet-first + Tensor-first)
   
   Deliverables:
   - Parquet-first: `read_parquet -> partition/shard -> GPU actors -> write 
shards -> refs dataset`.
   - Tensor-first: `map_batches` / actor interface for NumPy/Torch batches 
using `encode_batch`.
   - Benchmarks comparing task-per-batch vs actor reuse.
   
   ## Phase 4: KubeRay operationalization
   
   Deliverables:
   - Reference deployment (RayJob/RayService) with GPUs and native extension 
packaging strategy.
   - Shared storage setup for shard outputs.
   - (Optional) E2E validation test.
   
   ## Phase 5: Connect to QDP feature roadmap
   
   Deliverables:
   - Integrate improvements such as IQP streaming from Parquet and additional 
readers into the Ray/KubeRay pipeline.
   - Re-run baselines to quantify improvements.
   
   ## Success criteria (initial)
   
   - 2-node / 2-GPU pipeline runs reliably and produces correct shard outputs + 
manifest.
   - Throughput improves vs 1-node baseline with reasonable scaling.
   - Retry does not corrupt outputs (idempotent / commit-safe).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] [QDP] Ray/KubeRay Integration for multi-node multi-GPU encoding Roadmap [mahout]

Reply via email to