400Ping opened a new issue, #993: URL: https://github.com/apache/mahout/issues/993
## Summary

This roadmap defines the implementation path for **streaming and large-data support** across all QDP encodings: adding **IQP / IQP-Z** to the Parquet streaming pipeline, introducing **additional input formats** (e.g. chunked NumPy, HDF5), and completing **documentation and baselines** so that “encode from file” is a first-class workflow. It is scoped to be comparable in impact to [Pipeline Tuning #969](https://github.com/apache/mahout/issues/969), but focuses on **feature coverage and the input ecosystem** rather than pipeline performance.

## Motivation

- **Gap:** `encode_from_parquet()` currently supports only `amplitude`, `angle`, and `basis`. IQP and IQP-Z have kernels and in-memory `encode()` / `encode_batch()`, but **no streaming path** from Parquet or other large files.
- **Goal:** Enable all encodings (including IQP) to use the existing dual-stream pipeline from Parquet and, in later phases, from other large-data sources.
- **Non-overlap with #969:** #969 addresses pipeline **performance** (observability, chunk/pool tuning, event-based buffer reuse). This roadmap addresses **which encodings can stream** and **which sources data can be read from** (streaming encodings + input formats + docs).

---

## Phase 1: IQP / IQP-Z streaming encoding

**Deliverables**

- Support `encode_from_parquet(path, num_qubits, "iqp" | "iqp-z")` so that large Parquet files are processed through the existing dual-stream pipeline.
- Unit and integration tests; optionally, a small throughput benchmark for Parquet + IQP.

**Implementation outline**

1. **Add an IQP `ChunkEncoder`** in `qdp-core/src/encoding/` (following the pattern of amplitude/angle/basis):
   - Implement `ChunkEncoder`: `validate_sample_size`, `needs_staging_copy`, `init_state`, `encode_chunk`.
   - IQP full: `sample_size = num_qubits + num_qubits*(num_qubits-1)/2`; IQP-Z: `num_qubits`.
   - Reuse kernel calls and length checks from `qdp-core/src/gpu/encodings/iqp.rs`.
2. **Wire into `encode_from_parquet()`** in `encoding/mod.rs`: add branches for `"iqp"` and `"iqp-z"` that call `stream_encode` with the appropriate IQP encoder variant.
3. **Tests:** Reuse logic from `tests/iqp_encoding.rs`; add an integration test that reads a small Parquet file and runs stream encode for IQP/IQP-Z.

**Key files:** `qdp-core/src/encoding/mod.rs`, new or extended `encoding/iqp.rs` (streaming), `qdp-core/src/gpu/encodings/iqp.rs` (existing), `qdp-core/tests/`.

---

## Phase 2: Additional input formats (streaming readers)

**Deliverables**

- At least one large-data-friendly streaming reader implemented and plugged into the encoding pipeline.
- Candidates (from the [readers README](https://github.com/apache/mahout/blob/main/qdp/docs/readers/README.md) Future Enhancements): **chunked NumPy** (large `.npy`) or **HDF5**.

**Implementation outline**

1. **Implement a new reader** satisfying `StreamingDataReader` in `qdp-core/src/reader.rs` (`read_chunk(&mut self, buffer: &mut [f64]) -> Result<usize>`).
2. **Integrate with encoding:** Extend `encode_from_*` to accept the new reader (or select it by path/extension) so that `stream_encode` can consume data from the new source.
3. **Tests and docs:** Unit tests for the new reader; at least one end-to-end test (e.g. amplitude or IQP from the new format). Update `qdp/docs/readers/README.md`.

**Key files:** `qdp-core/src/readers/`, `qdp-core/src/reader.rs`, `qdp/docs/readers/README.md`.

---

## Phase 3: Baselines and documentation

**Deliverables**

- A reproducible throughput description or benchmark flow for “large file + all encodings (including IQP)”.
- Complete **Getting Started** and **Examples** for QDP (currently TODO in the docs), making “encode from file” a first-class documented workflow.

**Implementation outline**

1. **Benchmark:** Define and document a small workflow (e.g. in `qdp-python/benchmark/` or `qdp/docs/`) for Parquet + amplitude/angle/basis/iqp; align with the #969 Phase 2 baseline methodology where useful.
2. **Docs:**
   - **Getting Started:** Installation, a minimal example, and typical `encode` / `encode_from_parquet` usage (including IQP).
   - **Examples:** 2–3 full examples (e.g. in-memory amplitude, Parquet + IQP, DLPack → PyTorch).
   - Optionally, a short API summary in the QDP API doc.
3. **Relationship to #969:** Reuse the #969 Phase 2 observability/baseline flow if available, to avoid duplicate tooling.

---

## Phase order and dependencies

- **Phase 1** is independent; it depends only on the current pipeline and the IQP kernels.
- **Phase 2** builds on the same `stream_encode` interface (it can be parallelized with Phase 1 once reader integration is agreed).
- **Phase 3** can be done in parallel with Phases 1/2; the “large file + IQP” benchmark is most meaningful after Phase 1 is merged.

**Suggested order:** Land Phase 1 first, then Phase 2; Phase 3 docs can start early, with benchmark steps finalized after Phase 1.

---

## Alternatives considered

- **Only document current behavior:** Does not address the missing IQP streaming path or additional formats.
- **Single big PR:** A phased approach allows incremental review and reduces risk.

---

## Additional context

- The IQP kernel and GPU encoding already exist: `qdp-kernels/src/iqp.cu`, `qdp-core/src/gpu/encodings/iqp.rs`, and `qdp-core/tests/iqp_encoding.rs`.
- The streaming pipeline and `ChunkEncoder` live in `qdp-core/src/encoding/` (amplitude, angle, basis); `encode_from_parquet` is in `encoding/mod.rs`.
- Readers design: [qdp/docs/readers/README.md](https://github.com/apache/mahout/blob/main/qdp/docs/readers/README.md).
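As a concrete starting point for the Phase 1 encoder, the sample-size rules can be sketched as a minimal `ChunkEncoder` implementation. This is a hedged sketch, not qdp-core's actual trait: only the method names (`validate_sample_size`, `needs_staging_copy`) and the size formulas come from this issue; the signatures, the `IqpVariant` enum, and `IqpChunkEncoder` are hypothetical.

```rust
// Simplified stand-in for qdp-core's ChunkEncoder trait: only the method
// names come from this issue; the signatures here are illustrative.
trait ChunkEncoder {
    /// Check that each incoming sample has the feature count this
    /// encoding expects.
    fn validate_sample_size(&self, sample_size: usize) -> Result<(), String>;
    /// Whether chunks must be copied into a staging buffer before upload.
    fn needs_staging_copy(&self) -> bool;
}

/// Hypothetical variant flag: full IQP (singles + pairwise terms)
/// vs. IQP-Z (singles only).
enum IqpVariant {
    Full,
    ZOnly,
}

struct IqpChunkEncoder {
    num_qubits: usize,
    variant: IqpVariant,
}

impl IqpChunkEncoder {
    /// Features per sample, as stated in the issue:
    /// full IQP: n + n*(n-1)/2; IQP-Z: n.
    fn expected_sample_size(&self) -> usize {
        let n = self.num_qubits;
        match self.variant {
            IqpVariant::Full => n + n * (n - 1) / 2,
            IqpVariant::ZOnly => n,
        }
    }
}

impl ChunkEncoder for IqpChunkEncoder {
    fn validate_sample_size(&self, sample_size: usize) -> Result<(), String> {
        let expected = self.expected_sample_size();
        if sample_size == expected {
            Ok(())
        } else {
            Err(format!("IQP expects {expected} features per sample, got {sample_size}"))
        }
    }

    fn needs_staging_copy(&self) -> bool {
        // Assumption for the sketch; the real value should follow whatever
        // the amplitude/angle encoders do for contiguous f64 input.
        false
    }
}

fn main() {
    let full = IqpChunkEncoder { num_qubits: 4, variant: IqpVariant::Full };
    let z_only = IqpChunkEncoder { num_qubits: 4, variant: IqpVariant::ZOnly };
    // 4 qubits: full IQP needs 4 + 4*3/2 = 10 features; IQP-Z needs 4.
    println!("{} {}", full.expected_sample_size(), z_only.expected_sample_size());
    assert!(full.validate_sample_size(10).is_ok());
    assert!(z_only.validate_sample_size(10).is_err());
}
```

The real implementation would additionally carry the GPU state for `init_state` / `encode_chunk` and delegate to the existing kernel calls in `qdp-core/src/gpu/encodings/iqp.rs`, which this sketch omits.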
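For the Phase 2 readers, the `StreamingDataReader` contract can be illustrated with a toy in-memory reader. The `read_chunk` signature is the one quoted in this issue from `qdp-core/src/reader.rs`; `VecReader`, the `String` error type, and the end-of-stream convention (`Ok(0)`) are stand-ins for illustration, not the real reader API.

```rust
// The trait signature below is the one quoted in this issue; the error
// type is simplified to String so the sketch is self-contained.
trait StreamingDataReader {
    /// Fill `buffer` with up to `buffer.len()` values and return how many
    /// were written; returning 0 signals end of stream (an assumption here).
    fn read_chunk(&mut self, buffer: &mut [f64]) -> Result<usize, String>;
}

/// Toy in-memory reader standing in for a chunked .npy or HDF5 source.
struct VecReader {
    data: Vec<f64>,
    pos: usize,
}

impl StreamingDataReader for VecReader {
    fn read_chunk(&mut self, buffer: &mut [f64]) -> Result<usize, String> {
        let remaining = self.data.len() - self.pos;
        let n = remaining.min(buffer.len());
        buffer[..n].copy_from_slice(&self.data[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

fn main() {
    // 10 values drained through a 4-slot chunk buffer: 4, 4, 2, then EOF.
    let mut reader = VecReader { data: (0..10).map(f64::from).collect(), pos: 0 };
    let mut buf = [0.0f64; 4];
    let mut chunk_sizes = Vec::new();
    loop {
        let n = reader.read_chunk(&mut buf).expect("read failed");
        if n == 0 {
            break;
        }
        chunk_sizes.push(n);
    }
    println!("{:?}", chunk_sizes); // prints [4, 4, 2]
}
```

A file-backed `.npy` or HDF5 reader would replace the `Vec` with buffered file I/O plus format-specific header parsing, but the loop on the consuming side (`stream_encode`) would be unchanged, which is the point of the trait.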
