400Ping opened a new issue, #993:
URL: https://github.com/apache/mahout/issues/993

   ## Summary
   
   This roadmap defines the implementation path for **streaming and large-data 
support** across all QDP encodings: adding **IQP / IQP-Z** to the Parquet 
streaming pipeline, introducing **additional input formats** (e.g. chunked 
NumPy, HDF5), and completing **documentation and baselines** so that “encode 
from file” is a first-class workflow. It is scoped to be comparable in impact 
to [Pipeline Tuning #969](https://github.com/apache/mahout/issues/969), but 
focuses on **feature coverage and input ecosystem** rather than pipeline 
performance.
   
   ## Motivation
   
   - **Gap:** `encode_from_parquet()` currently supports only `amplitude`, 
`angle`, and `basis`. IQP and IQP-Z have kernels and in-memory `encode()` / 
`encode_batch()`, but **no streaming path** from Parquet or other large files.
   - **Goal:** Enable all encodings (including IQP) to use the existing 
dual-stream pipeline from Parquet and, in later phases, from other large-data 
sources.
   - **Non-overlap with #969:** #969 addresses pipeline **performance** 
(observability, chunk/pool tuning, event-based buffer reuse). This roadmap 
addresses **which encodings can stream** and **which sources they can read 
from** (streaming encodings + input formats + docs).
   
   ---
   
   ## Phase 1: IQP / IQP-Z streaming encoding
   
   **Deliverables**
   
   - Support `encode_from_parquet(path, num_qubits, "iqp" | "iqp-z")` so that 
large Parquet files are processed through the existing dual-stream pipeline (a 
usage sketch follows the deliverables).
   - Unit and integration tests; optional small throughput benchmark for 
Parquet + IQP.
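   A minimal usage sketch of the target call shape. The argument order comes 
from the deliverable above; the return type and error handling here are 
assumptions, not the current API:

```rust
// Hypothetical usage once Phase 1 lands. encode_from_parquet() exists today
// for amplitude/angle/basis; the "iqp"/"iqp-z" values are the new part, and
// anyhow::Result is an assumed stand-in for the real return type.
fn stream_iqp_example(num_qubits: usize) -> anyhow::Result<()> {
    let _full = encode_from_parquet("features.parquet", num_qubits, "iqp")?;
    let _z_only = encode_from_parquet("features.parquet", num_qubits, "iqp-z")?;
    Ok(())
}
```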
   
   **Implementation outline**
   
   1. **Add an IQP `ChunkEncoder`** in `qdp-core/src/encoding/` (following the 
pattern of amplitude/angle/basis; a sketch follows this list):
      - Implement `ChunkEncoder`: `validate_sample_size`, `needs_staging_copy`, 
`init_state`, `encode_chunk`.
      - Sample sizes: full IQP takes `num_qubits + num_qubits*(num_qubits-1)/2` 
features per sample (one angle per qubit plus one per qubit pair); IQP-Z takes 
`num_qubits`.
      - Reuse the kernel calls and length checks from 
`qdp-core/src/gpu/encodings/iqp.rs`.
   2. **Wire into `encode_from_parquet()`** in `encoding/mod.rs`: add branches 
for `"iqp"` and `"iqp-z"` that call `stream_encode` with the appropriate IQP 
encoder variant (see the dispatch sketch below the key files).
   3. **Tests:** Reuse logic from `tests/iqp_encoding.rs`; add an integration 
test that reads a small Parquet file and runs stream encoding for IQP/IQP-Z.
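   A minimal sketch of step 1. The `ChunkEncoder` trait below is a hypothetical 
reconstruction from the method names in the outline; the real trait in 
`qdp-core/src/encoding/` may differ in signatures:

```rust
use anyhow::{bail, Result};

// Hypothetical reconstruction of the trait from the method names in the
// outline above; the real qdp-core trait may differ.
pub trait ChunkEncoder {
    fn validate_sample_size(&self, sample_size: usize) -> Result<()>;
    fn needs_staging_copy(&self) -> bool;
    fn init_state(&mut self, num_samples: usize) -> Result<()>;
    fn encode_chunk(&mut self, chunk: &[f64], sample_offset: usize) -> Result<()>;
}

/// Which IQP variant to stream-encode.
#[derive(Clone, Copy)]
pub enum IqpVariant {
    Full,
    ZOnly,
}

pub struct IqpChunkEncoder {
    num_qubits: usize,
    variant: IqpVariant,
}

impl IqpChunkEncoder {
    pub fn new(num_qubits: usize, variant: IqpVariant) -> Self {
        Self { num_qubits, variant }
    }

    /// Features per sample: full IQP takes one angle per qubit plus one per
    /// qubit pair; IQP-Z takes only the per-qubit angles.
    fn expected_sample_size(&self) -> usize {
        match self.variant {
            IqpVariant::ZOnly => self.num_qubits,
            IqpVariant::Full => {
                self.num_qubits + self.num_qubits * (self.num_qubits - 1) / 2
            }
        }
    }
}

impl ChunkEncoder for IqpChunkEncoder {
    fn validate_sample_size(&self, sample_size: usize) -> Result<()> {
        if sample_size != self.expected_sample_size() {
            bail!(
                "IQP expects {} features per sample, got {}",
                self.expected_sample_size(),
                sample_size
            );
        }
        Ok(())
    }

    fn needs_staging_copy(&self) -> bool {
        // Assumption: IQP angles can be consumed straight from the chunk
        // buffer, mirroring the angle encoder; flip this if a staging copy
        // turns out to be required.
        false
    }

    fn init_state(&mut self, _num_samples: usize) -> Result<()> {
        // Real code would allocate the GPU state here via the existing
        // wrappers in qdp-core/src/gpu/encodings/iqp.rs.
        Ok(())
    }

    fn encode_chunk(&mut self, _chunk: &[f64], _sample_offset: usize) -> Result<()> {
        // Real code would launch the IQP kernel on this chunk, reusing the
        // length checks from gpu/encodings/iqp.rs.
        Ok(())
    }
}
```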
   
   **Key files:** `qdp-core/src/encoding/mod.rs`, new or extended 
`encoding/iqp.rs` (streaming), `qdp-core/src/gpu/encodings/iqp.rs` (existing), 
`qdp-core/tests/`.
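   The matching dispatch for step 2, building on the sketch types above. 
`stream_encode` would be called by the real branches inside 
`encode_from_parquet()`; the helper below only shows the string-to-variant 
mapping and is a placeholder, not the actual change:

```rust
use anyhow::{bail, Result};

// Hypothetical helper mapping the encoding string to an IQP encoder; the
// real change would add matching branches inside encode_from_parquet(),
// each calling stream_encode with the constructed encoder.
fn iqp_encoder_for(encoding: &str, num_qubits: usize) -> Result<IqpChunkEncoder> {
    match encoding {
        "iqp" => Ok(IqpChunkEncoder::new(num_qubits, IqpVariant::Full)),
        "iqp-z" => Ok(IqpChunkEncoder::new(num_qubits, IqpVariant::ZOnly)),
        other => bail!("unsupported IQP streaming encoding: {other}"),
    }
}
```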
   
   ---
   
   ## Phase 2: Additional input formats (streaming readers)
   
   **Deliverables**
   
   - At least one large-data–friendly streaming reader implemented and plugged 
into the encoding pipeline.
   - Candidates (from [readers 
README](https://github.com/apache/mahout/blob/main/qdp/docs/readers/README.md) 
Future Enhancements): **chunked NumPy** (large `.npy`), or **HDF5**.
   
   **Implementation outline**
   
   1. **Implement a new reader** satisfying `StreamingDataReader` in 
`qdp-core/src/reader.rs` (`read_chunk(&mut self, buffer: &mut [f64]) -> 
Result<usize>`); a reader sketch follows this list.
   2. **Integrate with encoding:** Extend `encode_from_*` to accept the new 
reader, or select the reader by path/extension, so that `stream_encode` can 
consume data from the new source (see the dispatch sketch after the key files).
   3. **Tests and docs:** Unit tests for the new reader; at least one 
end-to-end test (e.g. amplitude or IQP from the new format). Update 
`qdp/docs/readers/README.md`.
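   A sketch of a chunked `.npy` reader against the `StreamingDataReader` shape 
quoted above. The trait is restated here so the sketch is self-contained; the 
`std::io` error type and the fixed-size header are simplifying assumptions:

```rust
use std::fs::File;
use std::io::{BufReader, ErrorKind, Read, Result};

// Hypothetical restatement of the trait for a self-contained sketch; the
// real definition lives in qdp-core/src/reader.rs.
pub trait StreamingDataReader {
    /// Fill `buffer` with up to `buffer.len()` f64 values; return how many
    /// were written (0 signals end of stream).
    fn read_chunk(&mut self, buffer: &mut [f64]) -> Result<usize>;
}

pub struct NpyStreamReader {
    inner: BufReader<File>,
}

impl NpyStreamReader {
    pub fn open(path: &str) -> Result<Self> {
        let mut inner = BufReader::new(File::open(path)?);
        // Simplification: real code must parse the .npy header (magic,
        // version, dtype, shape, padding) and verify little-endian f64;
        // here we assume a fixed 128-byte header and skip it.
        let mut header = [0u8; 128];
        inner.read_exact(&mut header)?;
        Ok(Self { inner })
    }
}

impl StreamingDataReader for NpyStreamReader {
    fn read_chunk(&mut self, buffer: &mut [f64]) -> Result<usize> {
        let mut written = 0;
        let mut bytes = [0u8; 8];
        while written < buffer.len() {
            match self.inner.read_exact(&mut bytes) {
                Ok(()) => {
                    buffer[written] = f64::from_le_bytes(bytes);
                    written += 1;
                }
                // Clean end of file: hand back whatever was read so far.
                Err(e) if e.kind() == ErrorKind::UnexpectedEof => break,
                Err(e) => return Err(e),
            }
        }
        Ok(written)
    }
}
```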
   
   **Key files:** `qdp-core/src/readers/`, `qdp-core/src/reader.rs`, 
`qdp/docs/readers/README.md`.
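   For step 2's "select by path/extension" option, a minimal sketch building on 
the reader types above (the boxed-trait return and the fallback branch are 
placeholders, not the current API):

```rust
use std::io::{Error, ErrorKind, Result};

// Hypothetical extension-based reader selection; the existing Parquet
// reader would fill the fallback branch in real code.
fn open_reader(path: &str) -> Result<Box<dyn StreamingDataReader>> {
    if path.ends_with(".npy") {
        Ok(Box::new(NpyStreamReader::open(path)?))
    } else {
        Err(Error::new(
            ErrorKind::InvalidInput,
            format!("no streaming reader registered for {path}"),
        ))
    }
}
```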
   
   ---
   
   ## Phase 3: Baselines and documentation
   
   **Deliverables**
   
   - A reproducible benchmark flow (or, at minimum, a documented throughput 
methodology) for “large file + all encodings (including IQP)”.
   - Complete **Getting Started** and **Examples** for QDP (currently TODO in 
the docs), making “encode from file” a first-class documented workflow.
   
   **Implementation outline**
   
   1. **Benchmark:** Define and document a small workflow (e.g. in 
`qdp-python/benchmark/` or `qdp/docs/`) for Parquet + 
amplitude/angle/basis/IQP; align with the #969 Phase 2 baseline methodology 
where useful (a timing sketch follows this list).
   2. **Docs:**  
      - **Getting Started:** Install, minimal example, typical `encode` / 
`encode_from_parquet` usage (including IQP).  
      - **Examples:** 2–3 full examples (e.g. in-memory amplitude, Parquet + 
IQP, DLPack → PyTorch).  
      - Optionally: short API summary in the QDP API doc.
   3. **Relationship to #969:** Reuse #969 Phase 2 observability/baseline flow 
if available, to avoid duplicate tooling.
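   A hedged timing sketch for the benchmark flow in step 1. The file name, row 
count, and `encode_from_parquet`'s return type are placeholders; the encoding 
list mirrors the phases above:

```rust
use std::time::Instant;

// Hypothetical throughput loop for "large file + all encodings"; in real
// code the row count would come from the Parquet metadata rather than a
// hard-coded constant.
fn main() -> anyhow::Result<()> {
    let num_qubits = 10;
    let rows: f64 = 1_000_000.0; // assumed row count of the benchmark file
    for encoding in ["amplitude", "angle", "basis", "iqp", "iqp-z"] {
        let start = Instant::now();
        let _state = encode_from_parquet("bench.parquet", num_qubits, encoding)?;
        println!(
            "{encoding}: {:.0} samples/s",
            rows / start.elapsed().as_secs_f64()
        );
    }
    Ok(())
}
```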
   
   ---
   
   ## Phase order and dependencies
   
   - **Phase 1** is independent; it depends only on the current pipeline and 
the existing IQP kernels.
   - **Phase 2** builds on the same `stream_encode` interface (can be 
parallelized with Phase 1 once reader integration is agreed).
   - **Phase 3** can be done in parallel with Phase 1/2; the “large file + IQP” 
benchmark is most meaningful after Phase 1 is merged.
   
   **Suggested order:** Land Phase 1 first, then Phase 2; Phase 3 docs can 
start early, with benchmark steps finalized after Phase 1.
   
   ---
   
   ## Alternatives considered
   
   - **Only document current behavior:** Does not address the missing IQP 
streaming path or additional formats.
   - **Single big PR:** Phased approach allows incremental review and reduces 
risk.
   
   ---
   
   ## Additional context
   
   - IQP kernel and GPU encoding already exist: `qdp-kernels/src/iqp.cu`, 
`qdp-core/src/gpu/encodings/iqp.rs`, and `qdp-core/tests/iqp_encoding.rs`.
   - Streaming pipeline and `ChunkEncoder` are in `qdp-core/src/encoding/` 
(amplitude, angle, basis); `encode_from_parquet` is in `encoding/mod.rs`.
   - Readers design: 
[qdp/docs/readers/README.md](https://github.com/apache/mahout/blob/main/qdp/docs/readers/README.md).
   

