ryankert01 opened a new pull request, #753: URL: https://github.com/apache/mahout/pull/753
### Purpose of PR

### Refactor QDP to Support Multiple Input Types

#### Problem

QDP needs to support multiple input types. It currently supports Parquet and Arrow IPC, and we want to add more, such as NumPy and PyTorch. The solution needed to:

1. Make it relatively easy to add more input types
2. Not sacrifice speed or memory

This PR introduces a flexible, trait-based system that achieves both goals:

#### Core Architecture

- **`DataReader` trait**: Basic batch reading interface
- **`StreamingDataReader` trait**: Advanced streaming interface for large files
- **Format implementations**: Parquet (batch + streaming), Arrow IPC (batch), **NumPy (batch)**
- **Placeholders**: PyTorch (with implementation guide)

#### Zero Performance Impact

- **Static dispatch**: No virtual function call overhead
- **Memory efficient**: Streaming is preserved (O(1) memory for any file size)
- **Zero-copy**: Direct buffer access where possible (NumPy uses `into_raw_vec_and_offset`)
- **Benchmarks**: Same performance as before the refactoring, plus a **new NumPy benchmark**
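To illustrate the trait-based design described above, here is a minimal sketch of how a batch trait, a streaming trait, and a statically dispatched consumer could fit together. Only the trait names `DataReader` and `StreamingDataReader` come from this PR. The method names (`read_batch`, `read_chunk`), their signatures, the `ReadError` type, the `InMemoryReader` format, and the `sum_streaming` helper are all assumptions made for illustration and may differ from the actual code.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; the PR's real error type is not shown.
#[derive(Debug)]
pub struct ReadError(pub String);

impl fmt::Display for ReadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "read error: {}", self.0)
    }
}

/// Batch interface: load the whole input at once (signature is assumed).
pub trait DataReader {
    /// Returns (flattened values, number of samples, values per sample).
    fn read_batch(&mut self) -> Result<(Vec<f64>, usize, usize), ReadError>;
}

/// Streaming interface: fill a caller-owned buffer chunk by chunk (assumed).
pub trait StreamingDataReader: DataReader {
    /// Writes up to `buffer.len()` values and returns how many were written.
    /// A return value of 0 signals the end of the input.
    fn read_chunk(&mut self, buffer: &mut [f64]) -> Result<usize, ReadError>;
}

/// Toy in-memory "format" standing in for Parquet, Arrow IPC, or NumPy.
pub struct InMemoryReader {
    data: Vec<f64>,
    sample_size: usize,
    cursor: usize,
}

impl InMemoryReader {
    pub fn new(data: Vec<f64>, sample_size: usize) -> Result<Self, ReadError> {
        if sample_size == 0 || data.len() % sample_size != 0 {
            return Err(ReadError("data length is not a multiple of sample size".into()));
        }
        Ok(Self { data, sample_size, cursor: 0 })
    }
}

impl DataReader for InMemoryReader {
    fn read_batch(&mut self) -> Result<(Vec<f64>, usize, usize), ReadError> {
        // Move the buffer out instead of cloning it, so no values are copied.
        let data = std::mem::take(&mut self.data);
        self.cursor = 0;
        let num_samples = data.len() / self.sample_size;
        Ok((data, num_samples, self.sample_size))
    }
}

impl StreamingDataReader for InMemoryReader {
    fn read_chunk(&mut self, buffer: &mut [f64]) -> Result<usize, ReadError> {
        let remaining = &self.data[self.cursor..];
        let n = remaining.len().min(buffer.len());
        buffer[..n].copy_from_slice(&remaining[..n]);
        self.cursor += n;
        Ok(n)
    }
}

/// Generic consumer: `R` is resolved at compile time (static dispatch), and
/// the only allocation is one fixed-size buffer, independent of input size.
pub fn sum_streaming<R: StreamingDataReader>(
    reader: &mut R,
    chunk_len: usize,
) -> Result<f64, ReadError> {
    let mut buffer = vec![0.0; chunk_len];
    let mut total = 0.0;
    loop {
        let n = reader.read_chunk(&mut buffer)?;
        if n == 0 {
            break;
        }
        total += buffer[..n].iter().sum::<f64>();
    }
    Ok(total)
}
```

In this sketch, adding a new input type would mean writing one more struct and implementing `DataReader` (and optionally `StreamingDataReader`) for it. Because `sum_streaming` is generic over `R` rather than taking a `&mut dyn StreamingDataReader`, the compiler monomorphizes it per reader type, which is the usual way to avoid virtual calls in Rust.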
