ryankert01 opened a new pull request, #753:
URL: https://github.com/apache/mahout/pull/753

   ### Purpose of PR
   ### Refactor QDP to Support Multiple Input Types 
   
   #### Problem
   QDP needs to support multiple input types. It currently supports Parquet 
and Arrow IPC, and we want to add more, such as NumPy and PyTorch. The solution 
needed to:
   1. Make it relatively easy to add more input types
   2. Not sacrifice speed or memory
   
   This PR introduces a flexible, trait-based reader system that meets both goals:
   
   #### Core Architecture
   - **`DataReader` trait**: Basic batch reading interface
   - **`StreamingDataReader` trait**: Advanced streaming for large files
   - **Format implementations**: Parquet (batch + streaming), Arrow IPC 
(batch), **NumPy (batch)** 
   - **Placeholders**: PyTorch (with implementation guide)
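
As a rough illustration of the two-trait layout listed above, here is a minimal sketch. The method names (`read_batch`, `read_chunk`), the `ReadError` type, the supertrait relationship, and the toy `InMemoryReader` are assumptions made for this example and are not taken from the PR's actual code.

```rust
use std::fmt;

/// Hypothetical error type used only in this sketch.
#[derive(Debug)]
pub struct ReadError(pub String);

impl fmt::Display for ReadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "read error: {}", self.0)
    }
}

impl std::error::Error for ReadError {}

/// Basic batch interface: load everything and report its shape.
pub trait DataReader {
    /// Returns (flattened values, number of samples, values per sample).
    fn read_batch(&mut self) -> Result<(Vec<f64>, usize, usize), ReadError>;
}

/// Streaming interface for large inputs (assumed here to extend `DataReader`).
pub trait StreamingDataReader: DataReader {
    /// Writes up to `buffer.len()` values and returns how many were written.
    /// A return value of 0 signals the end of the input.
    fn read_chunk(&mut self, buffer: &mut [f64]) -> Result<usize, ReadError>;
}

/// Toy in-memory "format" standing in for Parquet, Arrow IPC, or NumPy.
pub struct InMemoryReader {
    data: Vec<f64>,
    sample_size: usize,
    pos: usize,
}

impl InMemoryReader {
    pub fn new(data: Vec<f64>, sample_size: usize) -> Result<Self, ReadError> {
        if sample_size == 0 || data.len() % sample_size != 0 {
            return Err(ReadError(
                "data length must be a multiple of sample_size".into(),
            ));
        }
        Ok(Self { data, sample_size, pos: 0 })
    }
}

impl DataReader for InMemoryReader {
    fn read_batch(&mut self) -> Result<(Vec<f64>, usize, usize), ReadError> {
        let num_samples = self.data.len() / self.sample_size;
        Ok((self.data.clone(), num_samples, self.sample_size))
    }
}

impl StreamingDataReader for InMemoryReader {
    fn read_chunk(&mut self, buffer: &mut [f64]) -> Result<usize, ReadError> {
        // Copy as many remaining values as fit into the caller's buffer.
        let n = buffer.len().min(self.data.len() - self.pos);
        buffer[..n].copy_from_slice(&self.data[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}
```

Under this layout, adding a new input type would mean implementing `DataReader` (and optionally `StreamingDataReader`) for one new struct, without touching existing readers.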
   
   #### Zero Performance Impact
   - **Static dispatch**: No virtual function overhead
   - **Memory efficient**: Keeps streaming reads, so memory use stays constant (O(1)) regardless of file size
   - **Zero-copy**: Direct buffer access where possible (NumPy uses 
`into_raw_vec_and_offset`)
   - **Benchmarks**: Performance matches the pre-refactor baseline; adds a 
**new NumPy benchmark** 
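
The static-dispatch and constant-memory points above can be sketched as follows. This is an illustrative example only: `ChunkReader`, `CountingSource`, and `sum_streaming` are hypothetical names that do not come from the PR. The consumer is generic over the reader type, so the compiler monomorphizes it and calls `read_chunk` directly rather than through a `dyn` vtable, and it reuses a single fixed-size buffer no matter how much data the source yields.

```rust
/// Hypothetical minimal streaming trait for this sketch.
pub trait ChunkReader {
    /// Fills `buffer` with up to `buffer.len()` values and returns the
    /// number written; 0 means the input is exhausted.
    fn read_chunk(&mut self, buffer: &mut [f64]) -> usize;
}

/// Toy source that yields 0.0, 1.0, 2.0, ... without storing them all.
pub struct CountingSource {
    next: usize,
    total: usize,
}

impl CountingSource {
    pub fn new(total: usize) -> Self {
        Self { next: 0, total }
    }
}

impl ChunkReader for CountingSource {
    fn read_chunk(&mut self, buffer: &mut [f64]) -> usize {
        let n = buffer.len().min(self.total - self.next);
        for (i, slot) in buffer[..n].iter_mut().enumerate() {
            *slot = (self.next + i) as f64;
        }
        self.next += n;
        n
    }
}

/// Sums all values from `reader`, returning (sum, number of chunks read).
/// Generic over `R`: dispatch is resolved at compile time (static dispatch).
pub fn sum_streaming<R: ChunkReader>(reader: &mut R, chunk_len: usize) -> (f64, usize) {
    assert!(chunk_len > 0, "chunk_len must be positive");
    // The only allocation: its size depends on chunk_len, not on input size.
    let mut buffer = vec![0.0f64; chunk_len];
    let mut sum = 0.0;
    let mut chunks = 0;
    loop {
        let n = reader.read_chunk(&mut buffer);
        if n == 0 {
            break;
        }
        sum += buffer[..n].iter().sum::<f64>();
        chunks += 1;
    }
    (sum, chunks)
}
```

Swapping `R: ChunkReader` for `&mut dyn ChunkReader` would also work, but each `read_chunk` call would then go through a vtable, which is the overhead the static-dispatch design avoids.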
   
   
   

