andygrove commented on PR #1389:
URL: 
https://github.com/apache/datafusion-ballista/pull/1389#issuecomment-3765473575

   Thanks for the review @milenkovicm! I've addressed the feedback in the 
following commits:
   
   ## Addressed
   
   ### 1. Return stream instead of `Vec<RecordBatch>` (d4fcf1a7)
   Added `stream_sort_shuffle_partition()` that returns a 
`SendableRecordBatchStream` instead of collecting all batches into a vector. 
This reads batches lazily on-demand, reducing memory usage for large partitions.
   
   ### 2. Documentation improvements (0445c48c)
   - Added the Spark algorithm description: *"Internally, results from 
individual map tasks are kept in memory until they can't fit. Then, these are 
sorted based on the target partition and written to a single file. On the 
reduce side, tasks read the relevant sorted blocks."*
   - Fixed wording: "that writes" → "It writes"
   
   ### 3. Flight service support for remote reads (53b5d571)
   - Updated `do_get` to detect sort-based shuffle files and read only the 
requested partition using the index
   - Added detection in `IO_BLOCK_TRANSPORT` to return an informative error 
(block transport can't support sort-based shuffle since it transfers the entire 
file containing all partitions)
   
   ### 4. Parameterized tests with rstest (a297d475)
   Added remote flight read tests using `rstest` to parameterize all 15 sort 
shuffle tests. Tests now run in two modes:
   - `Local`: reads shuffle data from local filesystem
   - `RemoteFlight`: forces remote read through flight service with 
`BALLISTA_SHUFFLE_READER_FORCE_REMOTE_READ=true` and 
`BALLISTA_SHUFFLE_READER_REMOTE_PREFER_FLIGHT=true`
   
   All 32 tests pass (15 tests × 2 modes + 2 comparison tests).
   
   ## Notes
   
   - **Streaming for `spill.rs`**: Evaluated but deferred - batches flow 
through quickly during finalization so memory impact is less critical
   - **Block transport**: Returns `Status::unimplemented` for sort-based 
shuffle with a message directing users to enable flight transport


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to