alamb commented on code in PR #15355:
URL: https://github.com/apache/datafusion/pull/15355#discussion_r2009075716
##########
datafusion/physical-plan/src/sorts/sort.rs:
##########
@@ -230,9 +219,14 @@ struct ExternalSorter {
/// if `Self::in_mem_batches` are sorted
in_mem_batches_sorted: bool,
- /// If data has previously been spilled, the locations of the
- /// spill files (in Arrow IPC format)
- spills: Vec<RefCountedTempFile>,
+ /// During external sorting, in-memory intermediate data will be appended to
+ /// this file incrementally. Once finished, this file will be moved to [`Self::finished_spill_files`].
+ in_progress_spill_file: Option<InProgressSpillFile>,
+ /// If data has previously been spilled, the locations of the spill files (in
+ /// Arrow IPC format)
+ /// Within the same spill file, the data might be chunked into multiple batches,
+ /// and ordered by sort keys.
+ finished_spill_files: Vec<RefCountedTempFile>,
Review Comment:
The different semantics for different operations make sense to me.
I was thinking more mechanically: just storing the
`Vec<RefCountedTempFile>` as a field on `SortManager` and allowing Sort,
Hash, etc. to access / manipulate it as required. I think it is fine to consider
this in a future PR as well.
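
A rough sketch of the "store the vector on a shared manager" idea, for illustration only. `TempFile` and `SpillFileStore` are hypothetical stand-ins (not DataFusion types, and not the API in this PR); the real field would hold `RefCountedTempFile` values.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for DataFusion's `RefCountedTempFile`.
#[derive(Debug, Clone, PartialEq)]
struct TempFile {
    path: PathBuf,
}

/// Hypothetical manager that owns the finished spill files, so that
/// different operators (sort, hash, ...) can share the bookkeeping
/// while keeping their own semantics for what a file contains.
#[derive(Debug, Default)]
struct SpillFileStore {
    finished_spill_files: Vec<TempFile>,
}

impl SpillFileStore {
    /// Record a spill file that has been fully written.
    fn push(&mut self, file: TempFile) {
        self.finished_spill_files.push(file);
    }

    /// Read-only view of the finished spill files.
    fn files(&self) -> &[TempFile] {
        &self.finished_spill_files
    }

    /// Hand all finished files to the caller (e.g. for a final merge),
    /// leaving the store empty.
    fn take_all(&mut self) -> Vec<TempFile> {
        std::mem::take(&mut self.finished_spill_files)
    }
}
```

An operator would call `push` after finishing a spill and `take_all` when it is ready to read the files back; how the contents are interpreted stays with the operator.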
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]