milenkovicm commented on code in PR #1431:
URL:
https://github.com/apache/datafusion-ballista/pull/1431#discussion_r2750691854
##########
ballista/core/src/execution_plans/sort_shuffle/spill.rs:
##########
@@ -35,16 +35,17 @@ use std::path::PathBuf;
/// Manages spill files for sort-based shuffle.
///
/// When partition buffers exceed memory limits, they are spilled to disk
-/// as Arrow IPC files. During finalization, these spill files are read
-/// back and merged into the consolidated output file.
-#[derive(Debug)]
+/// as Arrow IPC files. Each output partition has at most one spill file
+/// that is appended to across multiple spill calls. During finalization,
+/// these spill files are read back and merged into the consolidated
+/// output file.
pub struct SpillManager {
/// Base directory for spill files
spill_dir: PathBuf,
- /// Spill files per output partition: partition_id -> Vec<spill_file_path>
- spill_files: HashMap<usize, Vec<PathBuf>>,
- /// Counter for generating unique spill file names
- spill_counter: usize,
+ /// Spill file path per output partition: partition_id -> spill_file_path
Review Comment:
So in the worst case this will be similar to the current non-sort spill
implementation, with as many open files as there are output partitions?
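
To make the question concrete, here is a minimal sketch of the one-spill-file-per-partition bookkeeping described in the doc comment. `SpillSketch`, `spill`, `spill_file_count` and the `part-{id}.spill` naming are hypothetical and not taken from the PR. It writes raw bytes instead of Arrow IPC, and it assumes the file is reopened in append mode on every spill call, which the diff does not confirm. Under that assumption the number of files on disk is bounded by the number of output partitions, while only one handle is open at a time.

```rust
use std::collections::HashMap;
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::PathBuf;

/// Hypothetical stand-in for the PR's `SpillManager`; only the
/// one-file-per-partition bookkeeping is modelled here.
pub struct SpillSketch {
    /// Base directory for spill files
    spill_dir: PathBuf,
    /// partition_id -> spill file path (at most one entry per partition)
    spill_files: HashMap<usize, PathBuf>,
}

impl SpillSketch {
    pub fn new(spill_dir: PathBuf) -> Self {
        Self {
            spill_dir,
            spill_files: HashMap::new(),
        }
    }

    /// Appends `bytes` to the partition's single spill file.
    /// Assumption of this sketch: the handle is opened per call and
    /// dropped on return, so handles do not accumulate across partitions.
    pub fn spill(&mut self, partition_id: usize, bytes: &[u8]) -> io::Result<()> {
        let dir = &self.spill_dir;
        let path = self
            .spill_files
            .entry(partition_id)
            .or_insert_with(|| dir.join(format!("part-{partition_id}.spill")));
        let mut file = OpenOptions::new().create(true).append(true).open(&*path)?;
        file.write_all(bytes)
    }

    /// Number of spill files created so far; never exceeds the number of
    /// distinct partition ids that have spilled.
    pub fn spill_file_count(&self) -> usize {
        self.spill_files.len()
    }
}
```

If the real implementation keeps a writer open per partition between spill calls instead, the open-handle count would grow with the number of partitions that have spilled, which is the case the question is about.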
##########
ballista/core/src/config.rs:
##########
@@ -69,6 +69,9 @@ pub const BALLISTA_SHUFFLE_SORT_BASED_MEMORY_LIMIT: &str =
/// Configuration key for sort shuffle spill threshold (0.0-1.0).
pub const BALLISTA_SHUFFLE_SORT_BASED_SPILL_THRESHOLD: &str =
"ballista.shuffle.sort_based.spill_threshold";
+/// Configuration key for sort shuffle target batch size in rows.
+pub const BALLISTA_SHUFFLE_SORT_BASED_BATCH_SIZE: &str =
+ "ballista.shuffle.sort_based.batch_size";
Review Comment:
Is there a specific reason we don't use DataFusion's batch size
configuration?
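
One possible shape for this, sketched under assumptions: prefer the Ballista-specific key when it is set and otherwise fall back to DataFusion's `datafusion.execution.batch_size`. The `resolve_batch_size` helper and the plain `HashMap` standing in for the session config are hypothetical and only illustrate the lookup order. They are not Ballista's or DataFusion's actual config API.

```rust
use std::collections::HashMap;

/// Key added in this PR.
pub const BALLISTA_SHUFFLE_SORT_BASED_BATCH_SIZE: &str =
    "ballista.shuffle.sort_based.batch_size";
/// DataFusion's session-level batch size key.
pub const DATAFUSION_BATCH_SIZE: &str = "datafusion.execution.batch_size";

/// Hypothetical resolver: returns the first positive, parseable value found
/// under the Ballista key, then the DataFusion key, else `default`.
pub fn resolve_batch_size(settings: &HashMap<String, String>, default: usize) -> usize {
    [BALLISTA_SHUFFLE_SORT_BASED_BATCH_SIZE, DATAFUSION_BATCH_SIZE]
        .iter()
        .filter_map(|key| settings.get(*key))
        .filter_map(|value| value.parse::<usize>().ok())
        .find(|rows| *rows > 0)
        .unwrap_or(default)
}
```

With this ordering the new key acts as an optional override, so users who only set DataFusion's batch size would get consistent behaviour in the shuffle writer.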
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]