alamb commented on code in PR #15405:
URL: https://github.com/apache/datafusion/pull/15405#discussion_r2012559291


##########
datafusion/physical-plan/src/aggregates/row_hash.rs:
##########
@@ -981,28 +978,30 @@ impl GroupedHashAggregateStream {
         Ok(())
     }
 
-    /// Emit all rows, sort them, and store them on disk.
+    /// Emit all intermediate aggregation states, sort them, and store them on 
disk.
+    /// This process helps in reducing memory pressure by allowing the data to 
be
+    /// read back with streaming merge.
     fn spill(&mut self) -> Result<()> {
+        // Emit and sort intermediate aggregation state
         let Some(emit) = self.emit(EmitTo::All, true)? else {
             return Ok(());
         };
         let sorted = sort_batch(&emit, self.spill_state.spill_expr.as_ref(), 
None)?;

Review Comment:
   eventually it might make sense to have the spill manager handle sorting the 
runs too (so it could potentially merge multiple files into a single run to 
reduce fanout, etc



##########
datafusion/physical-plan/src/aggregates/row_hash.rs:
##########
@@ -981,28 +978,30 @@ impl GroupedHashAggregateStream {
         Ok(())
     }
 
-    /// Emit all rows, sort them, and store them on disk.
+    /// Emit all intermediate aggregation states, sort them, and store them on 
disk.

Review Comment:
   ❤️ 



##########
datafusion/physical-plan/src/spill.rs:
##########
@@ -92,6 +50,10 @@ fn read_spill(sender: Sender<Result<RecordBatch>>, path: 
&Path) -> Result<()> {
 
 /// Spill the `RecordBatch` to disk as smaller batches
 /// split by `batch_size_rows`
+#[deprecated(
+    since = "46.0.0",
+    note = "This method is deprecated. Use 
`SpillManager::spill_record_batch_by_size` instead."

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to