kazuyukitanimura commented on code in PR #7400:
URL: https://github.com/apache/arrow-datafusion/pull/7400#discussion_r1326550421
##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -466,15 +625,122 @@ impl GroupedHashAggregateStream {
for acc in self.accumulators.iter_mut() {
match self.mode {
AggregateMode::Partial => output.extend(acc.state(emit_to)?),
+ _ if spilling => {
+ // If spilling, output partial state because the spilled
+ // data will be merged and re-evaluated later.
+ output.extend(acc.state(emit_to)?)
+ }
AggregateMode::Final
| AggregateMode::FinalPartitioned
| AggregateMode::Single
| AggregateMode::SinglePartitioned =>
output.push(acc.evaluate(emit_to)?),
}
}
- self.update_memory_reservation()?;
- let batch = RecordBatch::try_new(self.schema(), output)?;
+ // emit reduces the memory usage. Ignore Err from update_memory_reservation.
+ // Even if it is over the target memory size after emission, we can emit
+ // again rather than returning Err.
+ let _ = self.update_memory_reservation();
+ let batch = RecordBatch::try_new(schema, output)?;
Ok(batch)
}
+
+ /// Optimistically, [`Self::group_aggregate_batch`] allows exceeding the
+ /// memory target slightly (~ 1 [`RecordBatch`]) for simplicity. In such
+ /// cases, spill the data to disk and clear the memory. Currently only
+ /// [`GroupOrdering::None`] is supported for spilling.
+ fn spill_previous_if_necessary(&mut self, batch: &RecordBatch) -> Result<()> {
+ // TODO: support group_ordering for spilling
+ if self.group_values.len() > 0
+ && batch.num_rows() > 0
+ && matches!(self.group_ordering, GroupOrdering::None)
+ && !matches!(self.mode, AggregateMode::Partial)
+ && !self.spill_state.is_stream_merging
+ && self.update_memory_reservation().is_err()
+ {
+ // Use input batch (Partial mode) schema for spilling because
+ // the spilled data will be merged and re-evaluated later.
+ self.spill_state.spill_schema = batch.schema();
+ self.spill()?;
+ self.clear_shrink(batch);
+ }
+ Ok(())
+ }
+
+ /// Emit all rows, sort them, and store them on disk.
+ fn spill(&mut self) -> Result<()> {
+ let emit = self.emit(EmitTo::All, true)?;
+ let sorted = sort_batch(&emit, &self.spill_state.spill_expr, None)?;
Review Comment:
> ... so they are continuously aggregated into final aggregation result.
During this re-grouping, we may run out of memory again. The only option is to
stream the reads and output results early. In order to do so, the input for
re-grouping has to be sorted. Otherwise we cannot output the final aggregation
results early, and the hash table would have to keep all rows in memory. E.g.
`a = 2` can appear in the very first batch as well as in the very last. So the
last re-grouping step has to be a sort-based aggregation in order to avoid OOM.
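
To illustrate the point about sorted input enabling early output (a standalone sketch, not DataFusion's actual code; the function name and `(key, value)` pair representation are made up for the example): when rows arrive sorted by the group key, a group is complete the moment the key changes, so its result can be emitted immediately and its state dropped. Memory is bounded by a single group's state instead of a hash table over all groups.

```rust
// Hypothetical sketch: streaming SUM over input pre-sorted by group key.
// Because duplicates of a key are contiguous, a key change means the
// previous group is final and can be output early, freeing its state.
fn streaming_sum_by_key(sorted: &[(i64, i64)]) -> Vec<(i64, i64)> {
    let mut out = Vec::new();
    let mut current: Option<(i64, i64)> = None; // (key, running sum)
    for &(key, value) in sorted {
        match current {
            // Same key as the open group: keep accumulating.
            Some((k, sum)) if k == key => current = Some((k, sum + value)),
            // Key changed: the open group is final, emit it early.
            Some(done) => {
                out.push(done);
                current = Some((key, value));
            }
            // First row: open the first group.
            None => current = Some((key, value)),
        }
    }
    // Flush the last open group.
    if let Some(done) = current {
        out.push(done);
    }
    out
}
```

With unsorted input, `a = 2` could appear in both the first and the last batch, so no group could be finalized before the end, which is exactly the hash-table memory blowup the comment describes.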
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]