2010YOUY01 commented on code in PR #15700:
URL: https://github.com/apache/datafusion/pull/15700#discussion_r2041629246


##########
datafusion/physical-plan/src/sorts/sort.rs:
##########
@@ -431,12 +422,16 @@ impl ExternalSorter {
         let batches_to_spill = std::mem::take(globally_sorted_batches);
         self.reservation.free();
 
-        let in_progress_file = 
self.in_progress_spill_file.as_mut().ok_or_else(|| {
-            internal_datafusion_err!("In-progress spill file should be 
initialized")
-        })?;
+        let (in_progress_file, max_record_batch_size) =
+            self.in_progress_spill_file.as_mut().ok_or_else(|| {
+                internal_datafusion_err!("In-progress spill file should be 
initialized")
+            })?;
 
         for batch in batches_to_spill {
             in_progress_file.append_batch(&batch)?;
+
+            *max_record_batch_size =
+                (*max_record_batch_size).max(batch.get_actually_used_size());

Review Comment:
   There might be some type of arrays with complex internal buffer management, 
a simple example is:
   Before spilling an `StringView` array has 10MB actual content, backed by 3 * 
4MB buffer.
   After spilling and reading back, the reader implementation decided to use 1 
* 16MB buffer instead.
   Different allocation policies caused different fragmentation status, and 
physical memory consumed varies.
   
   Here are some real bugs found recently due to similar reasons (this explains 
why I'm worried about inconsistent memory size for logically equivalent 
batches):
   https://github.com/apache/datafusion/pull/14644
   https://github.com/apache/datafusion/pull/14823
   https://github.com/apache/datafusion/pull/13377
   Note they're only caused by primitive and string arrays, for more complex 
types like struct, array, or other nested types, I think it's even more likely 
to see such inconsistency.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to