comphead commented on code in PR #11218:
URL: https://github.com/apache/datafusion/pull/11218#discussion_r1676989856

##########
datafusion/physical-plan/src/lib.rs:
##########

@@ -852,6 +852,30 @@ pub fn spill_record_batches(
     Ok(writer.num_rows)
 }

+/// Spill the `RecordBatch` to disk as smaller batches
+/// split by `batch_size`.
+/// Returns the total number of rows spilled.
+pub fn spill_record_batch_by_size(
+    batch: RecordBatch,
+    path: PathBuf,
+    schema: SchemaRef,
+    batch_size: usize,
+) -> Result<usize, DataFusionError> {

Review Comment:
   Exactly, the idea behind this is to produce sub-batches so the consumer can read the data back using less memory. We use the same approach in `row_hash.rs`.
   Ideally, to be honest, this would return a `SendableRecordBatchStream`, but that will require more effort since the SMJ (sort-merge join) is not async for now.
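   For reference, a minimal sketch of the slicing idea described above: cut the input into zero-copy sub-batches of at most `batch_size` rows and write each one to the spill file, returning the total row count. This is an illustration, not the PR's actual implementation; the function name, the use of arrow's IPC `StreamWriter` (DataFusion's spill path has its own IPC writer helper), and the `arrow::error::Result` return type are assumptions.

   ```rust
   use std::fs::File;
   use std::path::PathBuf;
   use std::sync::Arc;

   use arrow::datatypes::Schema;
   use arrow::ipc::writer::StreamWriter;
   use arrow::record_batch::RecordBatch;

   // Hypothetical sketch of the "split by batch_size" spill; not the PR code.
   fn spill_record_batch_by_size_sketch(
       batch: RecordBatch,
       path: PathBuf,
       schema: Arc<Schema>,
       batch_size: usize,
   ) -> arrow::error::Result<usize> {
       let file = File::create(path)?;
       let mut writer = StreamWriter::try_new(file, &schema)?;

       let total_rows = batch.num_rows();
       let mut offset = 0;
       while offset < total_rows {
           let length = batch_size.min(total_rows - offset);
           // `slice` is zero-copy: it only adjusts offsets on the underlying
           // arrays, so each written sub-batch can later be read back
           // independently with a smaller memory footprint on the consumer side.
           let sub_batch = batch.slice(offset, length);
           writer.write(&sub_batch)?;
           offset += length;
       }
       writer.finish()?;

       Ok(total_rows)
   }
   ```

   Writing many small batches rather than one large one is what lets the reader pull the spilled data back a sub-batch at a time instead of materializing the whole spilled batch in memory.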
########## datafusion/physical-plan/src/lib.rs: ########## @@ -852,6 +852,30 @@ pub fn spill_record_batches( Ok(writer.num_rows) } +/// Spill the `RecordBatch` to disk as smaller batches +/// split by `batch_size` +/// Return `total_rows` what is spilled +pub fn spill_record_batch_by_size( + batch: RecordBatch, + path: PathBuf, + schema: SchemaRef, + batch_size: usize, +) -> Result<usize, DataFusionError> { Review Comment: Exactly the idea behind is to make sub batches to help the consumer reading data using less memory. The same approach we use in `row_hash.rs` Ideally to be honest is to return stream `SendableBatchRecordStream` but this will require more efforts as the SMJ is not async for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org