Re: [PR] Add `SessionContext::record_batches` [arrow-datafusion]

via GitHub Tue, 13 Feb 2024 03:10:07 -0800


alamb commented on code in PR #9197:
URL: https://github.com/apache/arrow-datafusion/pull/9197#discussion_r1487644594



##########
datafusion/core/src/execution/context/mod.rs:
##########
@@ -934,7 +935,30 @@ impl SessionContext {
             .build()?,
         ))
     }
-
+    /// Create a [`DataFrame`] for reading a [`Vec[`RecordBatch`]`]
+    pub fn read_batches(
+        &self,
+        batches: impl IntoIterator<Item = RecordBatch>,
+    ) -> Result<DataFrame> {
+        // check schema uniqueness
+        let mut batches = batches.into_iter().peekable();
+        let schema = if let Some(batch) = batches.peek() {
+            batch.schema().clone()
+        } else {
+            Arc::new(Schema::empty())
+        };
+        let provider =
+            MemTable::try_new(schema, batches.map(|batch| vec![batch]).collect())?;

Review Comment:
   There is a difference: as you say, the code in this PR puts each batch in 
its own partition. I think you are right that a single partition might be 
better (DataFusion will repartition the plan into multiple partitions if 
necessary).
   
   Is this something you can do, @Lordworms? Otherwise we can merge this PR as 
is and make it a follow-on change.
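
   To make the two layouts concrete, here is a minimal sketch of the 
difference. It deliberately avoids the real arrow/DataFusion types: `Batch` is 
a hypothetical stand-in for `RecordBatch`, and both function names are made up 
for illustration. The only thing taken from the PR is the shape of the 
`Vec<Vec<_>>` partitions argument passed to `MemTable::try_new`.

```rust
// Hypothetical stand-in for arrow's `RecordBatch`; it only carries a row count.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    rows: usize,
}

/// Layout produced by the code in this PR: every batch becomes its own
/// partition, so N batches yield N single-batch partitions.
fn one_partition_per_batch(batches: impl IntoIterator<Item = Batch>) -> Vec<Vec<Batch>> {
    batches.into_iter().map(|batch| vec![batch]).collect()
}

/// Suggested alternative: all batches go into one partition, leaving any
/// repartitioning to the query planner.
fn single_partition(batches: impl IntoIterator<Item = Batch>) -> Vec<Vec<Batch>> {
    vec![batches.into_iter().collect()]
}
```

   One behavioural difference to keep in mind: with empty input, the first 
function returns zero partitions, while the second returns one empty 
partition.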



