westonpace commented on a change in pull request #12533:
URL: https://github.com/apache/arrow/pull/12533#discussion_r818123668
##########
File path: r/src/compute-exec.cpp
##########
@@ -277,4 +277,18 @@ std::shared_ptr<compute::ExecNode> ExecNode_ReadFromRecordBatchReader(
   return MakeExecNodeOrStop("source", plan.get(), {}, options);
 }
+// [[arrow::export]]
+std::shared_ptr<compute::ExecNode> ExecNode_TableSourceNode(
+    const std::shared_ptr<compute::ExecPlan>& plan,
+    const std::shared_ptr<arrow::Table>& table) {
+  arrow::compute::TableSourceNodeOptions options{
+      /*table=*/table,
+      // TODO: make batch_size configurable
+      /*batch_size=*/1048576
Review comment:
So I was doing some related testing of the table source node yesterday,
and I think we are going to rename this option to `max_batch_size`,
because it doesn't actually concatenate small batches; it only splits up
large ones. I'm not mentioning this to delay this PR. Proceed as-is, and
we will adjust the R linkage in the PR that does the rename.

Mostly I just want to point out that batch size will still largely be
controlled by the customer through how they create their table (e.g. if
they create a table with many small batches, then we will feed many small
batches into the exec plan). This option is really only a failsafe, to
make sure we still get some parallelism when the customer's table has
very large batches.
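
To make the semantics concrete, here is a minimal C++ sketch (not part of
this PR) of how a table source node gets wired up with these options. It
assumes the Arrow C++ exec-plan API as of this writing: `TableSourceNodeOptions`,
the "table_source" factory name, and the `batch_size` field that is slated
to become `max_batch_size`. The helper `MakeTableSource` is hypothetical,
purely for illustration.

#include <arrow/compute/exec/exec_plan.h>
#include <arrow/compute/exec/options.h>
#include <arrow/result.h>
#include <arrow/table.h>

// Hypothetical helper, for illustration only.
arrow::Result<arrow::compute::ExecNode*> MakeTableSource(
    arrow::compute::ExecPlan* plan, std::shared_ptr<arrow::Table> table) {
  // batch_size (soon max_batch_size) is only an upper bound: chunks already
  // smaller than this are emitted as-is; only larger chunks get split.
  arrow::compute::TableSourceNodeOptions options{std::move(table),
                                                 /*batch_size=*/1 << 20};
  return arrow::compute::MakeExecNode("table_source", plan, /*inputs=*/{},
                                      options);
}

So if the customer builds their table from, say, 1000 batches of 1000 rows
each, the plan sees 1000 small batches regardless of this setting; the
1 << 20 (1048576) cap only kicks in for oversized chunks.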