bkietz commented on code in PR #14386:
URL: https://github.com/apache/arrow/pull/14386#discussion_r994747026


##########
cpp/src/arrow/compute/exec.cc:
##########
@@ -156,15 +156,23 @@ Result<ExecBatch> ExecBatch::Make(std::vector<Datum> values) {
 
 Result<std::shared_ptr<RecordBatch>> ExecBatch::ToRecordBatch(
     std::shared_ptr<Schema> schema, MemoryPool* pool) const {
+  if (static_cast<size_t>(schema->num_fields()) > values.size()) {
+    return Status::Invalid("ExecBatch::ToRecordBatch mismatching schema size");
+  }
   ArrayVector columns(schema->num_fields());
 
   for (size_t i = 0; i < columns.size(); ++i) {
     const Datum& value = values[i];
     if (value.is_array()) {
       columns[i] = value.make_array();
       continue;
+    } else if (value.is_scalar()) {

Review Comment:
   RecordBatch requires Array columns. ExecBatch also supports Scalars, which indicate a column with a constant value. Such constant columns arise frequently (for example, when a dataset has been partitioned on a column, a batch read from `cyl=4/dat.parquet` has a constant `cyl` column), and many kernels can directly accept a scalar argument (for example, the arithmetic kernels can add a scalar to an array), saving the memory and processor overhead of broadcasting to an Array.
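   The broadcast that `ToRecordBatch` must perform for a scalar column is just "repeat the constant for every row". A self-contained sketch of that idea, using a plain `std::variant` and `std::vector<int64_t>` as stand-ins for Arrow's `Datum`, `Scalar`, and `Array` types (these stand-ins are assumptions for illustration, not Arrow's API):

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <variant>
   #include <vector>

   // Stand-in for Arrow's Datum: a column is either a full array, or a
   // single scalar that represents a constant column.
   using Column = std::variant<std::vector<int64_t>, int64_t>;

   // Materialize a column of `length` rows: an array column is returned
   // as-is, a scalar column is broadcast by repeating it `length` times.
   std::vector<int64_t> ToArray(const Column& col, size_t length) {
     if (auto* arr = std::get_if<std::vector<int64_t>>(&col)) {
       return *arr;  // already an array
     }
     return std::vector<int64_t>(length, std::get<int64_t>(col));
   }

   int main() {
     Column array_col = std::vector<int64_t>{1, 2, 3};
     Column scalar_col = int64_t{4};  // e.g. the constant `cyl` column

     assert(ToArray(array_col, 3) == (std::vector<int64_t>{1, 2, 3}));
     assert(ToArray(scalar_col, 3) == (std::vector<int64_t>{4, 4, 4}));
     return 0;
   }
   ```

   This also shows why keeping the scalar inside the ExecBatch is the cheaper representation: the constant is stored once, and the O(length) copy only happens when an actual RecordBatch is required.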


