zanmato1984 commented on code in PR #39234:
URL: https://github.com/apache/arrow/pull/39234#discussion_r1434364154


##########
cpp/src/arrow/compute/light_array.cc:
##########
@@ -395,8 +395,12 @@ int ExecBatchBuilder::NumRowsToSkip(const std::shared_ptr<ArrayData>& column,
       --num_rows_left;
       int row_id_removed = row_ids[num_rows_left];
       const uint32_t* offsets =
-          reinterpret_cast<const uint32_t*>(column->buffers[1]->data());
+          reinterpret_cast<const uint32_t*>(column->buffers[1]->data()) + column->offset;
       num_bytes_skipped += offsets[row_id_removed + 1] - offsets[row_id_removed];
+      // Skip consecutive rows with the same id

Review Comment:
   > I don't understand what `row_ids` is or why this is needed.
   
   In `ExecBatchBuilder::AppendSelected`, `row_ids` identifies which rows of 
the given `source` array need to be appended to the target batch. This is a 
common operation in hash join: after probing the matching rows by comparing 
join keys, we collect the matching rows for each probe-side column according 
to the matching row ids. Note that the matching row ids may contain multiple 
occurrences of the same row, and the issue arises when the last matching row 
has multiple occurrences. This is the case I reproduced in the UT in this PR.
   
   `row_ids` is subsequently passed into `NumRowsToSkip`, which calculates the 
number of tail rows to skip in order to do a safe (within-boundary) 
word-by-word copy.
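
   The variable-length branch discussed above can be sketched in isolation. 
This is a simplified, standalone illustration and not the Arrow 
implementation: the function name `NumTailRowsToSkip` and the use of 
`std::vector` instead of `ArrayData` buffers are assumptions made for the 
example, and the offsets are assumed to be already adjusted by the column 
offset. It walks `row_ids` from the tail, counts the bytes of each distinct 
trailing row once, and skips consecutive occurrences of the same id 
(presumably because they refer to the same bytes in the source buffer).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for the variable-length branch of
// NumRowsToSkip. `offsets[i + 1] - offsets[i]` is the byte length of source
// row i; `row_ids` lists the selected source rows and may contain repeats.
// Returns how many trailing entries of `row_ids` must be skipped so that the
// distinct skipped rows cover at least `num_tail_bytes_to_skip` bytes.
int NumTailRowsToSkip(const std::vector<std::uint32_t>& offsets,
                      const std::vector<std::uint16_t>& row_ids,
                      std::uint32_t num_tail_bytes_to_skip) {
  const int num_rows = static_cast<int>(row_ids.size());
  int num_rows_left = num_rows;
  std::uint32_t num_bytes_skipped = 0;

  while (num_rows_left > 0 && num_bytes_skipped < num_tail_bytes_to_skip) {
    // Skip the last remaining row and count its bytes once.
    --num_rows_left;
    const int row_id_removed = row_ids[num_rows_left];
    assert(static_cast<std::size_t>(row_id_removed) + 1 < offsets.size());
    num_bytes_skipped += offsets[row_id_removed + 1] - offsets[row_id_removed];

    // Skip consecutive rows with the same id without counting their bytes
    // again.
    while (num_rows_left > 0 && row_ids[num_rows_left - 1] == row_id_removed) {
      --num_rows_left;
    }
  }
  return num_rows - num_rows_left;
}
```

   For example, with offsets `{0, 3, 5, 10}` (row lengths 3, 2 and 5 bytes), 
`row_ids` `{0, 1, 2, 2, 2}` and 4 tail bytes to skip, the sketch returns 3: 
all three trailing occurrences of row 2 are skipped, and its 5 bytes are 
counted only once.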
   
   For more information, please see the detailed description of this bug that 
I posted in the issue: 
https://github.com/apache/arrow/issues/32570#issuecomment-1856473812
   
   > Would you like to update the docstring for `NumRowsToSkip` to make the 
semantics more understandable?
   
   Of course, will do.
   
   > Also, why is `row_ids` ignored for fixed-width columns?
   
   I also explained this in my comment in the issue: 
https://github.com/apache/arrow/issues/32570#issuecomment-1856473812



