zanmato1984 commented on code in PR #39234:
URL: https://github.com/apache/arrow/pull/39234#discussion_r1434364154
##########
cpp/src/arrow/compute/light_array.cc:
##########
@@ -395,8 +395,12 @@ int ExecBatchBuilder::NumRowsToSkip(const std::shared_ptr<ArrayData>& column,
       --num_rows_left;
       int row_id_removed = row_ids[num_rows_left];
       const uint32_t* offsets =
-          reinterpret_cast<const uint32_t*>(column->buffers[1]->data());
+          reinterpret_cast<const uint32_t*>(column->buffers[1]->data()) + column->offset;
       num_bytes_skipped += offsets[row_id_removed + 1] - offsets[row_id_removed];
+      // Skip consecutive rows with the same id
Review Comment:
> I don't understand what `row_ids` is or why this is needed.
In `ExecBatchBuilder::AppendSelected`, `row_ids` identifies which rows of the
given `source` array are appended to the target batch. `row_ids` is then
passed to `NumRowsToSkip`, which calculates how many tail rows must be skipped
so that the remaining rows can be copied word by word without reading out of
bounds. For more information, please see the detailed description of this bug
in my comment on the issue:
https://github.com/apache/arrow/issues/32570#issuecomment-1856473812
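To make the role of `row_ids` concrete, here is a minimal standalone sketch of the variable-length case, based only on the diff hunk above. It is not the Arrow implementation: the function name `NumTailRowsToSkip` and the use of `std::vector` are assumptions for illustration, and the `column->offset` adjustment from the diff is left out because the sketch takes the offsets directly.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the variable-length branch of NumRowsToSkip.
// offsets: value offsets of the source column (size = number of source rows + 1).
// row_ids: ids of the source rows selected for appending, in append order.
// Returns how many rows at the tail of row_ids must be skipped so that the
// skipped rows cover at least num_tail_bytes_to_skip bytes of source data.
int NumTailRowsToSkip(const std::vector<uint32_t>& offsets,
                      const std::vector<uint16_t>& row_ids,
                      uint32_t num_tail_bytes_to_skip) {
  const int num_rows = static_cast<int>(row_ids.size());
  int num_rows_left = num_rows;
  uint32_t num_bytes_skipped = 0;
  while (num_rows_left > 0 && num_bytes_skipped < num_tail_bytes_to_skip) {
    // Skip the last remaining row and count its bytes once.
    --num_rows_left;
    const int row_id_removed = row_ids[num_rows_left];
    num_bytes_skipped += offsets[row_id_removed + 1] - offsets[row_id_removed];
    // Skip consecutive rows with the same id: they refer to the same source
    // bytes, so they must not be counted again.
    while (num_rows_left > 0 && row_ids[num_rows_left - 1] == row_id_removed) {
      --num_rows_left;
    }
  }
  return num_rows - num_rows_left;
}
```

For example, with `offsets = {0, 4, 8, 12}`, `row_ids = {0, 1, 2, 2, 2}` and 8 tail bytes to skip, this sketch skips four rows (ids 1, 2, 2, 2). Counting the bytes of each duplicate separately would stop after two rows, leaving a row with id 2 in the word-copy range.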
> Would you like to update the docstring for `NumRowsToSkip` to make the
> semantics more understandable?
Of course, will do.
> Also, why is `row_ids` ignored for fixed-width columns?
I explained this as well in my comment on the issue:
https://github.com/apache/arrow/issues/32570#issuecomment-1856473812