wypoon commented on code in PR #4588:
URL: https://github.com/apache/iceberg/pull/4588#discussion_r950768444
##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -183,6 +187,8 @@ Pair<int[], Integer> buildPosDelRowIdMapping(PositionDeleteIndex deletedRowPosit
currentRowId++;
} else if (hasIsDeletedColumn) {
isDeleted[originalRowId] = true;
+ } else {
+ deletes.incrementDeleteCount();
}
Review Comment:
@flyrain, thank you for explaining how reads with `hasIsDeletedColumn` occur.
I didn't really understand that before, but based on the comment from
@RussellSpitzer that I cited, I figured it was fine not to count deletes for
such reads. As you noticed, I have done this consistently for both vectorized
and non-vectorized reads; this is deliberate.
Now that I understand the use case for `hasIsDeletedColumn`, I agree there is
merit to the argument that the "number of row deletes applied" metric should
be counted regardless of `hasIsDeletedColumn`. On the other hand, for queries
where `hasIsDeletedColumn` is true, the number of row deletes applied can be
inferred from the result. I don't have a strong opinion on what to do in the
`hasIsDeletedColumn` scenario, but the current implementation is at least
consistent. @RussellSpitzer, do you want to weigh in?
If we want to always count the deletes, we'll need to modify
`DeleteFilter#markDeleted` (passing the `DeleteCounter` in to it), in addition
to making the changes you mention for `ColumnarBatchReader`.
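To make the two options concrete, here is a minimal, self-contained sketch of the loop in question. It is not the Iceberg code: `DeleteCountSketch`, its nested `DeleteCounter`, the `applyPositionDeletes` method, and the `countAlways` switch are all hypothetical stand-ins, and a plain `Set<Long>` replaces `PositionDeleteIndex`. With `countAlways` set to false it mirrors the behavior in this PR; with true it shows what "count the deletes always" would look like.

```java
import java.util.Set;

// Illustrative sketch only: simplified stand-ins, not Iceberg's actual classes.
final class DeleteCountSketch {

  /** Minimal counter, assumed to behave like the counter discussed in this PR. */
  static final class DeleteCounter {
    private long count = 0L;

    void incrementDeleteCount() {
      count++;
    }

    long get() {
      return count;
    }
  }

  /**
   * Walks one batch of rows and returns the number of rows that are not deleted.
   *
   * @param countAlways hypothetical switch: false mirrors the PR (no counting when
   *     rows are only marked), true also counts deletes that are only marked
   */
  static int applyPositionDeletes(
      Set<Long> deletedPositions,
      long rowStartPosInBatch,
      int numRows,
      boolean hasIsDeletedColumn,
      boolean[] isDeleted,
      DeleteCounter counter,
      boolean countAlways) {
    int liveRows = 0;
    for (int originalRowId = 0; originalRowId < numRows; originalRowId++) {
      if (!deletedPositions.contains(originalRowId + rowStartPosInBatch)) {
        liveRows++;
      } else {
        if (hasIsDeletedColumn) {
          // The row stays in the output and is flagged instead of being dropped.
          isDeleted[originalRowId] = true;
        }
        if (!hasIsDeletedColumn || countAlways) {
          counter.incrementDeleteCount();
        }
      }
    }
    return liveRows;
  }

  public static void main(String[] args) {
    Set<Long> deleted = Set.of(1L, 3L);

    DeleteCounter asInPr = new DeleteCounter();
    applyPositionDeletes(deleted, 0L, 5, true, new boolean[5], asInPr, false);

    DeleteCounter always = new DeleteCounter();
    applyPositionDeletes(deleted, 0L, 5, true, new boolean[5], always, true);

    System.out.println("marked only, PR behavior: " + asInPr.get());
    System.out.println("marked only, count always: " + always.get());
  }
}
```

In the marked-only case the caller can still recover the number of deletes by counting the `true` entries in `isDeleted`, which is the "infer it from the result" argument above.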
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]