[GitHub] [iceberg] wypoon commented on a diff in pull request #4588: Spark: Add custom metric for number of deletes applied by a SparkScan

GitBox Sat, 20 Aug 2022 18:25:23 -0700


wypoon commented on code in PR #4588:
URL: https://github.com/apache/iceberg/pull/4588#discussion_r950768444



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -183,6 +187,8 @@ Pair<int[], Integer> buildPosDelRowIdMapping(PositionDeleteIndex deletedRowPosit
           currentRowId++;
         } else if (hasIsDeletedColumn) {
           isDeleted[originalRowId] = true;
+        } else {
+          deletes.incrementDeleteCount();
         }

Review Comment:
   @flyrain thank you for explaining how reads with `hasIsDeletedColumn` occur. 
   I didn't really understand that before, but based on @RussellSpitzer's 
comment that I cited, I figured it was fine not to count deletes for such 
reads. As you noticed, I have done this consistently for both vectorized and 
non-vectorized reads. This is deliberate.
   
   Now that I understand the use case for `hasIsDeletedColumn`, I agree that 
there is merit to the argument that the "number of row deletes applied" metric 
should apply regardless of `hasIsDeletedColumn`. On the other hand, for 
queries where `hasIsDeletedColumn` is true, you can infer the number of 
row deletes applied from the result. I don't have a strong opinion on what to 
do in the `hasIsDeletedColumn` scenario, but the current implementation is 
consistent. @RussellSpitzer do you want to weigh in? 
   
   If we want to always count the deletes, we'll need to modify 
`DeleteFilter#markDeleted` (to pass in the `DeleteCounter`) and also make the 
changes you mention for `ColumnarBatchReader`.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

