wypoon opened a new pull request, #4588:
URL: https://github.com/apache/iceberg/pull/4588

   This is an extension of #4395.
   Here we add a custom metric for the number of deletes applied in a scan of 
a format v2 table.
   
   We introduce a counter in `BatchDataReader` and `RowDataReader` that is 
incremented whenever a delete is applied. This counter is passed into 
`DeleteFilter` and, in the cases where we construct a `PositionDeleteIndex`, 
into the `PositionDeleteIndex` implementation, so that every read path 
increments it when a delete is applied. When Spark calls 
`currentMetricsValues()` on a `PartitionReader`, which is a subclass of either 
`BatchDataReader` or `RowDataReader`, we return the current value of the 
counter.
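
   For illustration, here is a minimal sketch of how such a counter could be 
threaded from the reader into the delete-application code and surfaced through 
Spark's DataSource V2 metrics API. It assumes Spark 3.2+'s `CustomTaskMetric` 
interface; `DeleteCounter` and `NumDeletesTaskMetric` are hypothetical names 
used here for the sketch, not the classes in this PR.

   ```java
   import java.util.concurrent.atomic.AtomicLong;

   import org.apache.spark.sql.connector.metric.CustomTaskMetric;

   // Hypothetical mutable counter shared by the reader and the delete filter.
   class DeleteCounter {
     private final AtomicLong deletes = new AtomicLong();

     void increment() {
       deletes.incrementAndGet(); // called each time a delete is applied
     }

     long value() {
       return deletes.get();
     }
   }

   // Task-level metric snapshot that Spark collects from the executor.
   class NumDeletesTaskMetric implements CustomTaskMetric {
     private final long value;

     NumDeletesTaskMetric(long value) {
       this.value = value;
     }

     @Override
     public String name() {
       return "numDeletes";
     }

     @Override
     public long value() {
       return value;
     }
   }
   ```

   A `PartitionReader` holding the counter could then report it when Spark 
polls the task:

   ```java
   @Override
   public CustomTaskMetric[] currentMetricsValues() {
     return new CustomTaskMetric[] { new NumDeletesTaskMetric(counter.value()) };
   }
   ```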
   
   Tested manually by creating format v2 tables using Parquet, ORC, and Avro 
data files, deleting and updating rows in the tables, and reading from them. 
The expected number of deletes shows up in the Spark UI.
   Also extended the existing unit tests (`DeleteReadTests`) to count the 
number of deletes applied during the scan.
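
   For the value to appear in the Spark UI, the scan also needs to declare a 
matching driver-side metric. A minimal sketch, assuming Spark's 
`CustomSumMetric` base class (which sums the per-task values); the class name 
and description here are illustrative:

   ```java
   import org.apache.spark.sql.connector.metric.CustomSumMetric;

   // Driver-side metric; name() must match the task metric's name().
   class NumDeletesMetric extends CustomSumMetric {
     @Override
     public String name() {
       return "numDeletes";
     }

     @Override
     public String description() {
       return "number of deletes applied in the scan";
     }
   }
   ```

   The `Scan` implementation would advertise it via 
`supportedCustomMetrics()`, e.g. `return new CustomMetric[] { new 
NumDeletesMetric() };`, so that Spark aggregates and displays the task values.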
   
   <img width="529" alt="Screen Shot 2022-04-19 at 8 59 31 PM" 
src="https://user-images.githubusercontent.com/3925490/164147620-7085eafd-304d-45d1-aa53-0c6029638d48.png">
   

