wypoon opened a new pull request, #4588:
URL: https://github.com/apache/iceberg/pull/4588

   This is an extension of #4395.
   Here we add a custom metric for the number of deletes applied in a scan of 
a format v2 table.
   
   We introduce a counter in `BatchDataReader` and `RowDataReader` that is 
incremented whenever a delete is applied. This counter is passed into 
`DeleteFilter` and, in the cases where we construct a `PositionDeleteIndex`, 
into the `PositionDeleteIndex` implementation, so that every read path 
increments it when a delete is applied. When Spark calls 
`currentMetricsValues()` on a `PartitionReader`, which is a subclass of either 
`BatchDataReader` or `RowDataReader`, we return the current value of the 
counter.
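
   For illustration, here is a minimal sketch of how such a counter could be 
threaded from the reader into the delete-application code and surfaced through 
Spark's DataSource V2 metrics API. It assumes Spark 3.2+'s `CustomTaskMetric` 
interface; `DeleteCounter` and `NumDeletesTaskMetric` are hypothetical names 
used here for the sketch, not the classes in this PR.

   ```java
   import java.util.concurrent.atomic.AtomicLong;

   import org.apache.spark.sql.connector.metric.CustomTaskMetric;

   // Hypothetical mutable counter shared by the reader and the delete filter.
   class DeleteCounter {
     private final AtomicLong deletes = new AtomicLong();

     void increment() {
       deletes.incrementAndGet(); // called each time a delete is applied
     }

     long value() {
       return deletes.get();
     }
   }

   // Task-level metric snapshot that Spark collects from the executor.
   class NumDeletesTaskMetric implements CustomTaskMetric {
     private final long value;

     NumDeletesTaskMetric(long value) {
       this.value = value;
     }

     @Override
     public String name() {
       return "numDeletes";
     }

     @Override
     public long value() {
       return value;
     }
   }
   ```

   A `PartitionReader` holding the counter could then report it when Spark 
polls the task:

   ```java
   @Override
   public CustomTaskMetric[] currentMetricsValues() {
     return new CustomTaskMetric[] { new NumDeletesTaskMetric(counter.value()) };
   }
   ```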
   
   Tested manually by creating format v2 tables using Parquet, ORC, and Avro 
data files, deleting and updating rows in the tables, and reading from them. 
The expected number of deletes shows up in the Spark UI.
   Also extended the existing unit tests (`DeleteReadTests`) to count the 
number of deletes applied during the scan.
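
   For the value to appear in the Spark UI, the scan also needs to declare a 
matching driver-side metric. A minimal sketch, assuming Spark's 
`CustomSumMetric` base class (which sums the per-task values); the class name 
and description here are illustrative:

   ```java
   import org.apache.spark.sql.connector.metric.CustomSumMetric;

   // Driver-side metric; name() must match the task metric's name().
   class NumDeletesMetric extends CustomSumMetric {
     @Override
     public String name() {
       return "numDeletes";
     }

     @Override
     public String description() {
       return "number of deletes applied in the scan";
     }
   }
   ```

   The `Scan` implementation would advertise it via 
`supportedCustomMetrics()`, e.g. `return new CustomMetric[] { new 
NumDeletesMetric() };`, so that Spark aggregates and displays the task values.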
   
   <img width="529" alt="Screen Shot 2022-04-19 at 8 59 31 PM" 
src="https://user-images.githubusercontent.com/3925490/164147620-7085eafd-304d-45d1-aa53-0c6029638d48.png">
   

