gaborkaszab commented on PR #14508: URL: https://github.com/apache/iceberg/pull/14508#issuecomment-3491094392
Some background: The current way to query partition stats is through `PartitionStatsHandler.readPartitionStatsFile()`. For this, the user has to put together the schema and get the input file to read. For easier usability, it would be beneficial to have a more convenient API to scan partition stats (one comment on [my stats proposal doc](https://docs.google.com/document/d/1H9uYt53Q1_CcOXOfLcr0hXRxvqflg_k_xeVorMLrWbM/edit?pli=1&disco=AAABoDRdYcw) also mentions this). This could also have filter and projection capabilities.

The content of this PR:
1) Introduce the `PartitionStatisticsScan` API and its implementation `BasePartitionStatisticsScan` in core. For simplicity, this has only the functionality that exists today: no filtering by partition, no projection.
2) Replace the usage of `PartitionStatsHandler.readPartitionStatsFile()` with the new API.
3) Introduce the `PartitionStatistics` interface in the API module and make `PartitionStats` in core derive from it. This is needed so that the Scan API can use it as its return value, while the existing `PartitionStats` class stays in the core module.
4) Replace the usage of `PartitionStats` with the new interface wherever possible.

These could be follow-up steps:
1) Implementation of `filter()` and `project()` on the new Scan API.
2) The naming of the affected classes is a bit weird: the interface `api/PartitionStatistics` is implemented by `core/PartitionStats`. Ideally the name of the implementation would be `BasePartitionStatistics`. As a next step we can introduce a class with the same content and the new name, deprecate the existing one, and remove its usages. Changes within `PartitionStats` are easier to review if the "renaming" happens in a follow-up PR.
3) Older Spark versions should be covered.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
