szehon-ho opened a new issue #2326: URL: https://github.com/apache/iceberg/issues/2326
I was struggling the other day with listing partitions of a table with 4000 manifest files: it takes ~10 minutes on S3, which is a bit of a shock for people used to the speed of querying the Hive metastore. While it is possible to optimize the metadata, I was chatting a bit with @RussellSpitzer about this issue, and it seems the way to go is making the reading of PartitionTable a proper Spark job with predicate push down (https://github.com/apache/iceberg/pull/1421 and https://github.com/apache/iceberg/issues/1552).

Going back to the common use-case of listing partitions, one quick win may be getting the min/max partition. I saw this a lot in Hive, used by schedulers like Airflow sensors to detect new data, or by retention tools to detect old data and kick in cleanup. Unfortunately, aggregate push-down does not seem to be supported in Spark DSv2, unless I am mistaken; that would be the ideal solution, since the aggregate could then be pushed down to the PartitionTable. But the information also seems to be available in the manifest list (partition boundaries), which would be really fast to read. Does it make sense to expose this just as an Iceberg API for a quick win? (A rough sketch of reading those summaries is below.)

Another workaround is to push the predicate filter down in the existing non-distributed PartitionTable TableScan, which is not done today. I think it would be a good change anyway. The query logic to find the max/min partition is a bit more complex (keep expanding the predicate from the expected latest until you hit something), but it is also useful for answering other queries, for example how many partitions exist for a given day. Is there any reason not to do this?

@rdblue @aokolnychyi do you guys have any thoughts about these ideas, or any other thoughts you had in the past about this? Thanks
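If exposing this through the Iceberg API sounds reasonable, here is a minimal sketch of what reading partition boundaries off the manifest list could look like. It is only an illustration: the class and method names are made up, it only looks at the first partition field, and depending on the Iceberg version `dataManifests()` may need `table.io()` passed in.

```java
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.PartitionField;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Type;

public class PartitionBoundsSketch {

  // Rough sketch: derive min/max bounds for the first partition field by reading only the
  // manifest-list summaries of the current snapshot, without opening any manifest files.
  @SuppressWarnings("unchecked")
  static void printFirstPartitionFieldBounds(Table table) {
    Snapshot snapshot = table.currentSnapshot();
    if (snapshot == null) {
      return; // empty table, nothing to report
    }

    // the bound type is the partition transform's result type, not the source column type
    PartitionField field = table.spec().fields().get(0);
    Type boundType = field.transform().getResultType(table.schema().findType(field.sourceId()));

    Object min = null;
    Object max = null;

    // NOTE: newer Iceberg versions take table.io() as an argument to dataManifests()
    for (ManifestFile manifest : snapshot.dataManifests()) {
      ManifestFile.PartitionFieldSummary summary = manifest.partitions().get(0);
      if (summary.lowerBound() == null || summary.upperBound() == null) {
        continue; // bounds can be missing, e.g. when the field only contains nulls
      }

      Object lower = Conversions.fromByteBuffer(boundType, summary.lowerBound());
      Object upper = Conversions.fromByteBuffer(boundType, summary.upperBound());

      if (min == null || ((Comparable<Object>) lower).compareTo(min) < 0) {
        min = lower;
      }
      if (max == null || ((Comparable<Object>) upper).compareTo(max) > 0) {
        max = upper;
      }
    }

    System.out.println("min partition bound: " + min + ", max partition bound: " + max);
  }
}
```

One caveat: the manifest-list summaries may also cover entries for files that are no longer live, so the bounds could be wider than the set of partitions that currently hold data.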

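For the last point, here is a rough illustration of the "keep expanding the predicate" idea against the partitions metadata table from Spark, assuming an identity partition on a date column named dt (the table name, column, and helper method are placeholders). Today this filter is not pushed into the PartitionTable scan, so each probe still reads all manifests, which is exactly what the proposed push-down would fix.

```java
import java.time.LocalDate;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LatestPartitionProbe {

  // Hypothetical helper: widen a date window backwards from today until the partitions
  // metadata table returns something, then take the max inside that window.
  static LocalDate findLatestPartition(SparkSession spark, String tableName, int maxDaysBack) {
    LocalDate lowerBound = LocalDate.now();
    for (int i = 0; i < maxDaysBack; i++) {
      Row result = spark.sql(String.format(
          "SELECT max(partition.dt) FROM %s.partitions WHERE partition.dt >= DATE '%s'",
          tableName, lowerBound)).first();
      if (!result.isNullAt(0)) {
        return result.getDate(0).toLocalDate(); // latest partition within the current window
      }
      lowerBound = lowerBound.minusDays(1);     // nothing yet, widen the window and retry
    }
    return null; // no partitions found within maxDaysBack days
  }
}
```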