aokolnychyi commented on issue #1422: URL: https://github.com/apache/iceberg/issues/1422#issuecomment-686915430
To give a bit more background, job planning is fast even on huge tables if we have a partition predicate and end up processing 10-15 partitions in a single job. We have `RewriteManifestsAction` to rewrite metadata and align it with partitions on demand. That covers all common use cases.

At the same time, Iceberg supports file filtering within partitions and opens the opportunity for efficient *full* table scans. So there are two use cases we want to address:

- job planning for full table scans (less important)
- job planning for queries with predicates only on the sort key in partitioned tables (becoming common in Iceberg tables)

The second use case is the primary one. It is a common pattern to partition tables by date, sort by key, and run queries *without* partition predicates but with predicates on the sort key. Such use cases were not possible without Iceberg since we could not filter files within partitions and we ended up with full table scans. Right now, we can narrow down the number of matching files to 1-2 per partition, so we can scan PB-scale tables for a given key faster. However, if we do so right now, we end up spending most of the time in job planning. That's why it would be better to parallelize that.
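
To make the second use case concrete, here is a minimal toy sketch in plain Python. It is not Iceberg code and does not use any Iceberg API. It assumes one manifest per date partition and that each file entry carries lower and upper bounds for the sort key; `DataFile`, `plan_manifest`, `plan_scan`, and `make_table` are hypothetical names made up for illustration. A key lookup without a partition predicate has to visit every manifest, but only keeps the files whose key range can contain the key, and the per-manifest work is independent, so it can be spread over workers.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str  # date the table is partitioned by
    lower: int      # smallest sort-key value in the file
    upper: int      # largest sort-key value in the file


def plan_manifest(manifest, key):
    """Keep the files of one manifest whose [lower, upper] range may contain key."""
    return [f for f in manifest if f.lower <= key <= f.upper]


def plan_scan(manifests, key, workers=4):
    """Filter every manifest on a worker pool; results keep manifest order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_manifest = pool.map(plan_manifest, manifests, [key] * len(manifests))
        return [f for matches in per_manifest for f in matches]


def make_table(days=3, files_per_day=4, keys_per_file=100):
    """Build one manifest per date partition; files inside it are sorted by key."""
    manifests = []
    for day in range(1, days + 1):
        partition = f"2020-09-{day:02d}"
        manifests.append([
            DataFile(
                path=f"{partition}/part-{i}.parquet",
                partition=partition,
                lower=i * keys_per_file,
                upper=(i + 1) * keys_per_file - 1,
            )
            for i in range(files_per_day)
        ])
    return manifests


manifests = make_table()
matches = plan_scan(manifests, key=250)  # no partition predicate, only a key
for f in matches:
    print(f.path)
```

In this toy table every partition contributes exactly one matching file, yet all manifests still have to be read to find it, which is where planning time goes on a large table. The thread pool only stands in for whatever parallel executor would be used in practice.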
