aokolnychyi edited a comment on issue #1422: URL: https://github.com/apache/iceberg/issues/1422#issuecomment-687464819
I haven't formed an opinion on which approach is more promising right now. Here are my thoughts after a quick look:

**Option 1 (extension to the `TableScan` API)**

Pros:
- Reuses the existing filtering logic.
- The API could be reused by other query engines.
- Seems to require less code?

Cons:
- A substantial change to the core planning logic that requires thorough testing (both performance and correctness).
- Requires thinking about serialization, especially Kryo serialization, during planning (not needed before).

**Option 2 (metadata tables)**

Pros:
- Reuses the existing logic to read manifests in a scalable way via metadata tables.
- Reuses the existing logic for wrapping Spark `Row`s into Iceberg `DataFile`s.
- Doesn't touch the core planning API, so it is more isolated.
- Could perhaps be exposed as an action (does that make sense?)

Cons:
- Requires instantiating evaluators ourselves.
- Seems to require a bit more code, though I feel it can be simplified.
- Specific to Spark, but could be implemented in other systems that support metadata tables.

Implementation aside, we need to consider when to apply this. Ideally, we would have some sort of threshold on the size of the manifests we need to scan after manifest filtering: if we narrow the scope down to a couple of manifests, plan locally; otherwise, plan using Spark. I am not sure it will be that easy, though.

Instead, we could support a Spark read option that tells us which mode to use, with values `local`, `distributed`, and `auto`. In `auto`, the simplest option is to analyze the scan condition: if there is a reasonably selective partition predicate (e.g. `equals` or `inSet`), always plan locally; if not, and if distributed planning is enabled, leverage Spark.
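To make the "instantiating evaluators ourselves" point under Option 2 concrete, here is a minimal, purely illustrative sketch. None of these names (`LocalFileFilter`, `FileEntry`, `filterByPartition`) come from the Iceberg API; the real implementation would wrap Spark `Row`s read from a metadata table into Iceberg `DataFile`s and apply Iceberg's expression evaluators rather than a hand-rolled string comparison.

```java
import java.util.ArrayList;
import java.util.List;

public class LocalFileFilter {
  // Hypothetical stand-in for a data-file entry read from a metadata table.
  public static class FileEntry {
    public final String path;
    public final String partitionValue;

    public FileEntry(String path, String partitionValue) {
      this.path = path;
      this.partitionValue = partitionValue;
    }
  }

  // Keep only files whose partition value satisfies an `equals` predicate,
  // the way an evaluator would prune data files during planning.
  public static List<FileEntry> filterByPartition(List<FileEntry> files, String value) {
    List<FileEntry> matched = new ArrayList<>();
    for (FileEntry f : files) {
      if (f.partitionValue.equals(value)) {
        matched.add(f);
      }
    }
    return matched;
  }
}
```

In the actual Option 2 code path, this filtering step would run inside a Spark job over the metadata table rather than on a local list.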
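The mode-selection idea above could be sketched as a small decision function. This is only an assumption about how `auto` might resolve, and the names (`PlanningMode`, `PlanChooser`, `resolve`, the `hasSelectivePartitionPredicate` flag) are hypothetical, not part of any Iceberg or Spark API:

```java
public class PlanChooser {
  // The three values a hypothetical Spark read option could take.
  public enum PlanningMode { LOCAL, DISTRIBUTED, AUTO }

  /**
   * Resolve the requested mode to a concrete plan. In AUTO, plan locally
   * only when the scan condition contains a reasonably selective partition
   * predicate (e.g. equals or inSet); otherwise fall back to Spark.
   */
  public static PlanningMode resolve(PlanningMode requested,
                                     boolean hasSelectivePartitionPredicate) {
    if (requested != PlanningMode.AUTO) {
      return requested;  // explicit local/distributed always wins
    }
    return hasSelectivePartitionPredicate
        ? PlanningMode.LOCAL
        : PlanningMode.DISTRIBUTED;
  }
}
```

A size-of-remaining-manifests threshold could later replace or refine the boolean predicate check without changing the shape of this decision.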
