rdblue commented on pull request #2017: URL: https://github.com/apache/iceberg/pull/2017#issuecomment-754277810
> It prohibits optimizations on delete/update/merge nodes and makes certain rewrites more complicated.

I don't agree that those optimizations are prohibited. I think they are still possible; the implementation would just be different. Your example would initially be converted to a full join producing rows for a merge node, but we could write a separate optimizer rule that matches a merge with a full join and checks whether the join type can be changed. Instead of doing the entire task in a single rewrite rule, we would have one rule to rewrite to the full join in the analyzer, one to prune cases with `false` conditions, and one to rewrite the join. I think that decomposing this into smaller independent rules is a good thing.

> If we rewrite the plans after operator optimizations as we planned earlier, we miss ...

We would also miss analyzer rules. In our Spark build, we have support for metadata columns, and that's how we project `_file` and `_pos`. The problem is that the rule that adds metadata columns to the logical plan runs in the analyzer, so we would have to add them to the projection in the optimizer rewrite rule. Quite a few tasks and optimizations end up in this one rule because it needs to run late; we're accumulating quite a list of rules that won't run on these plans by doing the rewrite in the optimizer.

And we lose more depending on which batch we use in the optimizer. If the rewrite is done in operator optimization, there's no guarantee that expressions are fully optimized before this rule runs; that could be a problem for the expr-equivalent-to-false example. On the other hand, expressions produced by the rule would still get optimized, unless we move it after operator optimization. If we do that, then the `Not(cond)` we introduce might be `Not(Not(...))` or another case that the optimizer would otherwise have simplified, and columns in the dynamic filter plan would not be pruned.
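The decomposition idea above can be sketched with a toy model. This is not Spark's Catalyst API, and all names here are hypothetical: the analyzer conservatively rewrites MERGE to a full outer join, and a separate, later rule matches that shape and narrows the join type once clause pruning has determined which unmatched rows are still needed.

```scala
// Toy model (hypothetical names, not Catalyst's Rule/LogicalPlan API).
sealed trait JoinType
case object Inner extends JoinType
case object LeftOuter extends JoinType
case object RightOuter extends JoinType
case object FullOuter extends JoinType

// Target is modeled as the left side of the join, source as the right side.
case class MergePlan(
    joinType: JoinType,
    needsUnmatchedTargetRows: Boolean, // e.g. NOT MATCHED BY SOURCE clauses remain
    needsUnmatchedSourceRows: Boolean) // e.g. NOT MATCHED (insert) clauses remain

// Analyzer step: always produce a full outer join for a merge.
def analyzeMerge(needsUnmatchedTarget: Boolean, needsUnmatchedSource: Boolean): MergePlan =
  MergePlan(FullOuter, needsUnmatchedTarget, needsUnmatchedSource)

// Separate optimizer rule: match a merge over a full join and check
// whether the join type can be changed.
def narrowJoin(plan: MergePlan): MergePlan = plan match {
  case MergePlan(FullOuter, false, false) => plan.copy(joinType = Inner)
  case MergePlan(FullOuter, true, false)  => plan.copy(joinType = LeftOuter)
  case MergePlan(FullOuter, false, true)  => plan.copy(joinType = RightOuter)
  case _                                  => plan
}
```

The point of the sketch is only that the narrowing check can live in its own rule, independent of the analyzer rewrite and of the rule that prunes `false`-condition clauses.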
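The `Not(Not(...))` concern can also be made concrete with a minimal toy expression tree (again hypothetical, not Catalyst's `Expression` API): this is the kind of double-negation collapse the optimizer normally applies, and which introduced expressions would miss if the rewrite runs after operator optimization.

```scala
// Toy expression tree; names are hypothetical, not Catalyst classes.
sealed trait Expr
case class Col(name: String) extends Expr
case class Not(child: Expr) extends Expr

// Simplification of the kind an expression-optimization rule would do:
// Not(Not(e)) collapses to e, recursively.
def simplifyNot(e: Expr): Expr = e match {
  case Not(Not(inner)) => simplifyNot(inner)
  case Not(inner)      => Not(simplifyNot(inner))
  case other           => other
}
```

If a rewrite rule wraps an already-negated condition after this simplification has run, the redundant `Not(Not(...))` survives into the physical plan.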
Overall, I'm leaning toward rewriting in the analyzer. I think we can still build the same optimizations, just in different ways. That said, we probably don't need to consider this until after the next release: getting something working is the first step, and then we can talk about refactoring into separate rules.
