mosche commented on issue #24314: URL: https://github.com/apache/beam/issues/24314#issuecomment-1334895553
@apilloud I've already checked / compared `SQLPlan` and `BEAMPlan`. In the logs both plans look fine, the subtile difference can only be observed if checking the instances / ids of `RelNodes` in the plan. If a query contains a subquery that is repeatedly used (either by referencing a named subquery or just by repeating the same query string), this is reduced to a single instance RelNode that is reused throughout `SQLPlan`. The problem is that the `CheapestPlanReplacer ` copies all `RelNodes` after applying a rule, in this case the `BeamRelNode` related rules. After applying Beam's rules, `BEAMPlan ` has potentially lost a lot of semantic information about the query: knowledge of repeatedly used subqueries. This is rather bad because it's not possible for runners to optimize for that usage pattern anymore. This is the [hack](https://github.com/mosche/beam/commit/874fa34e7ed8567678443775adf520d326a30243) i've used to generate the transform hierarchies I've visualized above: - `denormalized`: The situation as-is when `BeamRelNodes` are copied leading to independent `PCollection` instances. - `normalized`: A "hacked" copy that returns a cached `BeamRelNode` instance if previously copied with the same parameters. This way the semantic information is kept in the plan. And when `SqlTransform` is expanded we end up reusing such `PCollections` as these are cached by `RelNode#getId` (see [here](https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSqlRelUtils.java#L91-L110)) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
