mosche commented on issue #24314:
URL: https://github.com/apache/beam/issues/24314#issuecomment-1334895553

   @apilloud I've already checked / compared `SQLPlan` and `BEAMPlan`. In the 
logs both plans look fine, the subtile difference can only be observed if 
checking the instances / ids of `RelNodes` in the plan.
   
   If a query contains a subquery that is repeatedly used (either by 
referencing a named subquery or just by repeating the same query string), this 
is reduced to a single instance RelNode that is reused throughout `SQLPlan`.
   
   The problem is that the `CheapestPlanReplacer ` copies all `RelNodes` after 
applying a rule, in this case the `BeamRelNode` related rules. After applying 
Beam's rules, `BEAMPlan ` has potentially lost a lot of semantic information 
about the query: knowledge of repeatedly used subqueries. This is rather bad 
because it's not possible for runners to optimize for that usage pattern 
anymore.
   
   This is the 
[hack](https://github.com/mosche/beam/commit/874fa34e7ed8567678443775adf520d326a30243)
 i've used to generate the transform hierarchies I've visualized above:
   - `denormalized`: The situation as-is when `BeamRelNodes` are copied leading 
to independent `PCollection` instances. 
   - `normalized`: A "hacked" copy that returns a cached `BeamRelNode` instance 
if previously copied with the same parameters. This way the semantic 
information is kept in the plan. And when `SqlTransform` is expanded we end up 
reusing such `PCollections` as these are cached by `RelNode#getId` (see 
[here](https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSqlRelUtils.java#L91-L110))


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to