HeartSaVioR opened a new pull request, #46953: URL: https://github.com/apache/spark/pull/46953

### What changes were proposed in this pull request?

This PR introduces a marker for the `isStreaming` property in the text representation of a logical plan. The marker is `~`, joining the existing markers `!` (invalid) and `'` (unresolved).

The prefix marker stays a single character (as opposed to up to two characters). This should be fine in practice: the streaming marker is most useful when inspecting a plan that has already been analyzed, so it is unlikely that we would need to see one of the existing markers and the streaming marker on the same node.

### Why are the changes needed?

This makes it much easier to track down QO issues in streaming queries. For example, here is the output for the rule that triggered [SPARK-47305](https://issues.apache.org/jira/browse/SPARK-47305):

```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PruneFilters ===
 WriteToMicroBatchDataSource MemorySink, 49358516-bb57-4f10-badf-00c9f5e2484b, Append, 1   WriteToMicroBatchDataSource MemorySink, 49358516-bb57-4f10-badf-00c9f5e2484b, Append, 1
 +- Project [value#45]                                                                     +- Project [value#45]
    +- Join Inner                                                                             +- Join Inner
       :- Project [value#45]                                                                     :- Project [value#45]
       :  +- StreamingDataSourceV2ScanRelation[value#45] MemoryStreamDataSource                  :  +- StreamingDataSourceV2ScanRelation[value#45] MemoryStreamDataSource
       +- Project                                                                                +- Project
!         +- Filter false                                                                           +- LocalRelation <empty>, [id#54L]
!            +- Range (1, 5, step=1, splits=Some(2))
```

The bug in SPARK-47305 was that the LocalRelation above was incorrectly marked as streaming=true when it should have been streaming=false. The text representation of LocalRelation has no notion of the isStreaming flag, so the text plan alone would never reveal that the rule had a bug. Even if LocalRelation did show the value of isStreaming, the subtree can be very deep in practice, and it is not friendly to walk down to the leaf node to figure out the isStreaming value of the entire subtree.

After this PR, the rule output above changes as follows:

```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PruneFilters ===
 ~WriteToMicroBatchDataSource MemorySink, dcf936d0-8580-47e9-b898-4fffffc21db5, Append, 1   ~WriteToMicroBatchDataSource MemorySink, dcf936d0-8580-47e9-b898-4fffffc21db5, Append, 1
 +- ~Project [value#45]                                                                     +- ~Project [value#45]
    +- ~Join Inner                                                                             +- ~Join Inner
       :- ~Project [value#45]                                                                     :- ~Project [value#45]
       :  +- StreamingDataSourceV2ScanRelation[value#45] MemoryStreamDataSource                   :  +- StreamingDataSourceV2ScanRelation[value#45] MemoryStreamDataSource
!      +- Project                                                                                 +- ~Project
!         +- Filter false                                                                            +- ~LocalRelation <empty>, [id#54L]
!            +- Range (1, 5, step=1, splits=Some(2))
```

Now it is obvious that the isStreaming flag of the leaf node changed. Also, to check the isStreaming flag of each child of the Join, we only need to look at the first node of that child's subtree instead of going down to the leaf nodes.

### Does this PR introduce _any_ user-facing change?

Yes, since the textual representation of the logical plan changes slightly. However, the change only applies to streaming Datasets, and the textual representation of a logical plan is arguably not a public API. (Keeping the text backward compatible is technically very hard.)

### How was this patch tested?

Existing UTs serve as regression tests for batch queries. For streaming queries, the section above shows the changed text representation of the logical plan.

### Was this patch authored or co-authored using generative AI tooling?

No.
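
For readers who want to see the idea in isolation, here is a minimal Python sketch of the marker scheme described in this PR. It is a toy model, not Spark's implementation: the `Node` class, its fields, the `render` and `build` helpers, the `StreamSource` leaf name, and the simplified tree drawing are all hypothetical. The precedence order between `'`, `!`, and `~` is also an assumption. The PR only states that the prefix stays a single character.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy stand-in for a logical plan node (hypothetical, not Spark's LogicalPlan)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    streaming_leaf: bool = False  # only consulted when the node has no children
    resolved: bool = True
    valid: bool = True

    @property
    def is_streaming(self) -> bool:
        # A leaf reports its own flag; an inner node is streaming
        # if any of its children is streaming.
        if not self.children:
            return self.streaming_leaf
        return any(child.is_streaming for child in self.children)

    @property
    def prefix(self) -> str:
        # Single-character marker. ASSUMPTION: unresolved wins over invalid,
        # which wins over streaming.
        if not self.resolved:
            return "'"
        if not self.valid:
            return "!"
        return "~" if self.is_streaming else ""


def render(node: Node, depth: int = 0) -> List[str]:
    """Render the tree one node per line (simplified: no ':-' connectors)."""
    indent = "" if depth == 0 else "   " * (depth - 1) + "+- "
    lines = [f"{indent}{node.prefix}{node.name}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


def build(local_relation_streaming: bool) -> Node:
    """Build a small join whose right-hand leaf may carry a wrong streaming flag."""
    return Node("Join Inner", [
        Node("Project [value#45]", [Node("StreamSource", streaming_leaf=True)]),
        Node("Project", [
            Node("LocalRelation <empty>", streaming_leaf=local_relation_streaming),
        ]),
    ])


# Correct flag on the leaf: the right-hand subtree carries no marker.
print("\n".join(render(build(local_relation_streaming=False))))
print()
# Wrong flag on the leaf: the marker shows up at the top of the subtree.
print("\n".join(render(build(local_relation_streaming=True))))
```

In this sketch, a wrong flag on the leaf is visible on the right-hand `Project` directly under the join, so there is no need to walk down to the `LocalRelation` to notice it.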
