[PR] [SPARK-48597][SQL] Introduce a marker for isStreaming property in text representation of logical plan [spark]

via GitHub Wed, 12 Jun 2024 01:02:58 -0700


HeartSaVioR opened a new pull request, #46953:
URL: https://github.com/apache/spark/pull/46953


   ### What changes were proposed in this pull request?
   
   This PR proposes to introduce a marker for isStreaming property in text 
representation of logical plan.
   
   The marker will be `~`, along with `!` (invalid) and `'` (unresolved).
   
   This PR proposes to retain the prefix marker as single character (opposed to 
up to two characters). This would be OK in practice, since the moment the 
marker for isStreaming would be useful is to look into the plan which is 
already analyzed - that said, it’s unlikely that we need to see the both one of 
existing marker and the marker for streaming.
   
   ### Why are the changes needed?
   
   This would help tracking down QO issues happening with streaming query much 
easier. For example, here is the example of the rule which triggered 
[SPARK-47305](https://issues.apache.org/jira/browse/SPARK-47305):
   
   ```
   === Applying Rule org.apache.spark.sql.catalyst.optimizer.PruneFilters ===
    WriteToMicroBatchDataSource MemorySink, 
49358516-bb57-4f10-badf-00c9f5e2484b, Append, 1   WriteToMicroBatchDataSource 
MemorySink, 49358516-bb57-4f10-badf-00c9f5e2484b, Append, 1
    +- Project [value#45]                                                       
              +- Project [value#45]
       +- Join Inner                                                            
                 +- Join Inner
          :- Project [value#45]                                                 
                    :- Project [value#45]
          :  +- StreamingDataSourceV2ScanRelation[value#45] 
MemoryStreamDataSource                  :  +- 
StreamingDataSourceV2ScanRelation[value#45] MemoryStreamDataSource
          +- Project                                                            
                    +- Project
   !         +- Filter false                                                    
                       +- LocalRelation <empty>, [id#54L]
   !            +- Range (1, 5, step=1, splits=Some(2))          
   ```
   
   The bug of SPARK-47305 was, LocalRelation in above was "incorrectly" marked 
as streaming=true where it should be streaming=false. There is no notion of 
isStreaming flag in the text representation of LocalRelation, hence from the 
text plan we would never know the rule had a bug. Even though we assume we show 
the value of isStreaming in LocalRelation, the depth of subtree could be huge 
in practice and it's not friendly to go down to the leaf node to figure out the 
isStreaming value of the entire subtree.
   
   After this PR, the above rule information will be changed as below:
   
   ```
   === Applying Rule org.apache.spark.sql.catalyst.optimizer.PruneFilters ===
    ~WriteToMicroBatchDataSource MemorySink, 
dcf936d0-8580-47e9-b898-4fffffc21db5, Append, 1   ~WriteToMicroBatchDataSource 
MemorySink, dcf936d0-8580-47e9-b898-4fffffc21db5, Append, 1
    +- ~Project [value#45]                                                      
               +- ~Project [value#45]
       +- ~Join Inner                                                           
                  +- ~Join Inner
          :- ~Project [value#45]                                                
                     :- ~Project [value#45]
          :  +- StreamingDataSourceV2ScanRelation[value#45] 
MemoryStreamDataSource                   :  +- 
StreamingDataSourceV2ScanRelation[value#45] MemoryStreamDataSource
   !      +- Project                                                            
                     +- ~Project
   !         +- Filter false                                                    
                        +- ~LocalRelation <empty>, [id#54L]
   !            +- Range (1, 5, step=1, splits=Some(2))                     
   ```
   
   Now it's obvious that isStreaming flag of leaf node had changed. Also, to 
check the isStreaming flag of children for Join, we just need to look at the 
first node of subtree for children, instead of going down to leaf nodes.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, since the textual representation of logical plan will be changed a bit. 
But it's only applied to the streaming Dataset, and also the textual 
representation of logical plan is arguably not a public API. (Keeping backward 
compatibility of the text is technically very hard.)
   
   ### How was this patch tested?
   
   Existing UTs for regression text on batch query. For streaming query, the 
above section contains the changed text representation of logical plan.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [SPARK-48597][SQL] Introduce a marker for isStreaming property in text representation of logical plan [spark]

Reply via email to