GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/22402

    [SPARK-25414][SS] The numInputRows metrics can be incorrect for streaming 
self-join

    ## What changes were proposed in this pull request?
    
    For self-join/self-union, Spark will produce a physical plan which has 
multiple `DataSourceV2ScanExec` instances referring to the same `ReadSupport` 
instance. In this case, we should not double count the numInputRows metrics.
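
As a rough illustration of the double-counting problem described above, the sketch below models scan nodes that share a source and sums rows once per unique source. `ReadSupportStub`, `ScanNode`, and `InputRowCounting` are hypothetical stand-ins, not Spark classes, and the sketch assumes every scan of the same source reports the same row count. It is not the code in this patch.

```scala
// Hypothetical stand-ins for illustration only; these are NOT Spark's classes.
// A plain (non-case) class, so equality and hashing use reference identity.
final class ReadSupportStub(val name: String)

// Plays the role of a scan leaf in the physical plan that reads from `source`.
final case class ScanNode(source: ReadSupportStub, numOutputRows: Long)

object InputRowCounting {
  // Naive: sums every scan node, so a self-join counts the same source twice.
  def naiveCount(scans: Seq[ScanNode]): Long =
    scans.map(_.numOutputRows).sum

  // Deduplicated: group scans by source instance and count each source once.
  // Assumption: all scans of one source report the same number of rows.
  def dedupCount(scans: Seq[ScanNode]): Long =
    scans.groupBy(_.source).values.map(_.head.numOutputRows).sum
}
```

With two `ScanNode`s that point at the same `ReadSupportStub` and report 100 rows each, `naiveCount` yields 200 while `dedupCount` yields 100.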
    
    Actually we already have 2 test cases to verify the behavior:
    1. `StreamingQuerySuite.input row calculation with same V2 source used 
twice in self-join`
    2. `KafkaMicroBatchSourceSuiteBase.ensure stream-stream self-join generates 
only one offset in log and correct metrics`.
    
    However, the first test is written incorrectly and exposes the bug. The bug is 
hidden in the second test, because exchange reuse is triggered and the plan 
only has one `DataSourceV2ScanExec` instance.
    
    Note that some people may think this bug was caused by 
https://github.com/apache/spark/pull/22380 . It was not, because I 
didn't change any tests in #22380. We were in fact aware of this bug and even 
mentioned it in a code comment, but forgot to handle it:
    ```
    ...
    // 3. Multiple DataSourceV2ScanExec instance may refer to the same source (can happen with
    //    self-unions or self-joins). Add up the number of rows for each unique source.
    ```
    
    ## How was this patch tested?
    
    The updated test case

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22402.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22402
    
----
commit d2fc07f320fdb12beaf0143442cfe2fe55acbba1
Author: Wenchen Fan <wenchen@...>
Date:   2018-09-12T10:48:03Z

    fix streaming metrics for self-join

----


---
