GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/22402
[SPARK-25414][SS] The numInputRows metrics can be incorrect for streaming
self-join
## What changes were proposed in this pull request?
For self-join/self-union, Spark will produce a physical plan which has
multiple `DataSourceV2ScanExec` instances referring to the same `ReadSupport`
instance. In this case, we should not double count the numInputRows metrics.
Actually we already have 2 test cases to verify the behavior:
1. `StreamingQuerySuite.input row calculation with same V2 source used
twice in self-join`
2. `KafkaMicroBatchSourceSuiteBase.ensure stream-stream self-join generates
only one offset in log and correct metrics`.
However, the first test is wrong written and exposes the bug. The bug is
hidden in the second test, because exchange reuse is triggered and the plan
only has one `DataSourceV2ScanExec` instance.
Note that, some people may think this bug is caused by
https://github.com/apache/spark/pull/22380 . It's obviously not because I
didn't change any test in #22380. Actually we are aware of this bug and even
mentioned it in the comment, but forgot to handle it:
```
...
// 3. Multiple DataSourceV2ScanExec instance may refer to the same source
(can happen with
// self-unions or self-joins). Add up the number of rows for each unique
source.
```
## How was this patch tested?
The updated test case
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark bug
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22402.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22402
----
commit d2fc07f320fdb12beaf0143442cfe2fe55acbba1
Author: Wenchen Fan <wenchen@...>
Date: 2018-09-12T10:48:03Z
fix streaming metrics for self-join
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]