xinyuezg commented on PR #8965:
URL:
https://github.com/apache/incubator-gluten/pull/8965#issuecomment-3353138717
> @xinyuezg So what does the data diff look like in your env? Are there any
> duplicate records?
Examples we encountered in `GlutenOuterJoinSuiteForceShjOff`:
* no condition full outer join using BroadcastNestedLoopJoin build left
(whole-stage-codegen off)
* no condition full outer join using BroadcastNestedLoopJoin build left
(whole-stage-codegen on)
* no condition full outer join using BroadcastNestedLoopJoin build right
(whole-stage-codegen off)
* no condition full outer join using BroadcastNestedLoopJoin build right
(whole-stage-codegen on)
Zooming into "no condition full outer join using BroadcastNestedLoopJoin build
left":
Initial input data:
```
left(a, b)
(1, 2.0)
(2, 100.0)
(2, 1.0)
(2, 1.0) ← duplicated
(3, 3.0)
(5, 1.0)
(6, 6.0)
(null, null)
right(c, d)
(0, 0.0)
(2, 3.0)
(2, -1.0)
(2, -1.0) ← duplicated
(2, 3.0) ← duplicated
(3, 2.0)
(4, 1.0)
(5, 3.0)
(7, 7.0)
(null, null)
```
Spark plan:
```
BroadcastNestedLoopJoin BuildLeft, FullOuter
:- Filter (isnotnull(a#220) AND (a#220 = 2))
: +- Scan ExistingRDD[a#220,b#221]
+- Filter (isnotnull(c#226) AND (c#226 = 2))
+- Scan ExistingRDD[c#226,d#227]
```
So the effective inputs to the join are:
* filteredLeft (build side) = {(2, 100.0), (2, 1.0), (2, 1.0)} → 3 rows
* filteredRight (probe side) = {(2, 3.0), (2, -1.0), (2, -1.0), (2, 3.0)} →
4 rows
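To make the expected row count concrete, here is a minimal plain-Python sketch (no Spark or Gluten involved) that applies the two filters from the plan and then computes a full outer join with no condition. The helper names `keep_key_2` and `full_outer_no_condition` are hypothetical and exist only for this illustration; `None` stands in for SQL NULL.

```python
from itertools import product

# Input rows copied from the "Initial input data" listing above.
left = [(1, 2.0), (2, 100.0), (2, 1.0), (2, 1.0), (3, 3.0),
        (5, 1.0), (6, 6.0), (None, None)]
right = [(0, 0.0), (2, 3.0), (2, -1.0), (2, -1.0), (2, 3.0),
         (3, 2.0), (4, 1.0), (5, 3.0), (7, 7.0), (None, None)]


def keep_key_2(rows):
    # Mirrors Filter (isnotnull(x) AND (x = 2)) on the first column.
    return [r for r in rows if r[0] is not None and r[0] == 2]


def full_outer_no_condition(build, probe):
    # With no join condition every (build, probe) pair matches, so
    # null-padded rows are only produced when one side is empty.
    if not build:
        return [(None, None) + p for p in probe]
    if not probe:
        return [b + (None, None) for b in build]
    return [b + p for b, p in product(build, probe)]


filtered_left = keep_key_2(left)
filtered_right = keep_key_2(right)
expected = full_outer_no_condition(filtered_left, filtered_right)

print(len(filtered_left), len(filtered_right), len(expected))  # 3 4 12
```

Because both filtered sides are non-empty, the result is the plain cross product: 3 × 4 = 12 rows, none of them null-padded.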
Results:
```
== Results ==
!== Expected Answer - 12 ==   == Actual Answer - 15 ==
 [2,1.0,2,-1.0]               [2,1.0,2,-1.0]
 [2,1.0,2,-1.0]               [2,1.0,2,-1.0]
 [2,1.0,2,-1.0]               [2,1.0,2,-1.0]
 [2,1.0,2,-1.0]               [2,1.0,2,-1.0]
 [2,1.0,2,3.0]                [2,1.0,2,3.0]
 [2,1.0,2,3.0]                [2,1.0,2,3.0]
 [2,1.0,2,3.0]                [2,1.0,2,3.0]
 [2,1.0,2,3.0]                [2,1.0,2,3.0]
![2,100.0,2,-1.0]             [2,1.0,null,null]
![2,100.0,2,-1.0]             [2,1.0,null,null]
![2,100.0,2,3.0]              [2,100.0,2,-1.0]
![2,100.0,2,3.0]              [2,100.0,2,-1.0]
!                             [2,100.0,2,3.0]
!                             [2,100.0,2,3.0]
!                             [2,100.0,null,null]
```
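The side-by-side output is positional, so it is easier to see what differs by comparing the two answers as multisets. The sketch below transcribes both columns from the diff above (with `None` for `null`) and uses `collections.Counter` to list the surplus rows; it is an illustration of the comparison only, not code from the test suite.

```python
from collections import Counter

# Expected answer, transcribed from the left column of the diff.
expected = (
    [(2, 1.0, 2, -1.0)] * 4
    + [(2, 1.0, 2, 3.0)] * 4
    + [(2, 100.0, 2, -1.0)] * 2
    + [(2, 100.0, 2, 3.0)] * 2
)

# Actual answer, transcribed from the right column of the diff.
actual = (
    [(2, 1.0, 2, -1.0)] * 4
    + [(2, 1.0, 2, 3.0)] * 4
    + [(2, 1.0, None, None)] * 2
    + [(2, 100.0, 2, -1.0)] * 2
    + [(2, 100.0, 2, 3.0)] * 2
    + [(2, 100.0, None, None)]
)

# Counter subtraction keeps only positive counts, i.e. a multiset difference.
extra = Counter(actual) - Counter(expected)
missing = Counter(expected) - Counter(actual)

# Sort on the non-null left columns to avoid comparing None values.
for row, count in sorted(extra.items(), key=lambda kv: kv[0][:2]):
    print(count, row)
print("missing:", dict(missing))
```

All 12 expected rows are present in the actual answer. The 3 surplus rows are null-padded copies of the 3 filteredLeft (build side) rows, which looks consistent with build-side rows also being emitted as unmatched, though that is a reading of the data rather than a confirmed diagnosis.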