leanken edited a comment on pull request #29304:
URL: https://github.com/apache/spark/pull/29304#issuecomment-667760805
> > let's say there is a record
> > (1, null, 3) in probe side, if there is a (1,2,3) in build side, it's
counted as `MATCH` in comparison. basically if i want to avoid 0(M*N) which is
loop look up in build side, i will have to expand (1,2,3) with all combination
null padding new records like
> > Original key expand to 2^3 -1 = 7X keys, and we can use probe side
record (1, null, 3) to just directly hash loop up with such data duplication. I
don't know if I make it clean for you @agrawaldevesh , it is a bit hard for me
to explain in english. ^_^
> > (1, 2,3 )
> > (null, 2, 3)
> > (1, null, 3)
> > (1, 2, null)
> > (null, null, 3)
> > (null, 2, null)
> > (1, null, null)
>
> Let's consider for both steps 2 and 3 of section 6.2 in the NAAJ
separately:
>
> * Step 2: Say there is a right (build) side row (1, null, 3). It should be
counted as a match against a row on the left side (1, 2, 3). What makes this
tricky is that say say you have a build row (1, 5, 3), then (1, 5, 3) should
NOT match the probe row (1, 2, 3). But if you explode (1, 5, 3) into a (1,
null, 3) then it might incorrectly match (1, 2, 3). How do you handle both of
these subcases ?
> * Step 3: Consider a build row (1, 5, null), it should match the left row
(1, null, 3). In addition, it should not match the build row (1, 5, 7). How do
you handle these subcases ?
After we expand data in BuildSide
streamedSide (1 , 2, 3) and buildSide (1, null, 3) is not counted as a
Match, we are looking for exactly same UnsafeRow, which has same hashCode,
therefore we can use streamedUnsafeRow as key to look up in HashedRelation.
it's a bit tricky, but to sum up. After the expansion, I need to find exact
same match include null column from streamedSide in HashedRelation, which is
counted as a Match.
Just see how it is probed in Code.
```
if (hashed == EmptyHashedRelation) {
streamedIter
} else if (hashed == EmptyHashedRelationWithAllNullKeys) {
Iterator.empty
} else {
val keyGenerator = UnsafeProjection.create(
BindReferences.bindReferences[Expression](
leftKeys,
AttributeSeq(left.output))
)
streamedIter.filter(row => {
val lookupKey: UnsafeRow = keyGenerator(row)
if (lookupKey.allNull()) {
false
} else {
// Anti Join: Drop the row on the streamed side if it is a
match on the build
// !!!! if hashed is a (1, null, 3), lookupKey is a (1,2,3) the return
would be null
hashed.get(lookupKey) == null
}
})
}
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]