c21 edited a comment on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-671542777


   @cloud-fan, @agrawaldevesh and @viirya - if we go with `<keyIndex, valueIndex>`, I think:
   
   1. We probably still need one new abstract method on `HashedRelation`, e.g.
   
   `HashedRelation.getWithIndex(key: InternalRow): Iterator[(InternalRow, Int, Int)]`
   
   where for `UnsafeHashedRelation`, the first `Int` is `BytesToBytesMap.Location.pos` (serving as the key index) and the second `Int` is `BytesToBytesMap.Location.valueOffset` (serving as the value index).
   
   For `LongHashedRelation`, `getWithIndex` will just throw `UnsupportedOperationException` for now.
   
   We also still need `HashedRelation.valuesWithIndex(): Iterator[(InternalRow, Int, Int)]` to return all rows together with their key and value indexes, for outputting rows on the build side. (See the API sketch after this list.)
   
   2. In `ShuffledHashJoinExec.fullOuterJoin`, when streaming input rows, `getWithIndex` will be called instead of `get`/`getValue` to look up build-side rows. A plain Java `HashSet` or a `LongToUnsafeRowMap` (I will check the feasibility of packing the key and value indexes into one long in that case; see the second sketch below) would be maintained separately inside `ShuffledHashJoinExec.fullOuterJoin`, to keep track of matched build-side rows.
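   
   A minimal sketch of the proposed API shape for step 1 (the trait name `HashedRelationWithIndex` is purely illustrative; the plan is to add these as abstract methods on `HashedRelation` itself):
   
   ```scala
   import org.apache.spark.sql.catalyst.InternalRow
   
   // Illustrative-only trait; in practice these would be new abstract
   // methods on HashedRelation directly.
   trait HashedRelationWithIndex {
     // Returns the matched build-side rows for `key`, each tagged with a
     // (keyIndex, valueIndex) pair identifying the row. For
     // UnsafeHashedRelation these would come from
     // BytesToBytesMap.Location.pos and BytesToBytesMap.Location.valueOffset.
     def getWithIndex(key: InternalRow): Iterator[(InternalRow, Int, Int)]
   
     // Returns all build-side rows with their key/value indexes, used when
     // outputting build-side rows at the end of the full outer join.
     def valuesWithIndex(): Iterator[(InternalRow, Int, Int)]
   }
   
   // LongHashedRelation would initially opt out, e.g.:
   //   override def getWithIndex(key: InternalRow) =
   //     throw new UnsupportedOperationException("not supported yet")
   ```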
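   
   And a minimal, runnable illustration of the "squish into one long" idea from step 2 (names here are hypothetical helpers, not Spark APIs):
   
   ```scala
   object MatchedRowTracking {
     // Pack a (keyIndex, valueIndex) pair into a single Long:
     // high 32 bits = key index, low 32 bits = value index.
     def pack(keyIndex: Int, valueIndex: Int): Long =
       (keyIndex.toLong << 32) | (valueIndex.toLong & 0xFFFFFFFFL)
   
     def unpackKeyIndex(packed: Long): Int = (packed >>> 32).toInt
     def unpackValueIndex(packed: Long): Int = packed.toInt
   
     def main(args: Array[String]): Unit = {
       // While streaming probe-side rows: record every matched build row.
       val matched = new java.util.HashSet[java.lang.Long]()
       matched.add(pack(3, 17))
   
       // After streaming: a build row is unmatched iff its packed index
       // is absent from `matched`.
       assert(matched.contains(pack(3, 17)))
       assert(unpackKeyIndex(pack(3, 17)) == 3)
       assert(unpackValueIndex(pack(3, 17)) == 17)
     }
   }
   ```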
   
   Does it sound good as a plan? Thanks.

