c21 edited a comment on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-671542777
@cloud-fan, @agrawaldevesh and @viirya - if we go with `<keyIndex, valueIndex>`, I think:

1. We probably still need one new abstract method on `HashedRelation`, e.g. `HashedRelation.getWithIndex(key: InternalRow): Iterator[(InternalRow, Int, Int)]`. For `UnsafeHashedRelation`, the 1st `Int` is `BytesToBytesMap.Location.pos` (serving as the key index) and the 2nd `Int` is `BytesToBytesMap.Location.valueOffset` (serving as the value index). For `LongHashedRelation`, `getWithIndex` will throw `UnsupportedOperationException` for now. We also still need `HashedRelation.valuesWithIndex(): Iterator[(InternalRow, Int, Int)]` to return all rows with their key and value indexes, for outputting rows on the build side. (A sketch follows this list.)

2. In `ShuffledHashJoinExec.fullOuterJoin`, when streaming input rows, `getWithIndex` will be called instead of `get`/`getValue` to get build side rows. A plain Java `HashSet`, or a `LongToUnsafeRowMap` (I will check the feasibility of packing the key and value indexes into one long in this case), would be maintained separately inside `ShuffledHashJoinExec.fullOuterJoin` to keep track of matched build side rows. (A packing sketch also follows below.)

Does it sound good as a plan? Thanks.
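For concreteness, here is a minimal Scala sketch of the API additions described in point 1. It is my reading of the shape of the change, not the actual patch; the standalone `HashedRelationWithIndex` trait name is invented here purely so the sketch is self-contained (the proposal is to add these methods to the existing `HashedRelation`):

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical trait, standing in for methods that would be added to
// the existing HashedRelation in Spark.
trait HashedRelationWithIndex {
  // Look up `key`, returning each matched build-side row together with
  // its (keyIndex, valueIndex) pair. For UnsafeHashedRelation these
  // would come from BytesToBytesMap.Location (pos and valueOffset);
  // for LongHashedRelation this would throw
  // UnsupportedOperationException for now.
  def getWithIndex(key: InternalRow): Iterator[(InternalRow, Int, Int)]

  // Return all build-side rows with their (keyIndex, valueIndex), used
  // to emit the unmatched build-side rows of a FULL OUTER join at the end.
  def valuesWithIndex(): Iterator[(InternalRow, Int, Int)]
}
```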
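And a small, self-contained illustration of the packing idea in point 2: squeezing the two `Int` indexes into one `Long` so matched build-side rows can be tracked in a single long-keyed set. This only sketches the bit manipulation, not the eventual implementation:

```scala
object MatchedRowTracking {
  // Pack (keyIndex, valueIndex) into one Long: keyIndex in the high
  // 32 bits, valueIndex in the low 32 bits.
  def pack(keyIndex: Int, valueIndex: Int): Long =
    (keyIndex.toLong << 32) | (valueIndex.toLong & 0xFFFFFFFFL)

  // Recover the original pair from the packed Long.
  def unpack(packed: Long): (Int, Int) =
    ((packed >>> 32).toInt, packed.toInt)
}

// Usage inside fullOuterJoin (illustrative): record each matched build
// row while streaming, then emit the build rows that were never matched.
val matched = new java.util.HashSet[java.lang.Long]()
matched.add(MatchedRowTracking.pack(3, 7))
assert(MatchedRowTracking.unpack(MatchedRowTracking.pack(3, 7)) == (3, 7))
```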
