vanzin commented on a change in pull request #26108: [SPARK-26154][SS]
Streaming left/right outer join should not return outer nulls for already
matched rows
URL: https://github.com/apache/spark/pull/26108#discussion_r341374700
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala
##########
@@ -187,17 +217,17 @@ class SymmetricHashJoinStateManager(
// Find the next value satisfying the condition, updating `currentKey`
and `numValues` if
// needed. Returns null when no value can be found.
- private def findNextValueForIndex(): UnsafeRow = {
+ private def findNextValueForIndex(): (UnsafeRow, Boolean) = {
// Loop across all values for the current key, and then all other
keys, until we find a
// value satisfying the removal condition.
def hasMoreValuesForCurrentKey = currentKey != null && index <
numValues
def hasMoreKeys = allKeyToNumValues.hasNext
while (hasMoreValuesForCurrentKey || hasMoreKeys) {
if (hasMoreValuesForCurrentKey) {
// First search the values for the current key.
- val currentValue = keyWithIndexToValue.get(currentKey, index)
+ val (currentValue, matched) =
keyWithIndexToValue.getWithMatched(currentKey, index)
Review comment:
`getWithMatched` can return `null`; is it guaranteed that it won't at this
point? Otherwise you'll get an error here.
Maybe `KeyWithIndexToValueRowConverterFormatV2` should return `(null,
false)` instead of `null`? It would at least be closer to the semantics of the
previous code.
And now that I noticed it, I kinda dislike `getWithMatched` for the same
reason I disliked the other type names.
Is the extra overhead of `valueRowConverter.convertValue` allocating a new
tuple, and this code "unapplying" that tuple, something that we need to worry
about? I'm thinking that maybe if you return a proper type and take the same
reuse approach as is done with `KeyToValuePair`, that would be better?
(But don't know if that can be done safely here.)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]