HeartSaVioR commented on a change in pull request #26108: [SPARK-26154][SS]
Streaming left/right outer join should not return outer nulls for already
matched rows
URL: https://github.com/apache/spark/pull/26108#discussion_r339747402
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala
##########
@@ -435,9 +541,10 @@ class SymmetricHashJoinStateManager(
}
/** Put new value for key at the given index */
- def put(key: UnsafeRow, valueIndex: Long, value: UnsafeRow): Unit = {
+ def put(key: UnsafeRow, valueIndex: Long, value: UnsafeRow, matched:
Boolean): Unit = {
val keyWithIndex = keyWithIndexRow(key, valueIndex)
- stateStore.put(keyWithIndex, value)
+ val valueWithMatched = valueRowConverter.convertToValueRow(value,
matched)
Review comment:
> One thing is that a "rewrite" of the state in this case, at least, is not
a good option. The data is missing from the v1 data, so there's no way to
generate correct v2 data from it.
Yeah you're right. At least for this case rewrite doesn't work. I forgot
what I've already found months ago.
> I can think of ways to hack it (you could e.g. compare the column count
for the UnsafeRow read from the store and see if it matches the count in the
expected schema, or if it has the extra fields expected by the v2 data), but
haven't thoroughly thought about it.
I also had thought about it a bit (and that was maybe one of review comment
in previous PR) but it can bring side-effect if the query is changed. If end
users change the query to let input of join containing one more column, there's
a chance Spark may read "v1 row" as "v2 row" incorrectly whereas Spark should
just fail the query since schema has been changed. Relying on column count is
unsafe.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]