HeartSaVioR commented on a change in pull request #26108: [SPARK-26154][SS] 
Streaming left/right outer join should not return outer nulls for already 
matched rows
URL: https://github.com/apache/spark/pull/26108#discussion_r339747402
 
 

 ##########
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala
 ##########
 @@ -435,9 +541,10 @@ class SymmetricHashJoinStateManager(
     }
 
     /** Put new value for key at the given index */
-    def put(key: UnsafeRow, valueIndex: Long, value: UnsafeRow): Unit = {
+    def put(key: UnsafeRow, valueIndex: Long, value: UnsafeRow, matched: 
Boolean): Unit = {
       val keyWithIndex = keyWithIndexRow(key, valueIndex)
-      stateStore.put(keyWithIndex, value)
+      val valueWithMatched = valueRowConverter.convertToValueRow(value, 
matched)
 
 Review comment:
   > One thing is that a "rewrite" of the state in this case, at least, is not 
a good option. The data is missing from the v1 data, so there's no way to 
generate correct v2 data from it.
   
   Yeah you're right. At least for this case rewrite doesn't work. I forgot 
what I've already found  months ago. Thanks for reminding.
   
   > I can think of ways to hack it (you could e.g. compare the column count 
for the UnsafeRow read from the store and see if it matches the count in the 
expected schema, or if it has the extra fields expected by the v2 data), but 
haven't thoroughly thought about it.
   
   I also had thought about it a bit (and that was maybe one of review comment 
in previous PR) but it can bring side-effect if the query is changed. If end 
users change the query to let input of join containing one less column, there's 
a chance Spark may read "v1 row" as "v2 row" incorrectly whereas Spark should 
just fail the query since schema has been changed. Relying on column count is 
unsafe.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to