arunmahadevan commented on issue #23634: [SPARK-26154][SS] Streaming left/right 
outer join should not return outer nulls for already matched rows
URL: https://github.com/apache/spark/pull/23634#issuecomment-457785995
 
 
   Shouldn't this be based on the output mode? In update mode it may be OK to 
emit a `null` value for one side and, when the matching events later arrive on 
the other side, re-emit the new rows.
   
   However, in append mode I would assume the result should only be emitted 
after the event time passes the watermark threshold, so there should not be 
updates for the same row.
   
   Is that the case?
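
To make the append-mode expectation concrete, here is a minimal, self-contained sketch in plain Python. It is not Spark code: the class `AppendModeLeftOuterJoin`, the `Event` tuple and the per-event watermark update are all hypothetical simplifications for illustration (a real engine would advance the watermark at batch boundaries). It only shows the behaviour I would expect: a left row is emitted with `None` only when the watermark passes it while it is still unmatched, so a row that has already matched never produces an outer null.

```python
from collections import namedtuple

# Hypothetical event shape, for illustration only.
Event = namedtuple("Event", ["key", "ts", "value"])


class AppendModeLeftOuterJoin:
    """Toy left outer join with one watermark shared by both inputs."""

    def __init__(self, delay):
        self.delay = delay
        self.max_event_time = 0
        self.left = []   # entries of [event, matched_flag]
        self.right = []  # buffered right-side events

    @property
    def watermark(self):
        return self.max_event_time - self.delay

    def process(self, side, event):
        """Feed one event and return the rows emitted as a result."""
        out = []
        if event.ts < self.watermark:
            return out  # late event: dropped, whichever side it came from
        self.max_event_time = max(self.max_event_time, event.ts)
        if side == "left":
            entry = [event, False]
            self.left.append(entry)
            for r in self.right:
                if r.key == event.key:
                    entry[1] = True
                    out.append((event.value, r.value))
        else:
            self.right.append(event)
            for entry in self.left:
                if entry[0].key == event.key:
                    entry[1] = True
                    out.append((entry[0].value, event.value))
        out.extend(self._evict())
        return out

    def _evict(self):
        """Drop state older than the watermark; emit nulls for unmatched left rows."""
        out, keep, wm = [], [], self.watermark
        for entry in self.left:
            if entry[0].ts < wm:
                if not entry[1]:
                    out.append((entry[0].value, None))
            else:
                keep.append(entry)
        self.left = keep
        self.right = [r for r in self.right if r.ts >= wm]
        return out
```

With a delay of 10, feeding left events `a@1` and `b@2`, then right `a@5`, then right `c@20` yields the matched pair for `a` first, and the `None` row for `b` only once the watermark has moved past it.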
   
   I think the confusion here is that the join predicate is also used to 
determine the watermark, whereas ideally the two should be independent. In any 
case, I think the events on both sides should be buffered until the event time 
passes the watermark threshold; otherwise the join would not produce the right 
results. Immediately discarding one side's events implies we tie the watermark 
to the event time of that side without any delay, so late events on that side 
of the input are immediately discarded. However, the input on the other side is 
buffered, which is wrong.
   
   So it seems that there are two different watermarks here (one for each 
input), which seems wrong. Ideally the watermark should be tied to the operator 
(the join) rather than to each input separately, so that the operator can 
compute the result based on its own watermark.
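
A rough sketch of what an operator-level watermark could look like, again in plain Python rather than Spark. The class `OperatorWatermark` and its methods are hypothetical names, and taking the minimum across inputs is my assumption about a reasonable policy, not a statement of how the current implementation works. The point is only that both inputs are judged against the same threshold.

```python
class OperatorWatermark:
    """Single watermark for a join, derived from all of its inputs."""

    def __init__(self, delay, inputs):
        self.delay = delay
        # Highest event time seen so far on each input (None until data arrives).
        self.max_seen = {name: None for name in inputs}

    def observe(self, name, event_time):
        """Record an event time seen on the given input."""
        current = self.max_seen[name]
        if current is None or event_time > current:
            self.max_seen[name] = event_time

    def current(self):
        """Return the operator watermark, or None if some input has no data yet."""
        if any(v is None for v in self.max_seen.values()):
            return None
        # Assumption: the slowest input holds the watermark back.
        return min(self.max_seen.values()) - self.delay

    def is_late(self, event_time):
        """The same lateness rule applies to events from either side."""
        wm = self.current()
        return wm is not None and event_time < wm
```

For example, with a delay of 10, after observing event time 100 on the left input and 50 on the right, the operator watermark is 40, and an event at time 39 is late regardless of which side it arrives on.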
   
