arunmahadevan commented on issue #23634: [SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows
URL: https://github.com/apache/spark/pull/23634#issuecomment-457785995

Shouldn't this be based on the output mode? In update mode it may be OK to emit a `null` value for one side and, when the matching events later arrive on the other side, re-emit the new rows. In append mode, however, I would assume the result should only be emitted after the event time passes the watermark threshold, so there should be no updates for the same row. Is that the case?

I think the confusion here is that the join predicate is also used to determine the watermark, whereas ideally the two should be independent. In any case, I think the events on both sides should be buffered until the event time passes the watermark threshold; otherwise the join would not produce the right results.

Immediately discarding events on one side implies that we tie the watermark to that side's event time without any delay, so late events on that side are discarded immediately, while the input on the other side is buffered, which is wrong. So it seems there are two different watermarks here (one for each input), which seems wrong. Ideally the watermark should be tied to the operator (the join), not kept separately for each input, so that the operator can compute the result based on its own watermark.
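
To make the append-mode behaviour I have in mind concrete, here is a rough toy model in plain Python. It is not Spark code and not how `StreamingSymmetricHashJoinExec` works; the class `ToyLeftOuterJoin`, its `process` method, and the `(key, event_time, value)` row layout are all made up for illustration. It assumes an equality join on the key, a single watermark owned by the join operator (max event time seen on either input minus a delay), and, as a simplification, it advances the watermark within the same batch.

```python
class ToyLeftOuterJoin:
    """Toy append-mode left outer join with ONE operator-level watermark.

    Hypothetical sketch: rows are (key, event_time, value) tuples and the
    join condition is plain key equality.
    """

    def __init__(self, delay):
        self.delay = delay
        self.max_event_time = None
        self.left = []   # buffered left rows: [key, event_time, value, matched]
        self.right = []  # buffered right rows: (key, event_time, value)

    def watermark(self):
        if self.max_event_time is None:
            return float("-inf")
        return self.max_event_time - self.delay

    def process(self, left_rows, right_rows):
        # Late rows are dropped against the same watermark on both sides.
        wm = self.watermark()
        left_rows = [r for r in left_rows if r[1] >= wm]
        right_rows = [r for r in right_rows if r[1] >= wm]
        out = []

        # New left rows join against the right rows buffered so far.
        for key, ts, val in left_rows:
            entry = [key, ts, val, False]
            for rkey, _, rval in self.right:
                if rkey == key:
                    out.append((key, val, rval))
                    entry[3] = True
            self.left.append(entry)

        # New right rows join against all buffered left rows (old and new).
        for key, ts, val in right_rows:
            for entry in self.left:
                if entry[0] == key:
                    out.append((key, entry[2], val))
                    entry[3] = True
            self.right.append((key, ts, val))

        # Advance the single watermark from events seen on either input.
        seen = [r[1] for r in left_rows + right_rows]
        if seen:
            newest = max(seen)
            if self.max_event_time is None or newest > self.max_event_time:
                self.max_event_time = newest
        wm = self.watermark()

        # Outer nulls are emitted only at eviction, and only for rows
        # that never matched.
        for key, ts, val, matched in self.left:
            if ts < wm and not matched:
                out.append((key, val, None))
        self.left = [e for e in self.left if e[1] >= wm]
        self.right = [e for e in self.right if e[1] >= wm]
        return out


join = ToyLeftOuterJoin(delay=10)
batch1 = join.process([("a", 1, "L-a"), ("b", 2, "L-b")], [])
batch2 = join.process([], [("a", 5, "R-a")])
batch3 = join.process([("c", 20, "L-c")], [])
batch4 = join.process([], [("b", 3, "R-b")])
print(batch1)  # []
print(batch2)  # [('a', 'L-a', 'R-a')]
print(batch3)  # [('b', 'L-b', None)]
print(batch4)  # []
```

In this model both inputs are buffered until the operator watermark passes them, `a` never gets an outer null because it was matched before eviction, and the late `b` event on the right side is dropped by the same watermark that governs the left side.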
