[ https://issues.apache.org/jira/browse/SPARK-18552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15692322#comment-15692322 ]

Liwei Lin commented on SPARK-18552:
-----------------------------------

Yea that's right. Thanks a lot for the clarification!

Another case I thought about: today the watermark could evolve this way: 
5 (logged), 10 (logged), 15 (not logged), failure + recovery, 10 (recovered), 
20 (logged). Note that the value 15 appeared once and was erased later on.

At first this seemed to me to break the promise we made -- {{...note that the 
watermark is guaranteed to move forward wrt to processing/clock time}} (quoted 
from the watermark design doc). Then I figured the erased value 15 would not 
lead the sink to receive more or fewer outputs, i.e. correctness is not 
affected. Only metrics or the like might have observed the value 15.
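To make the evolution above concrete, here is a minimal sketch (not Spark's actual implementation; `Tracker` and its methods are hypothetical names) of a watermark tracker that persists its value only when a batch commit is logged. On recovery it restarts from the last logged value, so an advance that was never logged (the 15) is erased:

```scala
// Hypothetical sketch: watermark persists only on logged batch commits.
object WatermarkRecoverySketch {
  final case class Tracker(current: Long, lastLogged: Long) {
    // Watermark only moves forward within a run.
    def advance(to: Long): Tracker = copy(current = math.max(current, to))
    // A logged batch commit persists the current watermark.
    def commit(): Tracker = copy(lastLogged = current)
    // After failure, restart from the last logged value.
    def recover(): Tracker = Tracker(lastLogged, lastLogged)
  }

  def run(): Seq[Long] = {
    var t = Tracker(0L, 0L)
    val observed = scala.collection.mutable.ArrayBuffer[Long]()
    t = t.advance(5);  t = t.commit(); observed += t.current // 5  (logged)
    t = t.advance(10); t = t.commit(); observed += t.current // 10 (logged)
    t = t.advance(15);                 observed += t.current // 15 (NOT logged)
    t = t.recover();                   observed += t.current // 10 (recovered)
    t = t.advance(20); t = t.commit(); observed += t.current // 20 (logged)
    observed.toSeq
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(",")) // 5,10,15,10,20 -- the 15 was erased
}
```

The sketch shows why correctness survives: downstream output is driven by committed batches, so the transient 15 is visible only to observers of the in-flight value.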

So I'm closing this. [~zsxwing] thanks again.

> Watermark should not rely on sinks to proceed
> ---------------------------------------------
>
>                 Key: SPARK-18552
>                 URL: https://issues.apache.org/jira/browse/SPARK-18552
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>            Reporter: Liwei Lin
>            Priority: Critical
>
> Today, for the watermark to be collected and to proceed correctly, a sink 
> should trigger the real execution of the dataset it receives in every batch.
> However, during the recovery process, a sink might skip a batch (such as in 
> https://github.com/apache/spark/blob/v2.0.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala#L202-L204),
>  and then the watermark goes wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
