[jira] [Updated] (SAMZA-1384) Race condition with async commit affects checkpoint correctness

Jake Maes (JIRA) Fri, 04 Aug 2017 11:04:12 -0700

     [ 
https://issues.apache.org/jira/browse/SAMZA-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jake Maes updated SAMZA-1384:
-----------------------------
    Description: 
tl;dr if any in-flight request updates the offsets between producer.flush() and 
offsetmanager.checkpoint() we could write a checkpoint for a message that did 
not yet go out over the wire and could subsequently fail.

Consider two threads A and B. A is performing an async commit. B is an 
in-flight process(). The following sequence will cause data loss:
A: TaskInstance.commit() begins
A: producer.flush() is called // no new messages will go out in this batch

B: producer.send() is called
B: TaskCallback is invoked for the finished process()
B: OffsetManager records the offset for the completed process()

A: producer.flush() finishes
A: checkpoint is written using the latest offsets from the OffsetManager. This 
INCLUDES the offset for the latest send, which has not yet gone out over the 
wire.
A: TaskInstance.commit() finishes

B: producer.send()->callback is invoked with an error. Send was unsuccessful, 
but has been checkpointed already. 
B: Exception is propagated and container fails
Result: Container is restarted and starts from the last checkpoint.

Note that this is only an issue when the commit() occurs concurrently with 
in-flight requests, so it doesn't affect the fully-synchronous mode or 
concurrent mode with synchronous commit().

Proposed solution: 
Take a snapshot of the offsets in the OffsetManager at the beginning of 
commit(). Only checkpoint those offsets and nothing new that has been sent 
since the commit() started.

  was:
Consider two threads A and B. A is performing an async commit. B is an 
in-flight process(). The following sequence will cause data loss:
A: TaskInstance.commit() begins
A: producer.flush() is called // no new messages will go out in this batch

B: producer.send() is called
B: TaskCallback is invoked for the finished process()
B: OffsetManager records the offset for the completed process()

A: producer.flush() finishes
A: checkpoint is written using the latest offsets from the OffsetManager. This 
INCLUDES the offset for the latest send, which has not yet gone out over the 
wire.
A: TaskInstance.commit() finishes

B: producer.send()->callback is invoked with an error. Send was unsuccessful, 
but has been checkpointed already. 
B: Exception is propagated and container fails
Result: Container is restarted and starts from the last checkpoint.

Note that this is only an issue when the commit() occurs concurrently with 
in-flight requests, so it doesn't affect the fully-synchronous mode or 
concurrent mode with synchronous commit().

Proposed solution: 
Take a snapshot of the offsets in the OffsetManager at the beginning of 
commit(). Only checkpoint those offsets and nothing new that has been sent 
since the commit() started.


> Race condition with async commit affects checkpoint correctness
> ---------------------------------------------------------------
>
>                 Key: SAMZA-1384
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1384
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jake Maes
>            Assignee: Jake Maes
>             Fix For: 0.14.0
>
>
> tl;dr if any in-flight request updates the offsets between producer.flush() 
> and offsetmanager.checkpoint() we could write a checkpoint for a message that 
> did not yet go out over the wire and could subsequently fail.
> Consider two threads A and B. A is performing an async commit. B is an 
> in-flight process(). The following sequence will cause data loss:
> A: TaskInstance.commit() begins
> A: producer.flush() is called // no new messages will go out in this batch
> B: producer.send() is called
> B: TaskCallback is invoked for the finished process()
> B: OffsetManager records the offset for the completed process()
> A: producer.flush() finishes
> A: checkpoint is written using the latest offsets from the OffsetManager. 
> This INCLUDES the offset for the latest send, which has not yet gone out over 
> the wire.
> A: TaskInstance.commit() finishes
> B: producer.send()->callback is invoked with an error. Send was unsuccessful, 
> but has been checkpointed already. 
> B: Exception is propagated and container fails
> Result: Container is restarted and starts from the last checkpoint.
> Note that this is only an issue when the commit() occurs concurrently with 
> in-flight requests, so it doesn't affect the fully-synchronous mode or 
> concurrent mode with synchronous commit().
> Proposed solution: 
> Take a snapshot of the offsets in the OffsetManager at the beginning of 
> commit(). Only checkpoint those offsets and nothing new that has been sent 
> since the commit() started.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (SAMZA-1384) Race condition with async commit affects checkpoint correctness

Reply via email to