Have you read / watched the materials linked from
https://github.com/koeninger/kafka-exactly-once
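For reference, a minimal sketch of the kind of approach discussed there: read the per-batch offset ranges from the Kafka direct stream and store them in the same transaction as the results, so a replayed batch can be detected and skipped. The broker address, topic name, and the saveResultsAndOffsets helper below are placeholders of mine, not anything taken from that repo.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, KafkaUtils, OffsetRange}

object ExactlyOnceSketch {
  // Hypothetical sink: must write the results and the offsets in ONE transaction
  def saveResultsAndOffsets(counts: scala.collection.Map[String, Long],
                            offsets: Array[OffsetRange]): Unit = ()

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("exactly-once"), Seconds(5))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",            // assumed local broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "exactly-once-example",
      "enable.auto.commit" -> (false: java.lang.Boolean)  // offsets are tracked by us, not Kafka
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Offset ranges are only available on the RDD produced directly by the stream
      val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val counts = rdd.map(_.value).countByValue()
      // Commit results + offsets atomically; on a replay the transaction becomes a no-op
      saveResultsAndOffsets(counts, offsets)
    }
    ssc.start()
    ssc.awaitTermination()
  }
}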
On Mon, Dec 5, 2016 at 4:17 AM, Jörn Franke wrote:
You need to do the bookkeeping of what has been processed yourself. This may
mean roughly the following (of course the devil is in the details):
Write down in ZooKeeper which part of the processing job has been done and for
which dataset all the data has been created (do not keep the data ...
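A minimal sketch of that kind of bookkeeping, using Apache Curator against ZooKeeper. The znode layout under /myjob/completed and the connect string are assumptions of mine, not anything prescribed above.

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

object BatchBookkeeping {
  private val zk = CuratorFrameworkFactory.newClient(
    "localhost:2181", new ExponentialBackoffRetry(1000, 3))
  zk.start()

  private def path(batchId: String) = s"/myjob/completed/$batchId"

  // Mark a batch done only after all of its output has been written
  def markCompleted(batchId: String): Unit =
    zk.create().creatingParentsIfNeeded().forPath(path(batchId))

  // On restart, skip batches that were already fully processed
  def isCompleted(batchId: String): Boolean =
    zk.checkExists().forPath(path(batchId)) != null
}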
The boundary is a bit flexible. In terms of the observed DStream's effective
state, the direct stream's semantics are exactly-once.
In terms of external-system observations (like message emission), Spark
Streaming's semantics are at-least-once.
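One common way to live with that is to make the output operation idempotent, so a re-emitted batch overwrites the same rows instead of appending duplicates. A rough sketch; upsert is a hypothetical sink of mine, not a Spark API:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.DStream

object IdempotentOutput extends Serializable {
  // Hypothetical sink: writing the same key twice leaves a single row
  def upsert(key: String, value: String): Unit = ()

  def writeIdempotently(stream: DStream[ConsumerRecord[String, String]]): Unit =
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach { r =>
          // Key derived from the Kafka coordinates of the message, so a
          // replayed batch rewrites the same keys rather than duplicating them
          upsert(s"${r.topic}-${r.partition}-${r.offset}", r.value)
        }
      }
    }
}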
Regards,
Piotr
On Mon, Dec 5, 2016 at 8:59 AM, Michal Šenkýř wrote:
Hello John,
1. If a task completes its operation, it will notify the driver. The driver
may not receive the message due to the network and may still think the task is
running. Then the child stage won't be scheduled?
Spark's fault-tolerance policy is that if there is a problem in processing a
task or ...
1. If a task completes its operation, it will notify the driver. The driver may not
receive the message due to the network and may still think the task is running.
Then the child stage won't be scheduled?
2. How does Spark guarantee that the downstream task can receive the shuffle data
completely? In fact, I ...