Re: Can Spark support exactly-once based on Kafka, given the following questions?

2016-12-05 Thread Cody Koeninger
Have you read / watched the materials linked from https://github.com/koeninger/kafka-exactly-once ?
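For anyone skimming the thread, the core pattern in those materials is to commit the Kafka offsets in the same transaction as the results. A minimal Scala sketch of that idea, assuming a ScalikeJDBC sink; the table names (word_counts, kafka_offsets) and their schemas are invented for illustration:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.HasOffsetRanges
import scalikejdbc._

// Store results and offsets atomically: after a failure the batch is
// replayed from the recorded offsets, and either both writes happened
// or neither did.
def saveWithOffsets(stream: InputDStream[ConsumerRecord[String, String]]): Unit =
  stream.foreachRDD { rdd =>
    // Offset ranges are only available on the RDD produced directly by
    // the Kafka source, before any shuffle re-partitions it.
    val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    val counts  = rdd.map(_.value).countByValue() // action, runs on the cluster
    DB.localTx { implicit session =>              // one driver-side transaction
      counts.foreach { case (word, n) =>
        sql"insert into word_counts(word, cnt) values ($word, $n)".update.apply()
      }
      offsets.foreach { o =>
        sql"""update kafka_offsets set until_offset = ${o.untilOffset}
              where topic = ${o.topic} and part = ${o.partition}""".update.apply()
      }
    }
  }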

Re: Can Spark support exactly-once based on Kafka, given the following questions?

2016-12-05 Thread Jörn Franke
You need to do the bookkeeping of what has been processed yourself. This may mean roughly the following (of course, the devil is in the details): write down in ZooKeeper which part of the processing job has been done and for which dataset all the data has been created (do not keep the data
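A rough sketch of that kind of ledger, assuming Apache Curator as the ZooKeeper client; the znode paths and the string batch id are invented for illustration:

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

object BatchLedger {
  private val client = CuratorFrameworkFactory.newClient(
    "zk-host:2181", new ExponentialBackoffRetry(1000, 3))
  client.start()

  // Skip work that an earlier (possibly failed-and-retried) run already did.
  def alreadyProcessed(batchId: String): Boolean =
    client.checkExists().forPath(s"/jobs/myjob/done/$batchId") != null

  // Create the marker only after all output for the batch is durable.
  def markProcessed(batchId: String): Unit = {
    client.create().creatingParentsIfNeeded()
      .forPath(s"/jobs/myjob/done/$batchId")
    ()
  }
}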

Re: Can Spark support exactly-once based on Kafka, given the following questions?

2016-12-05 Thread Piotr Smoliński
The boundary is a bit flexible. In terms of the observed DStream's effective state, the direct stream semantics are exactly-once. In terms of external-system observations (like message emission), Spark Streaming's semantics are at-least-once. Regards, Piotr
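One common way to close that gap is to make the emission itself idempotent, for example by keying every external write on (topic, partition, offset), so that an at-least-once redelivery skips instead of duplicating. A sketch assuming a PostgreSQL sink over plain JDBC; the events table and its unique key are invented for illustration:

import java.sql.DriverManager
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream

def writeIdempotently(stream: InputDStream[ConsumerRecord[String, String]]): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One connection per partition, opened on the executor.
      val conn = DriverManager.getConnection("jdbc:postgresql://db-host/events")
      val stmt = conn.prepareStatement(
        "insert into events(topic, part, off, payload) values (?, ?, ?, ?) " +
        "on conflict (topic, part, off) do nothing")
      try {
        records.foreach { r =>
          stmt.setString(1, r.topic)
          stmt.setInt(2, r.partition)
          stmt.setLong(3, r.offset)
          stmt.setString(4, r.value)
          stmt.executeUpdate() // a replay of the same offset hits the conflict clause
        }
      } finally {
        stmt.close()
        conn.close()
      }
    }
  }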

Re: Can Spark support exactly-once based on Kafka, given the following questions?

2016-12-04 Thread Michal Šenkýř
Hello John, 1. If a task completes the operation, it will notify the driver. The driver may not receive the message due to the network and think the task is still running. In that case, will the child stage never be scheduled? Spark's fault-tolerance policy is: if there is a problem in processing a task or
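For reference, the retry behavior being described is governed by a couple of Spark configuration keys; the key names below are real Spark settings, and the values shown are the usual 2.x defaults, given only as a sketch:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // How many times a single task may fail before its stage is aborted.
  .set("spark.task.maxFailures", "4")
  // How long the driver waits on a silent executor (no heartbeats)
  // before declaring it lost and re-scheduling its tasks.
  .set("spark.network.timeout", "120s")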

Can Spark support exactly-once based on Kafka, given the following questions?

2016-12-04 Thread John Fang
1. If a task completes the operation, it will notify the driver. The driver may not receive the message due to the network and think the task is still running. In that case, will the child stage never be scheduled? 2. How does Spark guarantee that the downstream task receives the shuffle data completely? In fact, I