Hi,
My Spark Streaming application is pulling data from Kafka. To prevent data
loss I have implemented the WAL and enabled checkpointing. On killing my
application and restarting it, I am now able to prevent data loss, but
I am getting duplicate messages.
Is it because the application got killed?
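For reference, a minimal sketch of the receiver-based setup described above (checkpointing plus the write ahead log). The checkpoint path, ZooKeeper address, consumer group, and topic name are placeholders, not values from the original message:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical checkpoint directory; use a reliable store such as HDFS.
val checkpointDir = "hdfs:///checkpoints/my-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("KafkaWALExample")
    // Enable the write ahead log so received data survives driver failure.
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)

  // Receiver-based Kafka stream (ZooKeeper quorum, group id, topic -> threads).
  val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group",
    Map("mytopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
  stream.count().print()
  ssc
}

// On restart, recover the context from the checkpoint instead of creating
// a fresh one; this is what replays the WAL data.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```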
Exactly, that is the reason.
To avoid that, in the soon-to-be-released Spark 1.3 we have added a new Kafka
API (called the direct stream) which does not use ZooKeeper at all to keep
track of progress; it maintains the offsets within Spark Streaming itself.
That can guarantee all records are received exactly once.
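A minimal sketch of that direct stream API as it appears in Spark 1.3; the broker address and topic name here are placeholders. Because offsets are tracked inside Spark Streaming rather than in ZooKeeper, recovery after a failure replays exact offset ranges instead of re-delivering already-processed records:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("DirectKafkaExample")
val ssc = new StreamingContext(conf, Seconds(10))

// Brokers are contacted directly; no ZooKeeper quorum is needed here.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("mytopic")

// Each RDD partition corresponds to a Kafka topic partition and a
// well-defined offset range, which is what makes recovery deterministic.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()
```

Note that this makes receipt exactly-once; end-to-end exactly-once output still requires idempotent or transactional writes downstream.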
Subject: Re: Write ahead Logs and checkpoint