Hi,

currently I am exploring Spark's fault tolerance capabilities with respect to 
fault recovery. Specifically, I run a Spark 2.1 standalone cluster with one 
master and four worker nodes. The application pulls data from a Kafka topic 
using the Kafka direct stream API over a (sliding) window of time, and writes 
the transformed data back to another topic. While the job is running, I use a 
bash script to randomly kill a Worker process, hoping to gain insight into RDD 
recovery from the log4j logs written by the Driver. However, apart from 
messages stating that a Worker has been lost, I cannot find any traces 
indicating that Spark is recovering the data lost with the killed Worker. 
Hence, I have the following questions: 

1. When using stateless transformations such as windowing, does Spark 
   checkpoint the data blocks themselves or just the RDD metadata? 
2. If only the metadata, is the state recovered from the memory of a Worker to 
   which the data has been replicated, or solely from the HDFS checkpoints? 
3. If Spark checkpoints and recovers only the metadata, how are exactly-once 
   processing semantics achieved? I refer to processing semantics, not output 
   semantics, as the latter would require writing the data to a transactional 
   data store. 
4. When using Write Ahead Logs, would Spark recover the data from them in 
   parallel instead of re-pulling the messages from Kafka? 
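
For reference, regarding the last question, the WAL-related setting I have 
been experimenting with is the following (property name taken from the Spark 
2.1 configuration docs; the comment reflects my current understanding, which 
may be wrong):

```
# spark-defaults.conf excerpt
# Enables the write ahead log for receiver-based input streams. As far as I
# understand, the Kafka direct stream does not use receivers, so I am unsure
# whether this setting applies to my setup at all.
spark.streaming.receiver.writeAheadLog.enable  true
```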

Thanks in advance for the clarification,
Dominik
