Re: Resiliency with SparkStreaming - fileStream

2016-10-26 Thread Michael Armbrust
I'll answer in the context of structured streaming (the new streaming API built on DataFrames). When reading from files, the FileSource records which files are included in each batch inside of the given checkpointLocation. If you fail in the middle of a batch, the streaming engine will retry
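The mechanism described above can be sketched in miniature. This is not Spark's actual implementation, just a toy illustration of the idea: the set of files assigned to a batch is written to the checkpoint location *before* processing, so a retry after a crash re-reads the same plan instead of losing or duplicating files. All names here (`plan_batch`, the JSON layout) are invented for illustration.

```python
# Toy sketch (NOT Spark's real code) of write-ahead batch planning:
# the files belonging to each batch are recorded in a checkpoint
# directory, so a restarted job replays the identical batch.
import json
import os
import tempfile

def plan_batch(checkpoint_dir, batch_id, candidate_files):
    """Return the files for this batch, recording the plan durably."""
    path = os.path.join(checkpoint_dir, "%d.json" % batch_id)
    if os.path.exists(path):
        # Recovering after a failure: reuse the original plan so the
        # retried batch sees exactly the same files as the first attempt.
        with open(path) as f:
            return json.load(f)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(candidate_files, f)
    os.rename(tmp, path)  # atomic rename commits the batch plan
    return candidate_files

checkpoint = tempfile.mkdtemp()
first = plan_batch(checkpoint, 0, ["a.txt", "b.txt"])
# Simulate a crash mid-batch and a retry that now also sees c.txt:
retry = plan_batch(checkpoint, 0, ["a.txt", "b.txt", "c.txt"])
print(first == retry)  # True: the retry processes the originally planned files
```

The key design point, which the real FileSource shares, is that the plan is committed before any work happens, making the batch itself replayable; exactly-once delivery to an external endpoint still requires the sink side to be idempotent.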

Resiliency with SparkStreaming - fileStream

2016-10-26 Thread Scott W
Hello, I'm planning to use the fileStream Spark Streaming API to stream data from HDFS. My Spark job would essentially process these files and post the results to an external endpoint. *How does the fileStream API handle checkpointing of the files it has processed?* In other words, if my Spark job failed