If you use Structured Streaming and the file sink, you can have a subsequent streaming job read the same directory with the file source. Because the file sink only commits files atomically via its checkpoint/metadata log, the downstream file source sees each record exactly once, even across hiccups or failures.
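Here is a minimal sketch of that chaining, assuming placeholder broker, topic, S3 paths, and a trivial stand-in enrichment step; the actual protobuf decoding with sparksql-scalapb is left as a comment since that part is specific to your schema:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object RawThenEnrichSketch {

  // Job 1: Kafka -> raw Parquet on S3 via the file sink.
  // With a stable checkpointLocation, the sink tracks which files were committed,
  // which is what gives the downstream file source its exactly-once behavior.
  def rawJob(spark: SparkSession): Unit = {
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
      .option("subscribe", "raw-logs")                    // placeholder topic
      .load()
      // value is still the protobuf payload here; strip the header / decode with
      // sparksql-scalapb before or after persisting, as in your current design.
      .select(
        col("value"),
        date_format(col("timestamp"), "yyyyMMdd").as("ingest_date"))

    raw.writeStream
      .format("parquet")
      .option("path", "s3://my-bucket/raw/")                    // placeholder path
      .option("checkpointLocation", "s3://my-bucket/chk/raw/")  // placeholder path
      .partitionBy("ingest_date")
      .start()
  }

  // Job 2: a separate streaming query that uses the raw directory as a file source.
  // It only picks up files job 1 has committed, so no 60-second time-window
  // filtering is needed on the reader side.
  def enrichmentJob(spark: SparkSession): Unit = {
    val rawSchema = new StructType()
      .add("value", BinaryType)
      .add("ingest_date", StringType)

    val rawStream = spark.readStream
      .schema(rawSchema)                 // file sources need an explicit schema
      .parquet("s3://my-bucket/raw/")

    val enriched: DataFrame = rawStream  // project-specific enrichment goes here
      .withColumn("enriched_at", current_timestamp())

    enriched.writeStream
      .format("parquet")
      .option("path", "s3://my-bucket/enriched/")
      .option("checkpointLocation", "s3://my-bucket/chk/enriched/")
      .start()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("raw-then-enrich-sketch").getOrCreate()
    rawJob(spark)
    enrichmentJob(spark)
    spark.streams.awaitAnyTermination()
  }
}

The two queries can live in separate applications (your "modularized" jobs); the only contract between them is the raw Parquet path and its schema.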
On Mon, Aug 21, 2017 at 2:02 PM, Sunita Arvind <sunitarv...@gmail.com> wrote:
> Hello Spark Experts,
>
> I have a design question w.r.t. Spark Streaming. I have a streaming job
> that consumes protocol-buffer-encoded real-time logs from an on-premise
> Kafka cluster. My Spark application runs on EMR (AWS) and persists data
> onto S3. Before I persist, I need to strip the header and convert the
> protobuf to Parquet (I use sparksql-scalapb to convert from protobuf to
> Spark.sql.Row). I need to persist the raw logs as is. I could continue the
> enrichment on the same dataframe after persisting the raw data; however,
> in order to modularize, I am planning to have a separate job which picks up
> the raw data and performs enrichment on it. I am also trying to avoid an
> all-in-one job, as the enrichments could get project specific while raw
> data persistence stays customer/project agnostic. The enriched data is
> allowed to have some latency (a few minutes).
>
> My challenge is: after persisting the raw data, how do I chain the next
> streaming job? The only way I can think of is - job 1 (raw data)
> partitions on the current date (YYYYMMDD), and within the current date,
> job 2 (the enrichment job) filters for records within 60s of the current
> time and performs enrichment on them in 60s batches.
> Is this a good option? It seems error prone. If either of the jobs gets
> delayed due to bursts or any error/exception, this could lead to huge data
> losses and non-deterministic behavior. What are other alternatives to
> this?
>
> Appreciate any guidance in this regard.
>
> regards
> Sunita Koppar