Hello Spark Experts,

I have a design question regarding Spark Streaming. I have a streaming job that consumes protocol-buffer-encoded real-time logs from an on-premise Kafka cluster. My Spark application runs on EMR (AWS) and persists data onto S3. Before persisting, I need to strip a header and convert the protobuf payload to Parquet (I use sparksql-scalapb to convert from protobuf to spark.sql.Row). I need to persist the raw logs as-is.

I could continue the enrichment on the same DataFrame after persisting the raw data; however, in order to modularize, I am planning to have a separate job which picks up the raw data and performs the enrichment on it. I am also trying to avoid an all-in-one job because the enrichments could get project-specific, while the raw-data persistence stays customer/project agnostic. The enriched data is allowed to have some latency (a few minutes).
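To make the setup concrete, here is roughly the shape of job 1. This is only a simplified sketch assuming Structured Streaming: RawLog (the generated ScalaPB class), the header length, the topic, and the bucket paths are all placeholders, not my actual names.

import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions._
import scalapb.spark.Implicits._  // encoders for ScalaPB-generated messages (sparksql-scalapb)
// import com.example.logs.RawLog // hypothetical ScalaPB-generated message class

object RawPersistJob {
  val HeaderLen = 8  // assumed fixed-size header; adjust to the actual log framing

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("raw-persist").getOrCreate()

    val kafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder
      .option("subscribe", "raw-logs")                    // placeholder topic
      .load()

    // Strip the header, then decode the protobuf payload into the generated class.
    val logs = kafka
      .select(col("value")).as[Array[Byte]](Encoders.BINARY)
      .map(bytes => RawLog.parseFrom(bytes.drop(HeaderLen)))

    // Stamp the ingest time and derive the YYYYMMDD partition column from it.
    val query = logs.toDF()
      .withColumn("ingest_time", current_timestamp())
      .withColumn("dt", date_format(col("ingest_time"), "yyyyMMdd"))
      .writeStream
      .format("parquet")
      .partitionBy("dt")
      .option("path", "s3://my-bucket/raw/")                    // placeholder
      .option("checkpointLocation", "s3://my-bucket/chk/raw/")  // placeholder
      .start()

    query.awaitTermination()
  }
}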
My challenge is: after persisting the raw data, how do I chain the next streaming job? The only way I can think of is for job 1 (raw data) to partition on the current date (YYYYMMDD), and, within the current date, for job 2 (the enrichment job) to filter for records within 60s of the current time and perform the enrichment on them in 60s batches. Is this a good option? It seems error-prone: if either job gets delayed due to bursts or an error/exception, this could lead to huge data losses and non-deterministic behavior. What are the alternatives to this? Appreciate any guidance in this regard.
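Here is a sketch of what I mean by job 2, run as a batch on a 60-second schedule. Again, the paths are placeholders, and ingest_time is the column job 1 stamps in the sketch above, not an existing field.

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object EnrichJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("enrich").getOrCreate()

    val today = LocalDate.now.format(DateTimeFormatter.BASIC_ISO_DATE)  // yyyyMMdd

    // Read only today's partition, then keep rows ingested in the last 60s.
    // This filter is the fragile part: if either job falls behind, rows slide
    // out of the 60s window and are silently skipped.
    val recent = spark.read
      .parquet(s"s3://my-bucket/raw/dt=$today")  // placeholder path
      .filter(col("ingest_time").cast("long") >= unix_timestamp() - 60)

    // ...project-specific enrichment would go here...

    recent.write.mode("append").parquet("s3://my-bucket/enriched/")  // placeholder
  }
}

regards,
Sunita Koppar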