Jorn,
Thanks for the response. My downstream database is Kudu.
1. Yes. As you suggested, I have been using a central caching mechanism
that caches the RDD results and compares them against the next batch,
keeping the records with the latest timestamps and ignoring the older
ones (a sketch of the idea follows below). But, I see
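
For concreteness, a minimal sketch of that comparison (spark-shell style;
the Record shape and all the names are illustrative, not from this thread):

import org.apache.spark.rdd.RDD
import scala.collection.concurrent.TrieMap

case class Record(key: String, timestamp: Long, payload: String)

// Central cache: newest timestamp already processed, per key. An
// in-process map only works on a single node; a shared store (see
// option 1 below) would be needed across executors.
val latestSeen = TrieMap.empty[String, Long]

// Keep only the newest record per key within the batch, then drop
// anything older than what earlier batches already produced.
def dedupBatch(batch: RDD[Record]): Array[Record] = {
  val snapshot = latestSeen.toMap            // immutable copy for the closure
  val fresh = batch
    .map(r => (r.key, r))
    .reduceByKey((a, b) => if (a.timestamp >= b.timestamp) a else b)
    .values
    .filter(r => r.timestamp > snapshot.getOrElse(r.key, Long.MinValue))
    .collect()                               // assumed small enough per batch
  fresh.foreach(r => latestSeen.update(r.key, r.timestamp)) // advance cache
  fresh
}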
What DB do you have?
You have some options, such as:
1) use a key-value store (they can be accessed very efficiently) to check
whether a newer timestamp for the key has already been processed - if yes,
ignore the value; if no, insert it into the database (see the sketch after
this list)
2) redesign the key to include the timestamp and find out the
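
A rough sketch of option 1; the in-process map below merely stands in for
a real external key-value store (Redis, HBase, ...), and maybeUpsert is an
illustrative name, not an existing API:

import java.util.concurrent.ConcurrentHashMap

// Stand-in for the external key-value store; in production this would
// be a shared client, not a map local to one JVM.
val seen = new ConcurrentHashMap[String, java.lang.Long]()

// Run `write` only if no newer timestamp has been processed for this
// key; otherwise the incoming record is stale and is ignored.
def maybeUpsert(key: String, ts: Long)(write: => Unit): Unit = {
  val prev = Option(seen.get(key)).map(_.longValue)
  if (prev.forall(_ < ts)) {
    seen.put(key, ts)   // remember the newest timestamp for the key
    write               // upsert into the downstream database
  }
}

Note that two concurrent writers can still race between the get and the
put; a real store would need a conditional write (compare-and-set) to
close that window.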
Hi All,
I am using Spark 2.2.0 and have the below use case:
*Reading from Kafka using Spark Streaming and updating (not just
inserting) the records in a downstream database*
I understand that the order in which Spark Streaming reads messages from
Kafka will not match the timestamp order in which they were stored in the
Kafka partitions.
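
For reference, a minimal sketch of the shape of such a job (Spark 2.2
DStream API); the broker address, topic, and the "key,timestamp,payload"
value format are placeholder assumptions, and the actual Kudu upsert
(e.g. via kudu-spark's KuduContext.upsertRows) is left out:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaToKudu {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-to-kudu"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",               // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "kudu-updater")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Assume each message value is "key,timestamp,payload".
      val parsed = rdd.map(_.value.split(",", 3))
        .map { case Array(k, ts, p) => (k, (ts.toLong, p)) }
      // Keep only the newest record per key within this micro-batch so
      // an older message cannot overwrite a newer one on upsert.
      val newest = parsed.reduceByKey((a, b) => if (a._1 >= b._1) a else b)
      newest.foreachPartition(_.foreach(println)) // stand-in for the upsert
    }

    ssc.start()
    ssc.awaitTermination()
  }
}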