Hi, I'm looking for the latest and greatest techniques and thoughts in stream deduplication and would love to know if anyone here has done this at scale. Specifically, I'm looking for deduping that also handles late-arriving messages.
In the past few days of searching, I've mostly come across ideas and implementations like:

- Batching the stream into non-overlapping time windows and deduping within each batch
- Possible improvements on the above technique using overlapping time windows
- HDFS-specific cases where a stream is consumed (pretty batchy), written to HDFS, and deduped there

My source is Kafka, if that helps.

Thanks,
Shiv
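
P.S. To make the windowed idea concrete, here is a rough sketch of the kind of thing I have in mind. It is not taken from any existing library: `WindowedDeduplicator`, `offer`, `window_ms`, and `retained_windows` are names made up for illustration. It assumes every message carries a unique ID and an event timestamp, and that a duplicate repeats both. Rather than true overlapping windows, it buckets IDs by event-time window and keeps the last few buckets around, so a late message can still be checked as long as its window has not been evicted.

```python
from typing import Dict, Hashable, Optional, Set


class WindowedDeduplicator:
    """Dedup by message ID, bucketing seen IDs by event-time window.

    Hypothetical helper for illustration only; not a Kafka or Akka API.
    State is kept in memory, so this covers a single consumer/partition.
    """

    def __init__(self, window_ms: int, retained_windows: int) -> None:
        if window_ms <= 0 or retained_windows < 1:
            raise ValueError("window_ms must be > 0 and retained_windows >= 1")
        self.window_ms = window_ms
        self.retained_windows = retained_windows
        self.buckets: Dict[int, Set[Hashable]] = {}
        self.newest_window: Optional[int] = None

    def _oldest_kept(self) -> int:
        # Index of the oldest window whose IDs are still remembered.
        return self.newest_window - self.retained_windows + 1

    def offer(self, msg_id: Hashable, event_time_ms: int) -> str:
        """Classify a message as 'new', 'duplicate', or 'too_late'."""
        window = event_time_ms // self.window_ms

        # Event time moved forward: advance and evict expired buckets.
        if self.newest_window is None or window > self.newest_window:
            self.newest_window = window
            oldest = self._oldest_kept()
            for expired in [w for w in self.buckets if w < oldest]:
                del self.buckets[expired]

        # The message's window was already evicted, so we can no longer
        # tell whether it is a duplicate; the caller decides what to do.
        if window < self._oldest_kept():
            return "too_late"

        seen = self.buckets.setdefault(window, set())
        if msg_id in seen:
            return "duplicate"
        seen.add(msg_id)
        return "new"


# Usage: 1-second windows, remembering the current and previous window.
dedup = WindowedDeduplicator(window_ms=1000, retained_windows=2)
first = dedup.offer("a", 100)     # first sighting of "a"
repeat = dedup.offer("a", 100)    # same ID and event time again
```

The trade-off is the usual one: a larger `retained_windows` tolerates later arrivals but holds more IDs in memory, and anything later than that has to be dropped, passed through, or deduped downstream (for example in the HDFS-style batch step).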