
I'm looking for the latest and greatest techniques and thoughts in stream
deduplication and would love to know if anyone here has done this at scale.
Specifically, I'm looking for deduping that also handles late-arriving
messages.

In the past few days of my search, I've mostly come across ideas and
implementations like

- Batching the stream based on time windows (non-overlapping) and deduping
within the batch
- Possible improvements on the above technique using overlapping time windows
- HDFS-specific cases where a stream is consumed (pretty batchy), written
to HDFS and deduped there
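
To make the windowing ideas above concrete, here is a rough sketch of how I
picture a sliding-window variant. It is not tied to Akka or Kafka, and every
name in it (WindowedDeduplicator, retention_seconds, accept) is made up for
illustration. It assumes each record carries a unique key and an event
timestamp, and that a duplicate carries the same key and timestamp as the
original. Records older than the retention horizon are simply dropped here,
which is only one possible policy for late arrivals.

```python
class WindowedDeduplicator:
    """Remembers keys seen within the last `retention_seconds` of event time."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.seen = {}                   # key -> event time of first sighting
        self.watermark = float("-inf")   # highest event time observed so far

    def accept(self, key, event_time):
        """Return True if the record should be emitted downstream.

        Returns False for duplicates and for records that arrive later
        than the retention horizon allows.
        """
        self.watermark = max(self.watermark, event_time)
        cutoff = self.watermark - self.retention

        # Too late: we may already have forgotten this key, so we cannot
        # tell whether it is a duplicate. This sketch drops it.
        if event_time < cutoff:
            return False

        self._evict(cutoff)

        if key in self.seen:
            return False  # duplicate inside the retention window

        self.seen[key] = event_time
        return True

    def _evict(self, cutoff):
        # Linear scan for clarity; a real implementation would want
        # time-bucketed state so eviction does not touch every key.
        expired = [k for k, t in self.seen.items() if t < cutoff]
        for k in expired:
            del self.seen[k]


dedup = WindowedDeduplicator(retention_seconds=10)
records = [("a", 100), ("a", 100), ("b", 95), ("c", 120), ("a", 100)]
emitted = [key for key, ts in records if dedup.accept(key, ts)]
```

The late record ("b", 95) is still accepted because it falls inside the
10-second horizon, while the final ("a", 100) is rejected because the
watermark has moved to 120 by then. The trade-off is the usual one: a longer
retention tolerates later arrivals but keeps more keys in memory.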

My source is Kafka, if that helps.

