I drafted an implementation outline in kafka-streams to address the
problem of sliding-window reordering (to cater for late messages within
the time window), it also caters for de-duplication:
https://stackoverflow.com/questions/43939534/apache-kafka-order-windowed-messages-based-on-their-value/44345374#44345374
You can implement something similar in akka-streams I believe.
First thing that comes to mind is to sink messages into a sorted map
(keyed by event-time timestamp and msg key pair) and then a new periodic
source picks them up - and connect the two with a
Flow.fromSinkAndSource. You'll need to take care of offset commits -
after the windowing de-duplication stage, i.e. on restart you don't want
to lose the messages buffered in the map.
Looking forward to ideas how to do this better.
Cheers,
MichaĆ
On 23/06/17 17:42, Shiva Ramagopal wrote:
Hi,
I'm looking for the latest and greatest techniques and thoughts in
stream deduplication and would love to know if anyone here has done
this at scale. Specifically, I'm looking for deduping that also
handles late-arriving messages.
In the past few days of my search, I've mostly come across ideas and
implementations like
- Batching the stream based on time windows (non-overlapping) and
deduping within the batch
- Possible improvements on the above technique using overlaping time
windows
- HDFS-specific cases where a stream is consumed (pretty batchy),
written to HDFS and deduped there
My source is Kafka, if that helps.
Thanks
Shiv
--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ:
http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google
Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected]
<mailto:[email protected]>.
To post to this group, send email to [email protected]
<mailto:[email protected]>.
Visit this group at https://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.
--
Signature
<http://www.openbet.com/> Michal Borowiecki
Senior Software Engineer L4
T: +44 208 742 1600
+44 203 249 8448
E: [email protected]
W: www.openbet.com <http://www.openbet.com/>
OpenBet Ltd
Chiswick Park Building 9
566 Chiswick High Rd
London
W4 5XT
UK
<https://www.openbet.com/email_promo>
This message is confidential and intended only for the addressee. If you
have received this message in error, please immediately notify the
[email protected] <mailto:[email protected]> and delete it
from your system as well as any copies. The content of e-mails as well
as traffic data may be monitored by OpenBet for employment and security
purposes. To protect the environment please do not print this e-mail
unless necessary. OpenBet Ltd. Registered Office: Chiswick Park Building
9, 566 Chiswick High Road, London, W4 5XT, United Kingdom. A company
registered in England and Wales. Registered no. 3134634. VAT no.
GB927523612
--
Read the docs: http://akka.io/docs/
Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.