Re: Spark streaming to kafka exactly once
Ok, thanks for your answers.

On 3/22/17, 1:34 PM, "Cody Koeninger" wrote:

> If you're talking about reading the same message multiple times in a
> failure situation, see https://github.com/koeninger/kafka-exactly-once
>
> If you're talking about producing the same message multiple times in a
> failure situation, keep an eye on
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
>
> If you're talking about producers just misbehaving and sending different
> copies of what is essentially the same message from a domain perspective,
> you have to dedupe that with your own logic.
Re: Spark streaming to kafka exactly once
If you're talking about reading the same message multiple times in a failure situation, see https://github.com/koeninger/kafka-exactly-once

If you're talking about producing the same message multiple times in a failure situation, keep an eye on https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging

If you're talking about producers just misbehaving and sending different copies of what is essentially the same message from a domain perspective, you have to dedupe that with your own logic.

On Wed, Mar 22, 2017 at 2:52 PM, Matt Deaver wrote:

> You have to handle de-duplication upstream or downstream. It might
> technically be possible to handle this in Spark, but you'll probably have
> a better time handling duplicates in the service that reads from Kafka.
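For the producing case, KIP-98 proposes a transactional producer API (it later shipped in Kafka 0.11). A minimal sketch of what that looks like, assuming a 0.11+ client on the classpath; the broker address, topic, keys, and transactional.id below are illustrative placeholders, not anything from this thread:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")    // placeholder
props.put("transactional.id", "spark-out-txn-1")    // placeholder; unique per producer instance
props.put("key.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.initTransactions() // fences any older producer using the same transactional.id

try {
  producer.beginTransaction()
  producer.send(new ProducerRecord[String, String]("output-topic", "key", "value"))
  producer.commitTransaction() // all sends become visible atomically, or not at all
} catch {
  case e: Exception =>
    producer.abortTransaction() // aborted sends are skipped by read_committed consumers
    throw e
} finally {
  producer.close()
}

Paired with consumers configured with isolation.level=read_committed, an aborted batch is never observed downstream.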
Re: Spark streaming to kafka exactly once
You have to handle de-duplication upstream or downstream. It might technically be possible to handle this in Spark, but you'll probably have a better time handling duplicates in the service that reads from Kafka.

On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart wrote:

> Hi,
>
> We are trying to build a Spark Streaming solution that subscribes to
> Kafka and pushes back to Kafka, but we are running into a problem of
> duplicate events.
>
> Right now, I am doing a "foreachRDD", looping over the messages of each
> partition, and sending those messages to Kafka.
>
> Is there any good way of solving that issue?
>
> Thanks

--
Regards,

Matt
Data Engineer
https://www.linkedin.com/in/mdeaver
http://mattdeav.pythonanywhere.com/
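One common shape for the downstream de-duplication described above, as a minimal sketch: the reading service keeps a bounded window of recently seen event ids and drops re-deliveries. The unique id field and the window size are assumptions here; nothing in this thread specifies either.

import scala.collection.mutable

// Sketch of consumer-side dedupe, assuming events expose a unique id.
// The window is bounded so memory stays flat; a duplicate arriving after
// its id is evicted would slip through, so size the window to cover the
// expected redelivery horizon.
class SeenWindow(capacity: Int = 100000) {
  private val seen = mutable.LinkedHashSet.empty[String]

  // Returns true the first time an id is observed within the window.
  def firstTime(id: String): Boolean = synchronized {
    if (seen.contains(id)) false
    else {
      seen += id
      if (seen.size > capacity) seen -= seen.head // evict oldest id
      true
    }
  }
}

// In the reading service, process only records passing the filter, e.g.:
//   records.filter(r => window.firstTime(r.key))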
Spark streaming to kafka exactly once
Hi,

We are trying to build a Spark Streaming solution that subscribes to Kafka and pushes back to Kafka, but we are running into a problem of duplicate events.

Right now, I am doing a "foreachRDD", looping over the messages of each partition, and sending those messages to Kafka.

Is there any good way of solving that issue?

Thanks
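For reference, the pattern described above looks roughly like the following sketch; the stream, topic, and broker address are placeholder names, and the producer is built per partition (a lazily created per-executor singleton would amortize that cost across batches).

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// `stream` stands in for the subscribed DStream; "output-topic" and the
// broker address are placeholders.
def pushToKafka(stream: DStream[String]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { messages =>
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092") // placeholder
      props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      messages.foreach { msg =>
        producer.send(new ProducerRecord[String, String]("output-topic", msg))
      }
      producer.close() // flushes buffered sends before the task ends
    }
  }
}

Note that when a task fails and is retried, this loop re-runs from the start of the partition and resends messages that may already have been delivered, which is exactly where the duplicates discussed in this thread come from.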