[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569248#comment-15569248
 ] 

Siyuan Hua commented on APEXMALHAR-2283:
----------------------------------------

There are a couple of solutions to "exactly-once". To me they are all different 
and they all make different assumptions. 
First of all, if we assume there is a unique message id, we can definitely use 
that for dedup. But that is not always the case, and then we need the appid and 
operatorid to do dedup.
This information can then be stored in either the key or an extra topic. I 
wouldn't say either of them is better than the other; it really depends on the 
user's requirements.
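
As a rough illustration of the key-based variant, here is a minimal sketch. It 
assumes a dedup tag is carried in the message key; the tag layout 
(appid|operatorid|windowId|index), the class name and the method names are all 
hypothetical and are not part of the existing operator. Only appid and 
operatorid come from the discussion above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: dedup tags carried in the message key. */
final class KeyTagDedup {

  /** Assumed tag layout: appId|operatorId|windowId|indexInWindow. */
  static String tag(String appId, int operatorId, long windowId, int index) {
    return appId + "|" + operatorId + "|" + windowId + "|" + index;
  }

  /**
   * Given the tags already found in the topic and the tags of the tuples
   * replayed after recovery, return only the tags that still need writing.
   */
  static List<String> pending(List<String> alreadyWritten, List<String> replayed) {
    Set<String> seen = new HashSet<>(alreadyWritten);
    List<String> out = new ArrayList<>();
    for (String t : replayed) {
      if (seen.add(t)) { // add() returns false when the tag was already present
        out.add(t);
      }
    }
    return out;
  }
}
```

The same tags could just as well be written to an extra topic instead of the 
key; the filtering step would look the same.
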
And no matter what we do, as long as a number of operators write to the 
same kafka partition, I'm afraid there is no way to do perfect dedup, because 
either we don't know the safe place to start dedup from, or it is too late to 
do dedup (too much noise after the safe place if other operator instances are 
fully loaded).
I don't remember the hashcode approach exactly, but I think a hashcode will 
produce some false positives?
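
On the hashcode question, a small sketch, assuming the hashcode approach means 
remembering only the hash codes of written messages (the class and method names 
are hypothetical). Two different strings can share a hash code in Java, so the 
second message is wrongly treated as a duplicate:

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: dedup that keeps only hash codes of written messages. */
final class HashDedupFalsePositive {

  /** Returns true if the message looks as if it was already written. */
  static boolean looksDuplicate(Set<Integer> seenHashes, String message) {
    // add() returns false when the hash code is already in the set
    return !seenHashes.add(message.hashCode());
  }

  public static void main(String[] args) {
    Set<Integer> seen = new HashSet<>();
    System.out.println(looksDuplicate(seen, "Aa")); // false, first message
    // "Aa" and "BB" both have String.hashCode() == 2112
    System.out.println(looksDuplicate(seen, "BB")); // true, although "BB" was never written
  }
}
```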

And there is another solution: create the topic and its number of partitions 
automatically based on the kafka operator instances. In this case it is much 
easier; we are always sure which messages need to be deduped, because one 
operator only writes to one kafka partition. This solution, in my 
understanding, is the most reliable, and some users might want it. But the 
metadata of that kafka topic is kind of automatically created, and it's very 
hard to support dynamic partitioning in this case.
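
A minimal sketch of why the one-operator-per-partition case is easier. The 
partition is modeled as a plain list, and the sketch assumes the operator 
checkpoints its partition offset and that tuples are replayed in the same order 
after a failure; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: recovery when a partition has exactly one writer. */
final class SingleWriterRecovery {

  /**
   * partitionLog: messages in the partition, all written by this operator.
   * checkpointOffset: size of the log recorded at the last checkpoint.
   * replayed: tuples re-sent since that checkpoint, in the original order.
   * Appends only the tuples missing from the log; returns how many were skipped.
   */
  static int recover(List<String> partitionLog, int checkpointOffset, List<String> replayed) {
    // Everything past the checkpoint was written by us before the failure.
    int alreadyWritten = partitionLog.size() - checkpointOffset;
    if (alreadyWritten < 0 || alreadyWritten > replayed.size()) {
      throw new IllegalStateException("partition tail does not match the replay");
    }
    for (int i = 0; i < alreadyWritten; i++) {
      if (!partitionLog.get(checkpointOffset + i).equals(replayed.get(i))) {
        throw new IllegalStateException("replay diverged at index " + i);
      }
    }
    // Write only what is not in the partition yet.
    partitionLog.addAll(replayed.subList(alreadyWritten, replayed.size()));
    return alreadyWritten;
  }
}
```

With several writers on the same partition the tail after the checkpoint would 
be interleaved with other operators' messages, so this positional comparison 
would no longer work.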

That's the reason why I say there is no general solution for an exactly-once 
kafka output operator. We may need to provide different solutions in "examples" 
for people to choose from. 

Anyway, Sandesh, can you wrap up the current solution, post it, and discuss it 
on the mailing list?

 

> Refactor kafka output operator
> ------------------------------
>
>                 Key: APEXMALHAR-2283
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2283
>             Project: Apache Apex Malhar
>          Issue Type: Improvement
>            Reporter: Siyuan Hua
>            Assignee: Siyuan Hua
>
> The abstract kafka output operator needs to be refactored
> 1. Needs to set some mandatory properties on operator level instead of kafka 
> property level.
> 2. More document and examples
> 3. Find a standard way to achieve exactly once in both 0.8 and 0.9
> More will be added when working on the ticket



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
