[
https://issues.apache.org/jira/browse/SAMZA-541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297718#comment-14297718
]
Jae Hyeon Bae commented on SAMZA-541:
-------------------------------------
I won't touch the KafkaSystemProducer implementation. This feature is needed for
our data-pipeline use case. If it cannot be supported, I might have to switch my
SystemProducer implementation back to a StreamTask.
> Passing coordinator to producer chains
> --------------------------------------
>
> Key: SAMZA-541
> URL: https://issues.apache.org/jira/browse/SAMZA-541
> Project: Samza
> Issue Type: Improvement
> Components: container
> Reporter: Jae Hyeon Bae
> Assignee: Jae Hyeon Bae
>
> StreamTask can control SamzaContainer.commit() through the task coordinator,
> but SystemProducer can call flush() without commit(), which can create
> duplicate data on container failure. For example,
> T=30s; flush
> T=45s; flush
> T=60s; flush && commit
> T=65s; flush
> "If there's a failure before 60s, the messages that were flushed at 30s and
> 45s will be duplicated when the container reprocesses" (quoted from
> [~criccomini]'s response).
> If SystemProducer can access the TaskCoordinator created by the RunLoop in
> SamzaContainer, it will have the flexibility to control exactly-once delivery.
> The following interface should be changed:
> {code}
> MessageCollector.send(envelope, coordinator)
> SystemProducers.send(source, envelope, coordinator)
> SystemProducer.send(source, envelope, coordinator)
> {code}
> The Kafka 0.8 new Java producer cannot be synchronized with
> TaskInstance.commit because it does not do batch flushing. Whether
> exactly-once delivery can be guaranteed therefore depends on the
> SystemProducer implementation.
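The proposed coordinator-aware send path can be sketched as follows. This is a minimal illustration, not actual Samza code: BufferingProducer, Envelope, and Coordinator are hypothetical stand-ins for OutgoingMessageEnvelope, TaskCoordinator, and a SystemProducer implementation. The idea is that a producer which sees the coordinator on every send can defer flushing until a commit is requested, so flushed data is always covered by a checkpoint and is not replayed as duplicates on failure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a coordinator-aware producer chain.
// Names are illustrative stand-ins, not real Samza classes.
class BufferingProducer {
    // Stand-in for Samza's OutgoingMessageEnvelope.
    static class Envelope {
        final String payload;
        Envelope(String payload) { this.payload = payload; }
    }

    // Stand-in for the relevant slice of TaskCoordinator.
    interface Coordinator {
        boolean isCommitRequested();
    }

    private final List<Envelope> buffer = new ArrayList<>();
    private final List<String> delivered = new ArrayList<>();

    // Proposed signature: the coordinator rides along with every send,
    // letting the producer align its flushes with task commits.
    public void send(String source, Envelope envelope, Coordinator coordinator) {
        buffer.add(envelope);
        // Flush only when the task is about to commit, so everything
        // flushed is checkpointed and never reprocessed as a duplicate.
        if (coordinator.isCommitRequested()) {
            flush(source);
        }
    }

    public void flush(String source) {
        for (Envelope e : buffer) {
            delivered.add(source + ":" + e.payload);
        }
        buffer.clear();
    }

    public List<String> getDelivered() { return delivered; }
    public int getBuffered() { return buffer.size(); }
}
```

Under this sketch, the T=30s and T=45s flushes from the example above simply would not happen: messages stay buffered until the coordinator signals commit at T=60s.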
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)