[ https://issues.apache.org/jira/browse/KAFKA-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107971#comment-16107971 ]
Apurva Mehta commented on KAFKA-5621:
-------------------------------------

Thanks for your comment Becket, and sorry for the delay in responding. My responses are inline:

{quote}Is it really different from applications and MM when a partition cannot make progress? It seems in both cases the users would want to know that at some point and handle it? I think retries are also for this purpose, otherwise we may block forever. If I understand right, what this ticket is proposing is just to extend the batch expiration time from request.timeout.ms to request.timeout.ms * retries. And KIP-91 proposes having an additional explicit configuration for that batch expiration time instead of deriving it from request timeout. They seem not quite different except that KIP-91 decouples the configurations from each other. {quote}

This is a good question. Let me try to explain my point of view in more detail.

When I talk about an 'application', I mean software which is using Kafka to solve some business problem. In this context, the partitions of a topic are more akin to an implementation detail to help with scaling throughput. From the point of view of application correctness, no partition can be left behind.

Of course, not all applications fit this profile, but a significant number do (for instance, many streams applications). For these applications, there should be a mode where Kafka does as much work as possible to ensure messages are delivered, because error handling is hard to reason about. For instance, an application-level resend might introduce duplicates, and writing de-dup infrastructure is expensive and error prone, so we might as well rely on Kafka to do dedup for the application as much as possible. This is the motivation for the proposal in the current JIRA.

This contrasts with the MirrorMaker use case. If MirrorMaker is replicating 1000 partitions, and there is some failure, it is still better to replicate 900 partitions rather than 0 partitions.

Having written all that, I think I agree with you that there is value in adding a config to control the maximum time to wait for an acknowledgement, essentially your {{expiry.ms}} config. It might be more intuitive to name it something like {{message.max.delivery.wait.ms}}. Further, we can enforce that this is set to a minimum of {{request.timeout.ms + linger.ms}}, which means that there would be at least one attempt to send the message when the producer isn't backed up. By default, we can leave it pretty high.

So, we would then have the following:

{{retries}} -- current meaning.
{{request.timeout.ms}} -- current meaning, but messages are not expired after this time.
{{message.max.delivery.wait.ms}} -- new config, controls how long to try to send messages before erroring them out.

I like this scheme. It doesn't expose users to the notion of accumulator queues (by avoiding any mention of 'batch'). It enables applications to delegate error handling to Kafka to the maximum possible extent (by setting {{retries=MAX_INT}} and {{message.max.delivery.wait.ms=MAX_LONG}}). And it enables MirrorMaker to bound the effect of unavailable partitions by setting {{message.max.delivery.wait.ms}} to be sufficiently low, presumably some function of the expected throughput in the steady state.

So in effect, I am in favor of KIP-91 with a few tweaks for the config name, its default value, and its semantics.

What do the rest of you think?

> The producer should retry expired batches when retries are enabled
> ------------------------------------------------------------------
>
>                 Key: KAFKA-5621
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5621
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Apurva Mehta
>            Assignee: Apurva Mehta
>             Fix For: 1.0.0
>
>
> Today, when a batch is expired in the accumulator, a {{TimeoutException}} is
> raised to the user.
> It might be better for the producer to retry the expired batch instead, up to
> the configured number of retries.
> This is more intuitive from the user's point of view.
> Further, the proposed behavior makes it easier for applications like
> MirrorMaker to provide ordering guarantees even when batches expire. Today, they
> would resend the expired batch and it would get added to the back of the
> queue, causing the output ordering to be different from the input ordering.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
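
To make the config scheme discussed in the comment concrete, here is a minimal sketch of how the proposed lower bound of {{request.timeout.ms + linger.ms}} might be checked. It uses only {{java.util.Properties}} and does not touch the Kafka client. Everything in it is an assumption for illustration: {{message.max.delivery.wait.ms}} is just the name floated in this thread, the class and method names are hypothetical, the fallback values of 30000 and 0 are placeholders, and "pretty high" is taken to mean {{Long.MAX_VALUE}}.

```java
import java.util.Properties;

// Sketch only: "message.max.delivery.wait.ms" is the name proposed in this
// thread, not an existing producer config. Class and method names are
// hypothetical, and the fallback defaults are assumptions for illustration.
final class DeliveryWaitConfig {
    static final String RETRIES = "retries";
    static final String REQUEST_TIMEOUT_MS = "request.timeout.ms";
    static final String LINGER_MS = "linger.ms";
    static final String MAX_DELIVERY_WAIT_MS = "message.max.delivery.wait.ms";

    // Lower bound from the proposal: one full request attempt plus linger time.
    static long minDeliveryWaitMs(long requestTimeoutMs, long lingerMs) {
        return Math.addExact(requestTimeoutMs, lingerMs);
    }

    // Returns the configured delivery wait, rejecting values below the bound.
    static long validatedDeliveryWaitMs(Properties props) {
        long requestTimeoutMs =
                Long.parseLong(props.getProperty(REQUEST_TIMEOUT_MS, "30000"));
        long lingerMs = Long.parseLong(props.getProperty(LINGER_MS, "0"));
        // "Pretty high" default, assumed here to be Long.MAX_VALUE.
        long waitMs = Long.parseLong(
                props.getProperty(MAX_DELIVERY_WAIT_MS, Long.toString(Long.MAX_VALUE)));
        long min = minDeliveryWaitMs(requestTimeoutMs, lingerMs);
        if (waitMs < min) {
            throw new IllegalArgumentException(
                    MAX_DELIVERY_WAIT_MS + "=" + waitMs + " is below the minimum " + min);
        }
        return waitMs;
    }

    // Application profile: delegate as much error handling as possible.
    static Properties applicationProfile() {
        Properties p = new Properties();
        p.setProperty(RETRIES, Integer.toString(Integer.MAX_VALUE));
        p.setProperty(MAX_DELIVERY_WAIT_MS, Long.toString(Long.MAX_VALUE));
        return p;
    }

    // MirrorMaker profile: bound the effect of unavailable partitions.
    static Properties mirrorMakerProfile(long deliveryWaitMs) {
        Properties p = new Properties();
        p.setProperty(MAX_DELIVERY_WAIT_MS, Long.toString(deliveryWaitMs));
        return p;
    }

    public static void main(String[] args) {
        System.out.println(validatedDeliveryWaitMs(applicationProfile()));
        System.out.println(validatedDeliveryWaitMs(mirrorMakerProfile(60_000L)));
    }
}
```

With the assumed fallbacks, a MirrorMaker-style wait of 60000 ms passes the check, while a value such as 10 ms would be rejected because it leaves no room for even one full request attempt.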