[ https://issues.apache.org/jira/browse/KAFKA-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107971#comment-16107971 ]

Apurva Mehta commented on KAFKA-5621:
-------------------------------------

Thanks for your comment, Becket, and sorry for the delay in responding. My 
responses are inline:

{quote}Is it really different from applications and MM when a partition cannot 
make progress? It seems in both cases the users would want to know that at some 
point and handle it? I think retries are also for this purpose, otherwise we 
may block forever. If I understand right, what this ticket is proposing is just 
to extend the batch expiration time from request.timeout.ms to 
request.timeout.ms * retries. And KIP-91 proposes having an additional explicit 
configuration for that batch expiration time instead of deriving it from 
request timeout. They seem not quite different except that KIP-91 decouples the 
configurations from each other.
{quote}

This is a good question. Let me try to explain my point of view in more detail. 

When I talk about an 'application', I mean software which is using Kafka to 
solve some business problem. In this context, the partitions of a topic are 
more akin to an implementation detail to help with scaling throughput. From the 
point of view of application correctness, no partition can be left behind. 

Of course, not all applications fit this profile, but a significant number do 
(for instance, many streams applications). And for these applications, there 
should be a mode where Kafka does as much work as possible to ensure messages 
are delivered, because error handling is hard to reason about. For instance, an 
application-level resend might introduce duplicates, and writing de-dup 
infrastructure is expensive and error prone, so applications might as well rely 
on Kafka to do the dedup for them as much as possible. This is the motivation 
for the proposal in the current JIRA.
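
To make that concrete, here is a minimal sketch of what a "delegate as much as 
possible to Kafka" producer configuration could look like with the current 
client. The broker address and serializers are placeholders, and the values are 
illustrative rather than recommended defaults:

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class MaxEffortProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address and serializers.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // The idempotent producer de-dups retried batches on the broker, so the
        // application does not need its own de-dup infrastructure.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry transient failures indefinitely instead of surfacing them.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.close();
    }
}
{code}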

This contrasts with the MirrorMaker use case. If MirrorMaker is replicating 
1000 partitions, and there is some failure, it is still better to replicate 900 
partitions rather than 0 partitions. 

Having written all that, I think I agree with you that there is value in adding a 
config to control the maximum time to wait for an acknowledgement, essentially 
your {{expiry.ms}} config. It might be more intuitive to name it something like 
{{message.max.delivery.wait.ms}}. Further, we can enforce that this is set to a 
minimum of {{request.timeout.ms + linger.ms}}, which means that there would be at 
least one attempt to send the message when the producer isn't backed up. By 
default, we can leave it pretty high. 
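
As a sketch of that lower bound (assuming the proposed config name, which does 
not exist in the producer today), the validation could look something like:

{code:java}
import org.apache.kafka.common.config.ConfigException;

public class DeliveryWaitValidation {
    // Sketch only: "message.max.delivery.wait.ms" is the name proposed above,
    // not an existing producer config.
    static void validate(long maxDeliveryWaitMs, long requestTimeoutMs, long lingerMs) {
        if (maxDeliveryWaitMs < requestTimeoutMs + lingerMs) {
            throw new ConfigException("message.max.delivery.wait.ms must be at least "
                    + "request.timeout.ms + linger.ms (= " + (requestTimeoutMs + lingerMs)
                    + " ms) so that at least one send attempt is made before expiry");
        }
    }
}
{code}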

So, we would then have the following: 

{{retries}} -- current meaning.
{{request.timeout.ms}} -- current meaning, except that batches would no longer 
be expired based on this timeout.
{{message.max.delivery.wait.ms}} -- new config, controls how long to try to 
send messages before erroring them out.

I like this scheme. It doesn't expose users to the notion of accumulator queues 
(by avoiding any mention of 'batch'). It enables applications to delegate error 
handling to Kafka to the maximum possible extent (by setting 
{{retries=MAX_INT}} and {{message.max.delivery.wait.ms=MAX_LONG}}). And it 
enables MirrorMaker to bound the effect of unavailable partitions by setting 
{{message.max.delivery.wait.ms}} to be sufficiently low, presumably some 
function of the expected throughput in the steady state.
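
For illustration, the two ends of that spectrum might look like the following 
(again using the proposed config name, so treat this as a sketch rather than a 
working configuration):

{code:java}
import java.util.Properties;

public class DeliveryWaitExamples {
    public static void main(String[] args) {
        // Application that delegates error handling to Kafka as far as possible.
        Properties appProps = new Properties();
        appProps.put("retries", Integer.MAX_VALUE);
        appProps.put("message.max.delivery.wait.ms", Long.MAX_VALUE);

        // MirrorMaker-style pipeline that bounds the impact of a stuck partition,
        // e.g. to a couple of minutes derived from expected steady-state throughput.
        Properties mmProps = new Properties();
        mmProps.put("retries", Integer.MAX_VALUE);
        mmProps.put("message.max.delivery.wait.ms", 120000L); // illustrative bound
    }
}
{code}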

So in effect, I am in favor of KIP-91 with a few tweaks to the config name, 
its default value, and its semantics. What do the rest of you think?




> The producer should retry expired batches when retries are enabled
> ------------------------------------------------------------------
>
>                 Key: KAFKA-5621
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5621
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Apurva Mehta
>            Assignee: Apurva Mehta
>             Fix For: 1.0.0
>
>
> Today, when a batch is expired in the accumulator, a {{TimeoutException}} is 
> raised to the user.
> It might be better for the producer to retry the expired batch up to the 
> configured number of retries. This is more intuitive from the user's point of 
> view. 
> Further, the proposed behavior makes it easier for applications like mirror 
> maker to provide ordering guarantees even when batches expire. Today, they 
> would resend the expired batch and it would get added to the back of the 
> queue, causing the output ordering to be different from the input ordering.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
