[jira] [Created] (KAFKA-4736) producer failed too slow when meta request failed

Jun Yao (JIRA) Sun, 05 Feb 2017 11:04:22 -0800

Jun Yao created KAFKA-4736:
------------------------------

             Summary: producer failed too slow when meta request failed
                 Key: KAFKA-4736
                 URL: https://issues.apache.org/jira/browse/KAFKA-4736
             Project: Kafka
          Issue Type: Bug
          Components: producer 
            Reporter: Jun Yao



This might be similar to https://issues.apache.org/jira/browse/KAFKA-4385 but 
happens in a different case. 
In some cases we tested, the producer may get invalid metadata (and it stays 
invalid for as long as there are issues on the broker side), 

so every call to KafkaProducer.send() spends 60 seconds (with the default 
configuration) in KafkaProducer.waitOnMetadata() and then throws 
TimeoutException("Failed to update metadata after 60000 ms"). 

So when something is wrong with a topic and the producer cannot get that 
topic's metadata, the producer is effectively blocked by this topic for 
60 seconds, which also delays sending to other topics. 

For cases where we want to use the "Callback" to save the data of failed 
requests in a different place and retry later, the Callback is also only 
called every 60 seconds. So if upstream keeps receiving data and calling 
producer.send(), it will soon either block or buffer too much in memory, if 
upstream keeps a buffer of data before calling producer.send(). 

It looks to me like KafkaProducer.send() fails too slowly (it does not fail 
fast) when something is wrong with a topic/broker, and it stays in this slow 
failure state. 

I am not sure if reducing "max.block.ms" is the right way to avoid this, 
since when metadata changes the producer needs some time to get the updated 
metadata, and with auto topic creation it also needs enough time to wait for 
the topic to be created. 
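
For reference, a minimal sketch of how "max.block.ms" would be lowered on the 
client side. The helper class name, the 5000 ms value and the bootstrap 
address are illustrative assumptions; only the "max.block.ms" and 
"bootstrap.servers" keys are actual producer configs. 

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of the Kafka client.
final class ProducerBlockConfig {

    // Builds producer properties with an explicit upper bound on how long
    // send() may block, for example while waiting for metadata.
    static Properties withMaxBlockMs(long maxBlockMs) {
        if (maxBlockMs < 0) {
            throw new IllegalArgumentException("maxBlockMs must be >= 0");
        }
        Properties props = new Properties();
        // Placeholder address (assumption); replace with the real broker list.
        props.setProperty("bootstrap.servers", "localhost:9092");
        // The default is 60000 ms, which matches the 60 second wait above.
        props.setProperty("max.block.ms", Long.toString(maxBlockMs));
        return props;
    }

    public static void main(String[] args) {
        Properties props = withMaxBlockMs(5000L);
        System.out.println(props.getProperty("max.block.ms"));
    }
}
```

A real producer would also need serializer settings before these properties 
could be passed to KafkaProducer. A lower value makes send() fail sooner, but 
it also shortens the time allowed for legitimate metadata refreshes. 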

Following [~sslavic]'s comment, 
I am wondering if a better way of defining and using RetriableException 
would help here. It may need some support from the server side, so that some 
exceptions are marked as not retriable and the client does not waste time 
retrying them. 

Or maybe consider my proposal on 
https://issues.apache.org/jira/browse/KAFKA-4385 to add another config that 
limits the consecutive failures on one topic. 
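
A rough sketch of what such a limit could look like, written as an 
application-side wrapper rather than as a producer config. The class 
TopicFailureGate and its methods are hypothetical names for illustration and 
do not exist in the Kafka client. 

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration; not part of the Kafka client.
final class TopicFailureGate {
    private final int maxConsecutiveFailures;
    private final Map<String, AtomicInteger> failures = new ConcurrentHashMap<>();

    TopicFailureGate(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be >= 1");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Returns true while the topic is still below the failure limit.
    boolean shouldAttempt(String topic) {
        AtomicInteger count = failures.get(topic);
        return count == null || count.get() < maxConsecutiveFailures;
    }

    // Call when a send to this topic failed, e.g. with a TimeoutException.
    void recordFailure(String topic) {
        failures.computeIfAbsent(topic, t -> new AtomicInteger()).incrementAndGet();
    }

    // Call when a send to this topic succeeded; clears the failure streak.
    void recordSuccess(String topic) {
        failures.remove(topic);
    }
}
```

The caller would check shouldAttempt(topic) before producer.send() and fail 
fast for that topic once the limit is reached, so other topics are not 
delayed. This sketch never reopens a closed topic by itself; a real version 
would need some probe or cool-down to call recordSuccess() again. 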

Or maybe some adaptive behavior, where the block time is decreased after 
some consecutive failures. 
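
One possible shape of that adaptive behavior, as a sketch only: halve the 
block time after each consecutive failure, down to a floor, and restore the 
full value after a success. The class AdaptiveBlockTime, the halving policy 
and the floor are my assumptions, not existing producer behavior. 

```java
// Hypothetical helper for illustration; not part of the Kafka client.
final class AdaptiveBlockTime {
    private final long maxBlockMs;
    private final long minBlockMs;
    private int consecutiveFailures = 0;

    AdaptiveBlockTime(long maxBlockMs, long minBlockMs) {
        if (minBlockMs <= 0 || minBlockMs > maxBlockMs) {
            throw new IllegalArgumentException("require 0 < minBlockMs <= maxBlockMs");
        }
        this.maxBlockMs = maxBlockMs;
        this.minBlockMs = minBlockMs;
    }

    // Block time for the next attempt: maxBlockMs halved once per
    // consecutive failure, but never below minBlockMs.
    synchronized long currentBlockMs() {
        int shift = Math.min(consecutiveFailures, 62);
        return Math.max(minBlockMs, maxBlockMs >> shift);
    }

    synchronized void recordFailure() {
        if (consecutiveFailures < Integer.MAX_VALUE) {
            consecutiveFailures++;
        }
    }

    // A success restores the full block time.
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    public static void main(String[] args) {
        AdaptiveBlockTime blockTime = new AdaptiveBlockTime(60000L, 1000L);
        for (int i = 0; i < 8; i++) {
            System.out.println("attempt " + i + " blocks up to "
                    + blockTime.currentBlockMs() + " ms");
            blockTime.recordFailure();
        }
    }
}
```

With a 60000 ms maximum and a 1000 ms floor, the wait shrinks to 30000, 
15000, 7500 ms and so on, so a persistently broken topic stops costing a full 
60 seconds per send() while a healthy producer keeps the full wait. 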

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
