Jun Yao created KAFKA-4736:
------------------------------
Summary: producer failed too slow when meta request failed
Key: KAFKA-4736
URL: https://issues.apache.org/jira/browse/KAFKA-4736
Project: Kafka
Issue Type: Bug
Components: producer
Reporter: Jun Yao
This might be similar to https://issues.apache.org/jira/browse/KAFKA-4385, but
it happens in a different case.
In some cases, as tested, the producer may get invalid metadata (and it stays
invalid for as long as there are issues on the broker side),
so every call to KafkaProducer.send() spends 60 seconds (with the default
configuration) in KafkaProducer.waitOnMetadata() and then throws
TimeoutException("Failed to update metadata after 60000 ms").
So when something is wrong with a topic and the producer cannot get the
metadata for it, the producer is effectively blocked by this topic for
60 seconds, which also impacts sending to other topics.
For cases where we want to use the "Callback" to save the data of failed
requests somewhere else and retry later, the Callback is also only called
every 60 seconds. So if the upstream keeps receiving data and calling
producer.send(),
it will soon block, or buffer too much in memory if the upstream keeps a
buffer of data before calling producer.send().
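A minimal sketch of the "save failed data and retry later" idea, assuming a
hypothetical FailedRecordStore helper (not part of the Kafka client) that a
Callback.onCompletion(metadata, exception) implementation would call when the
exception is non-null. The store is bounded so that a long run of failures
cannot grow memory without limit; what to do when it is full is left to the
application.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical helper, not part of the Kafka client API.
final class FailedRecordStore {
    private final BlockingQueue<String> failed;

    FailedRecordStore(int capacity) {
        this.failed = new ArrayBlockingQueue<>(capacity);
    }

    // Intended to be called from Callback.onCompletion when exception != null.
    // Never blocks; returns false when the store is already full.
    boolean save(String payload) {
        return failed.offer(payload);
    }

    // Next payload to retry, or null when nothing is pending.
    String nextToRetry() {
        return failed.poll();
    }

    int pending() {
        return failed.size();
    }
}
```

This only bounds the memory used for failed data. It does not make the
Callback fire any sooner than the 60 seconds described above.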
It looks to me like KafkaProducer.send() fails too slowly (it does not fail
fast) when something is wrong with a topic or broker, and it stays in this slow
failure state.
I am not sure if reducing "max.block.ms" is the right way to avoid this,
since when metadata changes the producer needs some time to get the updated
metadata, and with auto topic creation it also needs enough time to wait for
the topic to be created, as noted in [~sslavic]'s comment.
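For reference, a minimal sketch of lowering "max.block.ms", assuming a
hypothetical helper producerProps and placeholder values chosen by the caller.
The config keys are the standard producer ones; the resulting Properties would
be passed to the KafkaProducer constructor. Whether a lower value is safe
depends on the metadata refresh and topic creation concerns just mentioned.

```java
import java.util.Properties;

// Hypothetical helper for illustration; only the config keys come from Kafka.
final class ProducerBlockConfig {
    static Properties producerProps(String bootstrapServers, long maxBlockMs) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Upper bound on how long send() may block, including the
        // wait for metadata (default is 60000 ms).
        props.put("max.block.ms", Long.toString(maxBlockMs));
        return props;
    }
}
```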
I am wondering if a better way of defining and using RetriableException
would help here, perhaps with some support from the server side, so that when
an exception is not retriable the client does not waste time
retrying.
Or maybe consider my proposal in
https://issues.apache.org/jira/browse/KAFKA-4385 to add another config that
limits the consecutive failures on one topic.
Or maybe some adaptive behavior where the block time is decreased after
some consecutive failures.
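A rough sketch of how the last two ideas could look on the application side,
assuming a hypothetical TopicFailureTracker class (nothing like it exists in
the Kafka client, and the halving policy is only one possible choice). It
counts consecutive failures per topic, reports when a configured limit is
reached, and halves the suggested block time after each consecutive failure
down to a minimum.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of the Kafka client API.
final class TopicFailureTracker {
    private final long baseBlockMs;
    private final long minBlockMs;
    private final int maxConsecutiveFailures;
    private final Map<String, Integer> failures = new ConcurrentHashMap<>();

    TopicFailureTracker(long baseBlockMs, long minBlockMs, int maxConsecutiveFailures) {
        this.baseBlockMs = baseBlockMs;
        this.minBlockMs = minBlockMs;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Call when a send to this topic fails (e.g. metadata timeout).
    void recordFailure(String topic) {
        failures.merge(topic, 1, Integer::sum);
    }

    // Call when a send to this topic succeeds; resets the streak.
    void recordSuccess(String topic) {
        failures.remove(topic);
    }

    // True once the topic has failed maxConsecutiveFailures times in a row.
    boolean shouldFailFast(String topic) {
        return failures.getOrDefault(topic, 0) >= maxConsecutiveFailures;
    }

    // Halve the block time per consecutive failure, never below minBlockMs.
    long blockTimeMs(String topic) {
        int streak = Math.min(failures.getOrDefault(topic, 0), 62);
        return Math.max(minBlockMs, baseBlockMs >> streak);
    }
}
```

With a base of 60000 ms, two consecutive failures on a topic would give a
suggested block time of 15000 ms, while other topics keep the full 60000 ms.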
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)