Qinghui Xu created KAFKA-8790:
---------------------------------
Summary: [kafka-connect] KafkaBaseLog.WorkThread not recoverable
Key: KAFKA-8790
URL: https://issues.apache.org/jira/browse/KAFKA-8790
Project: Kafka
Issue Type: Bug
Reporter: Qinghui Xu
We have a kafka (source) connector that's copying data from some kafka cluster
to the target cluster. The connector is deployed to a bunch of workers running
on mesos, thus the lifecycle of the workers are managed by mesos. Workers
should be recovered by mesos in case of failure, and then source tasks will
rely on kafka connect's KafkaOffsetBackingStore to recover the offsets to
proceed.
Recently we witness some unrecoverable situation, though: worker is not doing
anything after some network reset on the host where the worker is running. More
specifically, it seems that the kafka connect tasks' on that worker stop to
poll source kafka cluster, because the consumers are stuck in a rebalance state.
After some digging, we found that the thread to handle the source task offset
recovery is dead, which makes the all rebalancing tasks stuck in the state of
reading back the offset. The log we saw in our connect task:
{code:java}
2019-08-12 14:29:28,089 ERROR Unexpected exception in Thread[KafkaBasedLog Work
Thread - kc_replicator_offsets,5,main]
(org.apache.kafka.connect.util.KafkaBasedLog)
org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by times
in 30001ms{code}
As far as I can see
([https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/util/KafkaBasedLog.java#L339]),
the thread will be dead in case of error, while the worker is still alive,
which means a worker without the thread to recover offset thus all tasks on
that worker are not recoverable and will stuck in case of failure.
Solution to fix this issue will ideally either of the following:
* Make the KafkaBasedLog Work Thread recoverable from error
* Or KafkaBasedLog Work Thread death should make the worker exit (a finally
clause to call System.exit), then the worker lifecycle management (in our case,
it's mesos) will restart the worker elsewhere
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)