[ 
https://issues.apache.org/jira/browse/KAFKA-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-15161:
-------------------------------------------

    Assignee: Viktor Somogyi-Vass

> InvalidReplicationFactorException at connect startup
> ----------------------------------------------------
>
>                 Key: KAFKA-15161
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15161
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients, KafkaConnect
>    Affects Versions: 3.6.0
>            Reporter: Viktor Somogyi-Vass
>            Assignee: Viktor Somogyi-Vass
>            Priority: Major
>
> h2. Problem description
> In our system test environment, Connect may occasionally fail to start up 
> because of a timing issue between the start/restart of the Kafka cluster and 
> of Connect. If a consumer in Connect starts and asks for topic metadata while 
> the broker doesn't have metadata yet, the broker returns an error and 
> Connect fails with the following exception:
> {noformat}
> [2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
> groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> org.apache.kafka.common.KafkaException: Unexpected error fetching metadata 
> for topic connect-offsets
>       at 
> org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
>       at 
> org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
>       at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
>       at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
>       at 
> org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
>       at 
> org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
>       at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
>       at 
> org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
>       at 
> org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
>       at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: 
> Replication factor is below 1 or larger than the number of available brokers.
> {noformat}
> Due to this error the Connect node stops and has to be restarted manually 
> (and of course the test scenarios fail as well).
> h2. Reproduction
> In my test scenario I had:
> - 1 broker
> - 1 connect distributed node
> - I also had a patch that I applied on the broker to make sure we don't have 
> metadata
> Steps to repro:
> # start up a ZooKeeper-based broker without the patch
> # put a breakpoint here: 
> https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
> # start up a distributed Connect node
> # restart the Kafka broker with the patch to make sure there is no metadata
> # once the broker is started, release the debugger in Connect
> It should run into the error cited above and shut down.
> This is not desirable: the Connect cluster should retry to ensure its 
> continuous operation, or the broker should handle this case differently, 
> for instance by returning a RetriableException.
> The earliest version I've tried this on is 2.8, but I think it affects 
> earlier (and later) versions as well.
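
The client-side retry suggested above could look roughly like the sketch below. It is not the actual Connect or consumer code and uses no Kafka classes: MetadataRetry, callWithRetry, hasCause and TransientMetadataException are hypothetical names, with TransientMetadataException standing in for the InvalidReplicationFactorException seen as the cause in the stack trace. The attempt limit and backoff values are arbitrary assumptions.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

class MetadataRetry {

    /** Hypothetical stand-in for the broker-side error that caused the failure. */
    static class TransientMetadataException extends RuntimeException {
        TransientMetadataException(String message) {
            super(message);
        }
    }

    /** Returns true if the throwable itself or any of its causes is of the given type. */
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (type.isInstance(cur)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Runs the call, retrying with a fixed backoff while the failure is
     * considered retriable and attempts remain. Any other failure, or the
     * failure of the last attempt, is rethrown to the caller.
     */
    static <T> T callWithRetry(Callable<T> call, Predicate<Exception> retriable,
                               int maxAttempts, long backoffMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts || !retriable.test(e)) {
                    throw e;
                }
                Thread.sleep(backoffMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated metadata lookup: fails twice with a wrapped transient error,
        // then succeeds, similar to a broker that has no metadata yet.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("Unexpected error fetching metadata",
                        new TransientMetadataException("broker has no metadata yet"));
            }
            return "metadata ok";
        }, e -> hasCause(e, TransientMetadataException.class), 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The predicate walks the cause chain because, in the trace above, the broker error is wrapped in a KafkaException rather than thrown directly. Whether the retry belongs in Connect or the broker should instead return a retriable error is the open question of this ticket; the sketch only illustrates the first option.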



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
