[ 
https://issues.apache.org/jira/browse/KAFKA-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-15161:
----------------------------------------
    Description: 
h2. Problem description

In our system test environment, in certain cases, Connect may fail to start up 
due to a very specific timing of the Kafka cluster and Connect start/restart. 
In these cases, the broker doesn't have metadata yet when a consumer in Connect 
starts and asks for topic metadata, so the broker returns the following 
exception and the worker fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
        at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
        at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
        at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
        at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
        at 
org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
        at 
org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
        at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
        at 
org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
        at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
        at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: 
Replication factor is below 1 or larger than the number of available brokers.
{noformat}

Due to this error the Connect node stops and has to be manually restarted 
(and of course it fails the test scenarios as well).

h2. Reproduction

In my test scenario I had:
- 1 broker
- 1 distributed Connect node
- a patch applied on the broker to make sure it doesn't have metadata

Steps to repro:
# start up a broker without the patch
# put a breakpoint here: 
https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
# start up a distributed Connect node
# restart the Kafka broker with the patch applied to make sure there is no 
metadata
# once the broker has started, release the debugger in Connect

It should run into the error cited above and shut down.

This is not desirable: the Connect cluster should retry to ensure continuous 
operation, or the broker should handle this case differently, for instance by 
returning a RetriableException.
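As a sketch of the retry approach (hypothetical helper name, stand-in exception class instead of the real org.apache.kafka.common.errors.InvalidReplicationFactorException; this is not actual Connect code), a bounded retry around the metadata lookup would keep the herder thread alive until the broker has metadata:

```java
import java.util.List;
import java.util.concurrent.Callable;

public class MetadataRetrySketch {
    // Stand-in for org.apache.kafka.common.errors.InvalidReplicationFactorException
    static class InvalidReplicationFactorException extends RuntimeException {
        InvalidReplicationFactorException(String msg) { super(msg); }
    }

    // Hypothetical helper: instead of letting the herder work thread die on the
    // first InvalidReplicationFactorException, retry the lookup a bounded number
    // of times with a backoff between attempts.
    static <T> T retryMetadataLookup(Callable<T> lookup, int maxAttempts, long backoffMs)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return lookup.call();
            } catch (InvalidReplicationFactorException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the configured number of attempts
                }
                Thread.sleep(backoffMs); // broker may not have metadata yet
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a broker that has no metadata for the first two requests.
        final int[] calls = {0};
        List<Integer> partitions = retryMetadataLookup(() -> {
            if (++calls[0] < 3) {
                throw new InvalidReplicationFactorException(
                        "Replication factor is below 1 or larger than the number of available brokers.");
            }
            return List.of(0, 1, 2); // partitions eventually returned for connect-offsets
        }, 5, 10L);
        System.out.println(partitions.size() + " " + calls[0]);
    }
}
```

In KafkaBasedLog.start the equivalent change would wrap the partitionsFor call, treating this exception as retriable in the same way other transient startup conditions are handled.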

The earliest version I've tried this on is 2.8, but I think it affects earlier 
(and later) versions as well.
It also seems that some full metadata requests succeed during startup and only 
the partial (per-topic) metadata request fails, hence the first start of the 
broker with metadata followed by the restart without it (to simulate this case).

  was:
h2. Problem description

In our system test environment in certain cases due to a very specific timing 
issue Connect may fail to start up. the problem lies in the very specific 
timing of a Kafka cluster and connect start/restart. In these cases while the 
broker doesn't have metadata and a consumer in connect starts and asks for 
topic metadata, it returns the following exception and fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
        at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
        at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
        at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
        at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
        at 
org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
        at 
org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
        at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
        at 
org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
        at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
        at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: 
Replication factor is below 1 or larger than the number of available brokers.
{noformat}

Due to this error the connect node stops and it has to be manually restarted 
(and ofc it fails the test scenarios as well).

h2. Reproduction

In my test scenario I had:
- 1 broker
- 1 connect distributed node
- I also had a patch that I applied on the broker to make sure we don't have 
metadata

Steps to repro:
# start up a broker without the patch (this can be reproduced in both ZK and 
KRaft mode)
# put a breakpoint here: 
https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
# start up a distributed connect node
# restart the kafka broker with the patch to make sure there is no metadata
# once the broker is started, release the debugger in connect

It should run into the error cited above and shut down.

This is not desirable, the connect cluster should retry to ensure its 
continuous operation or the broker should handle this case somehow differently, 
for instance by returning a RetriableException.

The earliest I've tried this is 2.8 but I think this affects versions before 
that as well (and after).
Also it seems like some full metadata requests succeed during startup and it's 
only the partial metadata request that fails, hence the first start of the 
broker with metadata and then the restart without it (to simulate this case).


> InvalidReplicationFactorException at connect startup
> ----------------------------------------------------
>
>                 Key: KAFKA-15161
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15161
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, KafkaConnect
>    Affects Versions: 3.6.0
>            Reporter: Viktor Somogyi-Vass
>            Assignee: Viktor Somogyi-Vass
>            Priority: Major
>         Attachments: empty_metadata.patch
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
