chickenchickenlove commented on code in PR #20159: URL: https://github.com/apache/kafka/pull/20159#discussion_r2205941778
########## clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java: ########## @@ -1225,7 +1230,9 @@ private ClusterAndWaitTime waitOnMetadata(String topic, Integer partition, long if (metadata.getError(topic) != null) { throw new TimeoutException(errorMessage, metadata.getError(topic).exception()); } - throw new TimeoutException(errorMessage); + if (ex.getCause() != null) + throw new TimeoutException(errorMessage, ex.getCause()); + throw new TimeoutException(errorMessage, new PotentialCauseException("Metadata update timed out ― topic missing, auth denied, broker/partition unavailable, or client sender/buffer stalled.")); Review Comment: @chia7712 thanks for your comment. 🙇♂️ I agree your opinion especially that 'we could enrich the error message instead of introducing a new nonspecific exception'. However, Introducing a new nonspecific exception to solve this issue has has pros and cons. Let me explain more. The `KAFKA-17019` requires that all `org.apache.kafka.common.errors.TimeoutException` should include root cause as a nested exception. ### It’s hard to pinpoint the root cause in every situation where a `TimeoutException` occurs https://github.com/apache/kafka/blob/c6cf5175f62f78c266329cc87d697cc632e498fd/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L405-L418 In here, `expiredInflightBatches` and `expiredBatches` can cause `TimeoutException`. However, `TimeoutException` is thrown as a result of calculating elapsed time. Therefore, `expiredInflightBatches` and `expiredBatches` don't have information about root cause. If `expiredInflight` encounters some problem -- 1. Normal Case - No errors and succeed. 2. Network Issue - Error might occurs after the expired period. 3. a little bit slow network issue because of Bandwidth, and so on. - No errors and slow response. 4. Busy CPU due to various reasons - No errors and called slowly. There are many possible scenarios where a `TimeoutException` can occur, even with a simple analysis. However, the `expiredInFlight` instance can actually throw an error in only 2 (Maybe connection closed, fail to establish connection, ...) After spending a considerable amount of time reviewing the code and analyzing the `TimeoutExceptions` thrown by the Producer, I concluded that it's difficult to extract the root cause at every point where a `TimeoutException` is created. ### Idea + Pros and Cons Developers usually have a good understanding of what kind of error might occur in a given context. Therefore, in cases where it's difficult to catch the actual root cause, it's possible to include an expected exception as the root cause instead. Here’s a summary of the pros and cons (compared to simply enhancing the error message): - pros : At call sites(For example, `kafka streams`, `kafka connect`, `kafka producer internal` and so on) where a `TimeoutException` is expected, the root cause can be used to handle different cases conditionally. - For example, you could create `NetworkPotentialCauseException` and `CPUBusyPotentialCauseException` as subclasses of `PotentialCauseException`, and handle different branches based on the root cause—branch A if the root cause is a `NetworkPotentialCauseException`, and branch B if it’s a `CPUBusyPotentialCauseException`. - cons : Higher instance creation cost compared to `String` instance. I spent quite a bit of time thinking through the direction of this PR. What are your thoughts on it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org