chickenchickenlove commented on code in PR #20159:
URL: https://github.com/apache/kafka/pull/20159#discussion_r2205941778


##########
clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java:
##########
@@ -1225,7 +1230,9 @@ private ClusterAndWaitTime waitOnMetadata(String topic, 
Integer partition, long
                 if (metadata.getError(topic) != null) {
                     throw new TimeoutException(errorMessage, 
metadata.getError(topic).exception());
                 }
-                throw new TimeoutException(errorMessage);
+                if (ex.getCause() != null)
+                    throw new TimeoutException(errorMessage, ex.getCause());
+                throw new TimeoutException(errorMessage, new 
PotentialCauseException("Metadata update timed out ― topic missing, auth 
denied, broker/partition unavailable, or client sender/buffer stalled."));

Review Comment:
   @chia7712 thanks for your comment. 🙇‍♂️ 
   I agree your opinion especially that 'we could enrich the error message 
instead of introducing a new nonspecific exception'. 
   However, Introducing a new nonspecific exception to solve this issue has has 
pros and cons.
   
   Let me explain more. 
   
   The `KAFKA-17019` requires that all 
`org.apache.kafka.common.errors.TimeoutException` should include root cause as 
a nested exception.
   
   ### It’s hard to pinpoint the root cause in every situation where a 
`TimeoutException` occurs
   
   
https://github.com/apache/kafka/blob/c6cf5175f62f78c266329cc87d697cc632e498fd/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L405-L418
   
   In here, `expiredInflightBatches` and `expiredBatches` can cause 
`TimeoutException`.
   However, `TimeoutException` is thrown as a result of calculating elapsed 
time. 
   Therefore, `expiredInflightBatches` and `expiredBatches` don't have 
information about root cause. 
   
   If `expiredInflight` encounters some problem --
   1. Normal Case - No errors and succeed. 
   2. Network Issue - Error might occurs after the expired period.
   3. a little bit slow network issue because of Bandwidth, and so on. - No 
errors and slow response.
   4. Busy CPU due to various reasons - No errors and called slowly.
   
   There are many possible scenarios where a `TimeoutException` can occur, even 
with a simple analysis.
   However, the `expiredInFlight` instance can actually throw an error in only 
2 (Maybe connection closed, fail to establish connection, ...) 
   
   After spending a considerable amount of time reviewing the code and 
analyzing the `TimeoutExceptions` thrown by the Producer, I concluded that it's 
difficult to extract the root cause at every point where a `TimeoutException` 
is created.
   
   ### Idea + Pros and Cons
   Developers usually have a good understanding of what kind of error might 
occur in a given context.
   Therefore, in cases where it's difficult to catch the actual root cause, 
it's possible to include an expected exception as the root cause instead.
   
   Here’s a summary of the pros and cons (compared to simply enhancing the 
error message):
   - pros : At call sites(For example, `kafka streams`, `kafka connect`, `kafka 
producer internal` and so on) where a `TimeoutException` is expected, the root 
cause can be used to handle different cases conditionally.
     -   For example, you could create `NetworkPotentialCauseException` and 
`CPUBusyPotentialCauseException` as subclasses of `PotentialCauseException`, 
and handle different branches based on the root cause—branch A if the root 
cause is a `NetworkPotentialCauseException`, and branch B if it’s a 
`CPUBusyPotentialCauseException`.
   - cons : Higher instance creation cost compared to `String` instance.
   
   
   I spent quite a bit of time thinking through the direction of this PR.
   What are your thoughts on it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to