swuferhong opened a new issue, #2110:
URL: https://github.com/apache/fluss/issues/2110

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/fluss/issues) and 
found nothing similar.
   
   
   ### Fluss version
   
   0.8.0 (latest release)
   
   ### Please describe the bug 🐞
   
   Flink lookups intermittently time out while the Fluss cluster is being 
upgraded. Once a timeout occurs, the Flink job fails. Increasing 
`table.exec.async-lookup.timeout` does not prevent this, no matter how large 
the value is.
   
   The error is as follows:
   ```
   java.lang.Exception: Could not complete the stream element: Record @ (undef) : +I(xxx)
        at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.completeExceptionally(AsyncWaitOperator.java:636)
        at org.apache.flink.streaming.api.functions.async.AsyncFunction.timeout(AsyncFunction.java:97)
        at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.timerTriggered(AsyncWaitOperator.java:654)
        at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$registerTimeout$1(AsyncWaitOperator.java:649)
        at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.lambda$registerTimer$2(AsyncWaitOperator.java:433)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invokeProcessingTimeCallback(StreamTask.java:2186)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$deferCallbackToMailbox$27(StreamTask.java:2177)
        at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
        at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:101)
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:414)
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:383)
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:368)
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:1202)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:1146)
        at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:976)
        at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:955)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:768)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:580)
        at java.base/java.lang.Thread.run(Thread.java:991)
   Caused by: java.util.concurrent.TimeoutException: Async function call has timed out.
        ... 19 more
   ```
   
   The root cause is still unknown, but there are two likely possibilities:
   1. During an upgrade, pods are recreated and their IP addresses change, 
which may cause metadata requests to take longer.
   2. The Netty connection timeout (`client.connect-timeout`) is set to 120 
seconds. If the client sends a request to an IP address that no longer exists, 
but to which it previously had an established connection, it may wait the full 
120 seconds before timing out.
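
To make the second possibility concrete, here is a minimal, self-contained sketch in plain Java. It is not Fluss or Flink code: `LookupTimeoutSketch`, `hangingLookup`, and `completesInTime` are hypothetical names, the connection stall is simulated with `Thread.sleep`, and the durations are scaled down to milliseconds. It only shows that an async call fails with a `TimeoutException` whenever the simulated stall outlasts the async deadline. It does not explain why the failure would persist with a very large `table.exec.async-lookup.timeout`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LookupTimeoutSketch {

    /** Hypothetical stand-in for a lookup whose connection attempt stalls for stallMillis. */
    static CompletableFuture<String> hangingLookup(long stallMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                // Simulates waiting on a connection to an address that no longer exists.
                Thread.sleep(stallMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "row";
        });
    }

    /** Returns true if the lookup finishes before the async deadline, false if the deadline fires first. */
    static boolean completesInTime(long stallMillis, long asyncTimeoutMillis)
            throws InterruptedException {
        try {
            hangingLookup(stallMillis)
                    .orTimeout(asyncTimeoutMillis, TimeUnit.MILLISECONDS) // Java 9+
                    .get();
            return true;
        } catch (ExecutionException e) {
            if (e.getCause() instanceof TimeoutException) {
                return false; // the async deadline expired while the lookup was still stalled
            }
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) throws Exception {
        // Stall (1200 ms) outlasts the async deadline (600 ms): the call times out.
        System.out.println(completesInTime(1200, 600)); // prints false
        // Stall (100 ms) is shorter than the deadline (600 ms): the call succeeds.
        System.out.println(completesInTime(100, 600));  // prints true
    }
}
```

If this hypothesis holds, the job would fail whenever the client-side stall exceeds the configured async lookup timeout, so the ratio between the two values is worth checking when reproducing the issue.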
   
   ### Solution
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!

