swuferhong opened a new issue, #2110: URL: https://github.com/apache/fluss/issues/2110
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/fluss/issues) and found nothing similar.

### Fluss version

0.8.0 (latest release)

### Please describe the bug

Flink lookups intermittently time out while the Fluss cluster is being upgraded. Once a timeout occurs, the Flink job fails. Increasing `table.exec.async-lookup.timeout` does not prevent this, no matter how large the value is set.

The error is as follows:

```
java.lang.Exception: Could not complete the stream element: Record @ (undef) : +I(xxx)
	at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.completeExceptionally(AsyncWaitOperator.java:636)
	at org.apache.flink.streaming.api.functions.async.AsyncFunction.timeout(AsyncFunction.java:97)
	at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.timerTriggered(AsyncWaitOperator.java:654)
	at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$registerTimeout$1(AsyncWaitOperator.java:649)
	at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.lambda$registerTimer$2(AsyncWaitOperator.java:433)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invokeProcessingTimeCallback(StreamTask.java:2186)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$deferCallbackToMailbox$27(StreamTask.java:2177)
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
	at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:101)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:414)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:383)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:368)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:1202)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:1146)
	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:976)
	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:955)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:768)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:580)
	at java.base/java.lang.Thread.run(Thread.java:991)
Caused by: java.util.concurrent.TimeoutException: Async function call has timed out.
	... 19 more
```

The root cause is still unknown, but there are two likely possibilities:

1. During upgrades, pods are recreated and their IP addresses change, which may cause metadata requests to take longer.
2. The Netty connection timeout is set to 120 seconds (`client.connect-timeout`). If the client sends a request to an IP that no longer exists but to which it previously had an established connection, it may wait the full 120 seconds before timing out.

### Solution

_No response_

### Are you willing to submit a PR?

- [ ] I'm willing to submit a PR!
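To make the second hypothesis more concrete, here is a minimal, hypothetical Java sketch. It is not Fluss or Flink code. It models two competing timers: a connect attempt to a stale address that fails only after its connect timeout expires (standing in for `client.connect-timeout`), and a caller that gives up after its own timeout (standing in for `table.exec.async-lookup.timeout`). Times are scaled down to milliseconds. The names `LookupTimeoutSketch`, `simulateLookup`, and `Outcome` are invented for illustration.

```java
import java.net.ConnectException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model only: not the Fluss client or the Flink AsyncWaitOperator.
public class LookupTimeoutSketch {

    /** What the caller (standing in for Flink's async lookup) observes. */
    enum Outcome { CALLER_TIMED_OUT, CONNECT_FAILED }

    /**
     * Assumption: the lookup depends on a connection to an address that no longer
     * answers, so the connect attempt fails only when connectTimeoutMs elapses.
     * The caller independently gives up after asyncTimeoutMs.
     */
    static Outcome simulateLookup(long connectTimeoutMs, long asyncTimeoutMs) {
        // Stand-in for a connect attempt to a stale IP: nothing ever answers,
        // so it completes exceptionally once the (scaled-down) connect timeout fires.
        CompletableFuture<String> connect = new CompletableFuture<>();
        CompletableFuture.delayedExecutor(connectTimeoutMs, TimeUnit.MILLISECONDS)
                .execute(() -> connect.completeExceptionally(
                        new ConnectException("connect timed out")));

        // The caller's view: the lookup result, bounded by the caller's own timeout.
        CompletableFuture<String> lookup = connect
                .thenApply(conn -> "row")
                .orTimeout(asyncTimeoutMs, TimeUnit.MILLISECONDS);
        try {
            lookup.join();
            throw new IllegalStateException("a stale address never answers in this model");
        } catch (CompletionException e) {
            // orTimeout completes with a TimeoutException; otherwise the connect error surfaces.
            return e.getCause() instanceof TimeoutException
                    ? Outcome.CALLER_TIMED_OUT
                    : Outcome.CONNECT_FAILED;
        }
    }

    public static void main(String[] args) {
        // Connect timeout longer than the caller's timeout: the caller gives up first.
        System.out.println(simulateLookup(500, 100));
        // Connect timeout shorter than the caller's timeout: the connect failure surfaces first.
        System.out.println(simulateLookup(100, 500));
    }
}
```

Under these assumptions, whichever timer fires first determines what the caller sees. When the connect timeout is the longer one, the caller reports a bare `TimeoutException`, as in the trace above, instead of a connection error that the client could react to, for example by refreshing metadata. The sketch models only this timer race. It does not model the real client's metadata handling or retries.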
