[ https://issues.apache.org/jira/browse/FLINK-21416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302508#comment-17302508 ]
Piotr Nowojski edited comment on FLINK-21416 at 3/16/21, 1:15 PM:
------------------------------------------------------------------
The number of those failures is a bit suspicious. Has something changed
recently, either in the test setup or in the blocking partition, that could be
related?
I have a suspicion, especially after:
{noformat}
Caused by: org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms
    at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler$5.run(SslHandler.java:2054)
    ... 8 more
{noformat}
that this might be caused by us doing blocking IO in the Netty threads when
using the blocking partition. The handshake timeout could easily be explained by
that. Also, in the 5 reported failures that I checked, "connection reset by
peer" happens around 60s into the test. Maybe this is also a case where the
server froze for x seconds, causing the client to time out (did the error get
lost? maybe it's in the logs?), and the server side detected this and failed
with "connection reset by peer"?
Another pointer: it seems like all of the failures happened with SSL enabled.
I don't know why it has started to fail so frequently now. Maybe something
changed in the test setup or in the environment. I was always afraid that doing
blocking IO in the Netty threads could cause problems, but that's not something
we can easily change (assuming this is what's causing those issues). Maybe we
can speed up the test? Make it lighter? Or maybe we can increase some timeouts
(SSL-related?) either in Netty or in the TCP stack? What worries me is that
this issue apparently happens not only in the ITCase here, but also in the
(cluster?) benchmarks.
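If we go the timeout route, one option would be to raise the handshake timeout on the {{SslHandler}} when the pipeline is built. This is only a sketch under assumptions, not something I've tried here: the helper name and the 60s value are made up, and in Flink this would have to go through wherever the client/server pipeline is assembled rather than being hardcoded like this.
{code:java}
import io.netty.channel.Channel;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslHandler;

// Sketch only: wrap the channel with an SslHandler whose handshake timeout is
// raised above Netty's 10s default, so a temporarily stalled event loop does
// not immediately kill the connection.
final class SslHandshakeTimeoutExample {

    static SslHandler sslHandlerWithLongerTimeout(SslContext sslContext, Channel channel) {
        SslHandler handler = sslContext.newHandler(channel.alloc());
        // Netty 4 API; the "10000ms" in the SslHandshakeTimeoutException above
        // corresponds to this setting's default value.
        handler.setHandshakeTimeoutMillis(60_000);
        return handler;
    }
}
{code}
If I remember correctly, Flink also exposes a {{security.ssl.internal.handshake-timeout}} option that should map onto this, which might be the easier knob for the test setup.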
> FileBufferReaderITCase.testSequentialReading fails on azure
> -----------------------------------------------------------
>
> Key: FLINK-21416
> URL: https://issues.apache.org/jira/browse/FLINK-21416
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Guo Weijie
> Priority: Critical
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13473&view=logs&j=59c257d0-c525-593b-261d-e96a86f1926b&t=b93980e3-753f-5433-6a19-13747adae66a
> {code}
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
>     at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
>     at org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:811)
>     at org.apache.flink.runtime.io.network.partition.FileBufferReaderITCase.testSequentialReading(FileBufferReaderITCase.java:128)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>     at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
>     at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>     at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>     at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>     at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>     at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>     at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>     at org.junit.runners.Suite.runChild(Suite.java:128)
>     at org.junit.runners.Suite.runChild(Suite.java:27)
>     at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>     at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>     at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>     at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>     at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>     at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>     at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>     at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
>     at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:117)
>     at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:79)
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:221)
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:212)
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:203)
>     at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:650)
>     at org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:81)
>     at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:435)
>     at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
> {code}