[https://issues.apache.org/jira/browse/FLINK-23323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17402249#comment-17402249]
Till Rohrmann commented on FLINK-23323:
---------------------------------------
Looking into this issue, it seems that we are running into FLINK-22893.
{code}
21:39:19,967 [mini-cluster-io-thread-8] INFO
org.apache.flink.runtime.highavailability.zookeeper.ZooKeeperHaServices [] -
Finished cleaning up the high availability data for job
8d8d05cfd1a5314e3379c31835060746.
21:39:19,967 [ Curator-Framework-0] ERROR
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl
[] - Background exception was not retry-able or retry gave up
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$NoNodeException:
KeeperErrorCode = NoNode for
/flink/default/jobs/8d8d05cfd1a5314e3379c31835060746/leader
at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:114)
~[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
~[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:792)
~[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:308)
~[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CreateBuilderImpl$9.performBackgroundOperation(CreateBuilderImpl.java:801)
~[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.OperationAndData.callPerformBackgroundOperation(OperationAndData.java:84)
~[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:965)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[?:1.8.0_282]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
[?:1.8.0_282]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
[?:1.8.0_282]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_282]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_282]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
21:39:19,969 [ Curator-Framework-0] WARN
org.apache.flink.runtime.minicluster.MiniCluster [] - Error in
MiniCluster. Shutting the MiniCluster down.
org.apache.flink.util.FlinkException: Exception during leader election of
DefaultDispatcherRunner occurred.
at
org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner.handleError(DefaultDispatcherRunner.java:199)
[flink-runtime_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$LeaderElectionFatalErrorHandler.onFatalError(DefaultLeaderElectionService.java:315)
[flink-runtime_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver.unhandledError(ZooKeeperLeaderElectionDriver.java:297)
[flink-runtime_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:713)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:709)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.handleBackgroundOperationException(CuratorFrameworkImpl.java:924)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:1001)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[?:1.8.0_282]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
[?:1.8.0_282]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
[?:1.8.0_282]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_282]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_282]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
Caused by: org.apache.flink.runtime.leaderelection.LeaderElectionException:
Unhandled error in ZooKeeperLeaderElectionDriver: Background exception was not
retry-able or retry gave up
... 18 more
Caused by:
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$NoNodeException:
KeeperErrorCode = NoNode for
/flink/default/jobs/8d8d05cfd1a5314e3379c31835060746/leader
at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:114)
~[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
~[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:792)
~[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:308)
~[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CreateBuilderImpl$9.performBackgroundOperation(CreateBuilderImpl.java:801)
~[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.OperationAndData.callPerformBackgroundOperation(OperationAndData.java:84)
~[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:965)
~[flink-shaded-zookeeper-3-3.4.14-13.0.jar:3.4.14-13.0]
... 9 more
21:39:19,973 [ Curator-Framework-0] INFO
org.apache.flink.runtime.minicluster.MiniCluster [] - Shutting down
Flink Mini Cluster
21:39:19,973 [flink-akka.actor.default-dispatcher-4] INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Stopping
TaskExecutor akka://flink/user/rpc/taskmanager_1.
21:39:19,973 [flink-akka.actor.default-dispatcher-2] INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Stopping
TaskExecutor akka://flink/user/rpc/taskmanager_0.
21:39:19,973 [flink-akka.actor.default-dispatcher-2] INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Close
ResourceManager connection db9ef8078dbfe684507ea68391c114ae.
21:39:19,973 [flink-akka.actor.default-dispatcher-4] INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Close
ResourceManager connection db9ef8078dbfe684507ea68391c114ae.
21:39:19,973 [flink-akka.actor.default-dispatcher-3] INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Closing
TaskExecutor connection 5c835ca8-f734-4724-b781-a2c5d4de44dd because: The
TaskExecutor is shutting down.
21:39:19,973 [ Curator-Framework-0] INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint [] - Shutting down
rest endpoint.
21:39:19,974 [flink-akka.actor.default-dispatcher-2] INFO
org.apache.flink.runtime.state.TaskExecutorStateChangelogStoragesManager [] -
Shutting down TaskExecutorStateChangelogStoragesManager.
21:39:19,974 [flink-akka.actor.default-dispatcher-4] INFO
org.apache.flink.runtime.state.TaskExecutorStateChangelogStoragesManager [] -
Shutting down TaskExecutorStateChangelogStoragesManager.
21:39:19,977 [flink-akka.actor.default-dispatcher-6] INFO
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Added
JobGraph(jobId: 7a75c3344542508a40abbf20f5773ac6) to
ZooKeeperStateHandleStore{namespace='flink/default/jobgraphs'}.
21:39:19,978 [flink-akka.actor.default-dispatcher-6] INFO
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] -
Starting DefaultLeaderElectionService with
ZooKeeperLeaderElectionDriver{leaderPath='/jobs/7a75c3344542508a40abbf20f5773ac6/leader/connection_info'}.
21:39:19,980 [Flink Queryable State Proxy Server Thread 0] ERROR
org.apache.flink.queryablestate.network.AbstractServerHandler [] - Error while
handling request with ID 154
org.apache.flink.queryablestate.exceptions.UnknownLocationException: Could not
retrieve location of state=jungle of job=7a75c3344542508a40abbf20f5773ac6.
Potential reasons are: i) the state is not ready, or ii) the job does not exist.
at
org.apache.flink.queryablestate.client.proxy.KvStateClientProxyHandler.getKvStateLookupInfo(KvStateClientProxyHandler.java:247)
~[classes/:?]
at
org.apache.flink.queryablestate.client.proxy.KvStateClientProxyHandler.getState(KvStateClientProxyHandler.java:164)
~[classes/:?]
at
org.apache.flink.queryablestate.client.proxy.KvStateClientProxyHandler.executeActionAsync(KvStateClientProxyHandler.java:131)
~[classes/:?]
at
org.apache.flink.queryablestate.client.proxy.KvStateClientProxyHandler.handleRequest(KvStateClientProxyHandler.java:121)
~[classes/:?]
at
org.apache.flink.queryablestate.client.proxy.KvStateClientProxyHandler.handleRequest(KvStateClientProxyHandler.java:63)
~[classes/:?]
at
org.apache.flink.queryablestate.network.AbstractServerHandler$AsyncRequestTask.run(AbstractServerHandler.java:258)
~[flink-queryable-state-client-java-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
~[?:1.8.0_282]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
~[?:1.8.0_282]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_282]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_282]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
{code}
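The suspected interleaving can be modelled with a simplified, self-contained sketch. This is only a model: the in-memory node tree, {{NoNodeRaceSketch}} and its nested {{NoNodeException}} are hypothetical stand-ins, not the ZooKeeper or Curator API. The ordering is one interleaving consistent with the cleanup and the {{NoNode}} error both being logged at 21:39:19,967; the log does not prove it. In that ordering, the parents of the leader node are created, the HA cleanup then removes the job's subtree, and only afterwards is the {{leader}} node created.

{code:java}
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical, simplified model of a ZooKeeper-like node tree. It is NOT the
 * ZooKeeper or Curator API; it only enforces that a node can be created only
 * while its parent exists.
 */
public class NoNodeRaceSketch {

    /** Stand-in for ZooKeeper's NoNodeException (assumption, not the real class). */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    private final Set<String> nodes = new HashSet<>();

    NoNodeRaceSketch() {
        nodes.add("/"); // the root always exists
    }

    private static String parentOf(String path) {
        int idx = path.lastIndexOf('/');
        return idx == 0 ? "/" : path.substring(0, idx);
    }

    /** Creates a single node and fails if its parent is missing. */
    void create(String path) throws NoNodeException {
        if (!nodes.contains(parentOf(path))) {
            throw new NoNodeException(path);
        }
        nodes.add(path);
    }

    /** Removes a node and all of its descendants (models the HA data cleanup). */
    void deleteRecursively(String path) {
        nodes.removeIf(p -> p.equals(path) || p.startsWith(path + "/"));
    }

    /** Replays the assumed interleaving and returns the outcome of the final create. */
    static String replay(String jobId) {
        NoNodeRaceSketch zk = new NoNodeRaceSketch();
        String jobPath = "/flink/default/jobs/" + jobId;
        try {
            // Step 1: mkdirs-style creation of every parent segment.
            zk.create("/flink");
            zk.create("/flink/default");
            zk.create("/flink/default/jobs");
            zk.create(jobPath);
            // Step 2: the HA cleanup for the finished job removes its whole subtree.
            zk.deleteRecursively(jobPath);
            // Step 3: the pending create of the leader node no longer has a parent.
            zk.create(jobPath + "/leader");
            return "created";
        } catch (NoNodeException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(replay("8d8d05cfd1a5314e3379c31835060746"));
    }
}
{code}

Without the cleanup in step 2, the same sequence of creates succeeds. In this model, the failure therefore depends only on the ordering, which may explain why it shows up only occasionally on CI.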
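This does not explain the OOM, but it does explain all of the subsequent test failures. The unhandled error in the leader election is escalated as a fatal error, which shuts down the {{MiniCluster}}, so the remaining tests run against a cluster that is no longer available.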
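Here is a sketch of how one background error can take down the rest of the test class. It is a hypothetical model, not Flink code: {{SharedCluster}}, {{FatalErrorListener}} and {{runSuite}} are illustrative names. The only assumption is that all test methods share one cluster, and that this cluster stops for good once its fatal error callback is invoked, as the "Shutting the MiniCluster down" entry suggests.

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical sketch, not Flink code: one cluster is shared by every test
 * method and stops for good as soon as a fatal error is reported to it.
 */
public class SharedClusterSketch {

    /** Illustrative callback for unrecoverable errors (assumed name). */
    interface FatalErrorListener {
        void onFatalError(Throwable cause);
    }

    /** Illustrative stand-in for a cluster shared by a whole test class. */
    static class SharedCluster implements FatalErrorListener {
        private final AtomicBoolean running = new AtomicBoolean(true);

        @Override
        public void onFatalError(Throwable cause) {
            // Analogous to "Error in MiniCluster. Shutting the MiniCluster down."
            running.set(false);
        }

        /** Accepts work only while the cluster is still running. */
        void submitJob(String name) {
            if (!running.get()) {
                throw new IllegalStateException("Cluster is shut down, cannot run " + name);
            }
        }
    }

    /**
     * Runs the tests in order against one shared cluster and injects a fatal
     * error right before the test at index fatalErrorBeforeIndex.
     */
    static List<String> runSuite(SharedCluster cluster, List<String> tests, int fatalErrorBeforeIndex) {
        List<String> results = new ArrayList<>();
        for (int i = 0; i < tests.size(); i++) {
            if (i == fatalErrorBeforeIndex) {
                // A background leader election error escalates to the fatal error callback.
                cluster.onFatalError(new RuntimeException("NoNode during leader election"));
            }
            String test = tests.get(i);
            try {
                cluster.submitJob(test);
                results.add(test + ": PASS");
            } catch (IllegalStateException e) {
                results.add(test + ": FAIL");
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> tests = Arrays.asList("test1", "test2", "test3", "test4");
        // Prints [test1: PASS, test2: FAIL, test3: FAIL, test4: FAIL]
        System.out.println(runSuite(new SharedCluster(), tests, 1));
    }
}
{code}

Once the error is injected, every later test fails regardless of what it checks. That is similar in shape to the many errors in a single test class reported in the build below.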
Since FLINK-22893 has been fixed and I could not reproduce the OOM, I will close this ticket as "Cannot Reproduce".
> HAQueryableStateRocksDBBackendITCase failed due to heap OOM
> -----------------------------------------------------------
>
> Key: FLINK-23323
> URL: https://issues.apache.org/jira/browse/FLINK-23323
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Queryable State
> Affects Versions: 1.14.0
> Reporter: Xintong Song
> Priority: Major
> Labels: auto-deprioritized-critical, test-stability
> Fix For: 1.14.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20195&view=logs&j=c91190b6-40ae-57b2-5999-31b869b0a7c1&t=43529380-51b4-5e90-5af4-2dccec0ef402&l=14431
> {code}
> Jul 08 21:43:22 [ERROR] Tests run: 12, Failures: 0, Errors: 9, Skipped: 1,
> Time elapsed: 246.345 s <<< FAILURE! - in
> org.apache.flink.queryablestate.itcases.HAQueryableStateRocksDBBackendITCase
> Jul 08 21:43:22 [ERROR]
> testReducingState(org.apache.flink.queryablestate.itcases.HAQueryableStateRocksDBBackendITCase)
> Time elapsed: 241.454 s <<< ERROR!
> Jul 08 21:43:22 java.lang.OutOfMemoryError: Java heap space
> {code}