[https://issues.apache.org/jira/browse/FLINK-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116716#comment-16116716]
Till Rohrmann commented on FLINK-7240:
--------------------------------------
The underlying problem is the following: the {{ExternalizedCheckpointITCase}}
executes multiple jobs per test on the same cluster. The individual jobs are
stopped via a {{CancelJob}} message, but we don't wait until the jobs have been
completely cancelled; the wait is only loosely enforced by a {{Thread.sleep}}
call. If this timeout is not long enough (e.g. on Travis), the next job may
simply fail because not enough resources are available yet. If we then try to
request a new checkpoint, it will always fail. Combined with the recursive
retry, this leads to the {{StackOverflowError}}.
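The failure mode can be sketched in isolation. The sketch below uses hypothetical names, not Flink's actual {{TestingCluster}} API: a retry implemented via recursion adds a stack frame on every failed attempt and, if the operation never succeeds, eventually throws a {{StackOverflowError}}, while an iterative, bounded retry does not.

```java
// Hypothetical sketch of the two retry strategies; not Flink code.
public class RetrySketch {

    interface Action {
        void run() throws Exception;
    }

    // Mirrors the failure mode from the stack trace: each failed attempt
    // re-enters the method, adding a stack frame, with no bound on depth.
    static void retryRecursively(Action action) throws Exception {
        try {
            action.run();
        } catch (Exception e) {
            retryRecursively(action); // unbounded recursion -> StackOverflowError
        }
    }

    // Stack-safe alternative: loop with a bounded number of attempts.
    static void retryWithLimit(Action action, int maxAttempts) throws Exception {
        Exception last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                action.run();
                return; // success
            } catch (Exception e) {
                last = e; // remember the most recent failure
            }
        }
        throw last; // give up after maxAttempts instead of retrying forever
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        // Succeeds on the third attempt; the loop variant handles this
        // with constant stack depth.
        retryWithLimit(() -> {
            if (++attempts[0] < 3) {
                throw new Exception("checkpoint not available yet");
            }
        }, 10);
        System.out.println("succeeded after " + attempts[0] + " attempts");
    }
}
```

A bounded loop also surfaces the real cause (the last exception) instead of burying it under an unbounded recursion.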
> Externalized RocksDB can fail with stackoverflow
> ------------------------------------------------
>
> Key: FLINK-7240
> URL: https://issues.apache.org/jira/browse/FLINK-7240
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing, Tests
> Affects Versions: 1.3.1, 1.4.0
> Environment: https://travis-ci.org/zentol/flink/jobs/255760513
> Reporter: Chesnay Schepler
> Assignee: Till Rohrmann
> Priority: Critical
> Labels: test-stability
>
> {code}
> testExternalizedFullRocksDBCheckpointsStandalone(org.apache.flink.test.checkpointing.ExternalizedCheckpointITCase)  Time elapsed: 146.894 sec  <<< ERROR!
> java.lang.StackOverflowError: null
> at java.util.Hashtable.get(Hashtable.java:363)
> at java.util.Properties.getProperty(Properties.java:969)
> at java.lang.System.getProperty(System.java:720)
> at sun.security.action.GetPropertyAction.run(GetPropertyAction.java:84)
> at sun.security.action.GetPropertyAction.run(GetPropertyAction.java:49)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.io.PrintWriter.<init>(PrintWriter.java:116)
> at java.io.PrintWriter.<init>(PrintWriter.java:100)
> at org.apache.log4j.DefaultThrowableRenderer.render(DefaultThrowableRenderer.java:58)
> at org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:87)
> at org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:413)
> at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:313)
> at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
> at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
> at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
> at org.apache.log4j.Category.callAppenders(Category.java:206)
> at org.apache.log4j.Category.forcedLog(Category.java:391)
> at org.apache.log4j.Category.log(Category.java:856)
> at org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:381)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:392)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:394)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:394)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:394)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:394)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:394)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:394)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:394)
> ...
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)