[https://issues.apache.org/jira/browse/FLINK-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116716#comment-16116716]
Till Rohrmann commented on FLINK-7240:
--------------------------------------
The underlying problem is the following: the {{ExternalizedCheckpointITCase}}
executes multiple jobs per test on the same cluster. The individual jobs are
stopped via a {{CancelJob}} message, but we don't wait until the jobs have been
completely cancelled; the wait is only loosely enforced by a {{Thread.sleep}}
call. If this timeout is not long enough (e.g. on Travis), the next job may
simply fail because not enough resources are available yet. If we then try to
request a new checkpoint, it will always fail. Combined with the recursive
retry, this leads to the {{StackOverflowError}}.
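The failure mode can be sketched in isolation. The sketch below uses hypothetical names, not Flink's actual {{TestingCluster}} API: a retry implemented via recursion adds a stack frame on every failed attempt and, if the operation never succeeds, eventually throws a {{StackOverflowError}}, while an iterative, bounded retry does not.

```java
// Hypothetical sketch of the two retry strategies; not Flink code.
public class RetrySketch {

    interface Action {
        void run() throws Exception;
    }

    // Mirrors the failure mode from the stack trace: each failed attempt
    // re-enters the method, adding a stack frame, with no bound on depth.
    static void retryRecursively(Action action) throws Exception {
        try {
            action.run();
        } catch (Exception e) {
            retryRecursively(action); // unbounded recursion -> StackOverflowError
        }
    }

    // Stack-safe alternative: loop with a bounded number of attempts.
    static void retryWithLimit(Action action, int maxAttempts) throws Exception {
        Exception last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                action.run();
                return; // success
            } catch (Exception e) {
                last = e; // remember the most recent failure
            }
        }
        throw last; // give up after maxAttempts instead of retrying forever
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        // Succeeds on the third attempt; the loop variant handles this
        // with constant stack depth.
        retryWithLimit(() -> {
            if (++attempts[0] < 3) {
                throw new Exception("checkpoint not available yet");
            }
        }, 10);
        System.out.println("succeeded after " + attempts[0] + " attempts");
    }
}
```

A bounded loop also surfaces the real cause (the last exception) instead of burying it under an unbounded recursion.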
> Externalized RocksDB can fail with stackoverflow
> ------------------------------------------------
>
> Key: FLINK-7240
> URL: https://issues.apache.org/jira/browse/FLINK-7240
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing, Tests
> Affects Versions: 1.3.1, 1.4.0
> Environment: https://travis-ci.org/zentol/flink/jobs/255760513
> Reporter: Chesnay Schepler
> Assignee: Till Rohrmann
> Priority: Critical
> Labels: test-stability
>
> {code}
> testExternalizedFullRocksDBCheckpointsStandalone(org.apache.flink.test.checkpointing.ExternalizedCheckpointITCase)  Time elapsed: 146.894 sec  <<< ERROR!
> java.lang.StackOverflowError: null
> at java.util.Hashtable.get(Hashtable.java:363)
> at java.util.Properties.getProperty(Properties.java:969)
> at java.lang.System.getProperty(System.java:720)
> at sun.security.action.GetPropertyAction.run(GetPropertyAction.java:84)
> at sun.security.action.GetPropertyAction.run(GetPropertyAction.java:49)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.io.PrintWriter.<init>(PrintWriter.java:116)
> at java.io.PrintWriter.<init>(PrintWriter.java:100)
> at org.apache.log4j.DefaultThrowableRenderer.render(DefaultThrowableRenderer.java:58)
> at org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:87)
> at org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:413)
> at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:313)
> at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
> at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
> at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
> at org.apache.log4j.Category.callAppenders(Category.java:206)
> at org.apache.log4j.Category.forcedLog(Category.java:391)
> at org.apache.log4j.Category.log(Category.java:856)
> at org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:381)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:392)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:394)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:394)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:394)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:394)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:394)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:394)
> at org.apache.flink.runtime.testingUtils.TestingCluster.requestCheckpoint(TestingCluster.scala:394)
> ...
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)