[ 
https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552939#comment-16552939
 ] 

ASF GitHub Bot commented on FLINK-9900:
---------------------------------------

GitHub user zentol opened a pull request:

    https://github.com/apache/flink/pull/6395

    [FLINK-9900][tests] Harden ZooKeeperHighAvailabilityITCase

    ## What is the purpose of the change
    
    This PR makes a few modifications to the `ZooKeeperHighAvailabilityITCase` 
to reduce the chances for intermittent test failures and timeouts.
    
    Changes:
    ## 1)
    The test moved files out of the HA storage directory with a simple 
loop using `File#renameTo` and enforced that every move succeeded. However, 
since old checkpoints may be deleted asynchronously, a move can legitimately 
fail.
    We now use a `FileVisitor` and ignore `IOExceptions` that occur while 
moving.
    If no checkpoint file could be moved the test will still fail.
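
    For illustration, a minimal sketch of such a lenient move. This is not the 
code from the PR: the class `LenientFileMover` and the method 
`moveFilesLeniently` are hypothetical names, and the sketch assumes the target 
directory lies outside the source directory.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of Flink.
final class LenientFileMover {

    /**
     * Moves every regular file below sourceDir to the same relative path below
     * targetDir. IOExceptions for individual files are ignored, since a file may
     * have been deleted concurrently. Returns the number of files actually moved.
     */
    static int moveFilesLeniently(Path sourceDir, Path targetDir) throws IOException {
        AtomicInteger moved = new AtomicInteger();
        Files.walkFileTree(sourceDir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                try {
                    Path target = targetDir.resolve(sourceDir.relativize(file));
                    Files.createDirectories(target.getParent());
                    Files.move(file, target);
                    moved.incrementAndGet();
                } catch (IOException ignored) {
                    // The file may have vanished in the meantime; skip it.
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Also tolerate files that disappear before they can be visited.
                return FileVisitResult.CONTINUE;
            }
        });
        return moved.get();
    }
}
```

    A caller can then assert that the returned count is greater than zero, 
which keeps the "at least one file was moved" check without requiring every 
single move to succeed.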
    
    ## 2)
    After the checkpoint files are moved out of the HA storage directory, the 
job is thrown into a restart loop. To verify the restart behavior, the test 
polled the job state and checked for the `RESTARTING` and `FAILING` states.
    Because the job is small, it is in these states only for a short time, 
which effectively adds a race condition. The polling loop may therefore run 
longer than anticipated; the largest outlier I got locally was 50 seconds, 
which isn't _that_ far off from the 2 minute timeout. I suspect this is the 
cause of the failure reported in the JIRA, but I can't guarantee it.
    Instead we now access the `fullRestarts` metric using a custom reporter to 
check how many restarts have occurred. The actual _state transitions_ should be 
irrelevant to the test.
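
    As a rough sketch of the idea, and not the reporter used in the PR: the 
helper below waits until a restart counter reaches a threshold or a deadline 
passes. The class `RestartCountWaiter` and the method `awaitRestarts` are 
hypothetical, and the `LongSupplier` merely stands in for however the test 
reads the `fullRestarts` value from its reporter.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Flink.
final class RestartCountWaiter {

    /**
     * Polls the supplied counter until it is at least minRestarts and returns the
     * observed value. Throws TimeoutException if the deadline passes first.
     */
    static long awaitRestarts(LongSupplier fullRestarts, long minRestarts, Duration timeout)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            long current = fullRestarts.getAsLong();
            if (current >= minRestarts) {
                return current;
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                        "Saw only " + current + " restarts, expected at least " + minRestarts);
            }
            // The counter only ever grows, so a coarse polling interval cannot miss it.
            Thread.sleep(50);
        }
    }
}
```

    Unlike a short-lived job state, a restart count never goes back down, so 
the check cannot miss a restart that happens between two polls.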

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zentol/flink 9900

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6395.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6395
    
----
commit b8827dc3723558c52ad567bf88f24ae34129ea08
Author: zentol <chesnay@...>
Date:   2018-07-23T14:21:32Z

    [FLINK-9900][tests] Harden ZooKeeperHighAvailabilityITCase

----


> Failed to testRestoreBehaviourWithFaultyStateHandles 
> (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) 
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-9900
>                 URL: https://issues.apache.org/jira/browse/FLINK-9900
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.5.1, 1.6.0
>            Reporter: zhangminglei
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.6.0
>
>
> https://api.travis-ci.org/v3/job/405843617/log.txt
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec 
> <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase
>  
> testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
>  Time elapsed: 120.036 sec <<< ERROR!
>  org.junit.runners.model.TestTimedOutException: test timed out after 120000 
> milliseconds
>  at sun.misc.Unsafe.park(Native Method)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
>  at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>  at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
>  at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>  at 
> org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244)
> Results :
> Tests in error: 
>  
> ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244
>  » TestTimedOut
> Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
