[
https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552939#comment-16552939
]
ASF GitHub Bot commented on FLINK-9900:
---------------------------------------
GitHub user zentol opened a pull request:
https://github.com/apache/flink/pull/6395
[FLINK-9900][tests] Harden ZooKeeperHighAvailabilityITCase
## What is the purpose of the change
This PR makes a few modifications to the `ZooKeeperHighAvailabilityITCase`
to reduce the chances for intermittent test failures and timeouts.
Changes:
## 1)
The test was moving files out of the HA storage directory with a simple
loop using `File#renameTo` and enforced that every move succeeded. Since
old checkpoints may be deleted asynchronously, a move can legitimately fail,
so this requirement does not always hold.
We now use a `FileVisitor` and ignore `IOExceptions` that occur while
moving.
If no checkpoint file could be moved the test will still fail.
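A minimal, stand-alone sketch of this approach, not the code from the PR: the class name `CheckpointFileMover` and the method `moveAll` are hypothetical, and only JDK `java.nio.file` calls are used. It walks the source directory, tries to move every file, swallows `IOException`s from individual moves, and returns the number of files that were actually moved so the caller can still fail when that number is zero.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper; names are illustrative and not taken from the Flink code base. */
public class CheckpointFileMover {

    /**
     * Tries to move every regular file below sourceDir into targetDir,
     * preserving the relative directory layout. Assumes targetDir is not
     * located inside sourceDir.
     *
     * @return the number of files that were moved successfully
     */
    public static int moveAll(Path sourceDir, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        AtomicInteger moved = new AtomicInteger();

        Files.walkFileTree(sourceDir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                try {
                    Path target = targetDir.resolve(sourceDir.relativize(file));
                    Files.createDirectories(target.getParent());
                    Files.move(file, target);
                    moved.incrementAndGet();
                } catch (IOException ignored) {
                    // The file may have been deleted concurrently; skip it.
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // A file that vanished between listing and visiting is not an error here.
                return FileVisitResult.CONTINUE;
            }
        });

        return moved.get();
    }
}
```

A caller would then assert that the returned count is greater than zero, which keeps the "at least one file was moved" guarantee without requiring every single move to succeed.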
## 2)
After the checkpoint files are moved out of the HA storage directory the
job is thrown into a restart loop. To verify the restart behavior the test was
polling the job state and checking for the `RESTARTING` and `FAILING` states.
Because the job is small, it stays in these states only for a short time,
which effectively adds a race condition. Thus this loop may run for longer than
anticipated; the largest outlier I got locally was 50 seconds, which isn't
_that_ far off from the 2 minute timeout. I suspect this to be the failure
cause raised in the JIRA, but I can't guarantee it.
Instead we now access the `fullRestarts` metric using a custom reporter to
check how many restarts have occurred. The actual _state transitions_ should be
irrelevant to the test.
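The reason a counter is more robust than state polling can be sketched without any Flink dependency. In the sketch below, which is an illustration and not the PR's implementation, the class `RestartCounterPoller` and the method `awaitAtLeast` are hypothetical names, and a plain `LongSupplier` stands in for the `fullRestarts` value captured by the custom reporter. The sketch assumes the counter never decreases, so a poll cannot miss a restart the way it can miss a short-lived job state.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.LongSupplier;

/** Hypothetical helper; the LongSupplier stands in for a restart-count gauge. */
public class RestartCounterPoller {

    /**
     * Polls the counter until it reaches at least the expected value.
     * Because the counter is assumed to never decrease, no restart is lost
     * between two polls, regardless of how briefly a job state was visible.
     *
     * @return the first observed counter value that is at least expected
     * @throws TimeoutException if the value is not reached within timeoutMillis
     */
    public static long awaitAtLeast(
            LongSupplier counter, long expected, long timeoutMillis, long pollMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            long current = counter.getAsLong();
            if (current >= expected) {
                return current;
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                    "Counter reached only " + current + ", expected at least " + expected);
            }
            Thread.sleep(pollMillis);
        }
    }
}
```

A test could, for example, call `awaitAtLeast(restartGauge, 5, 120_000, 50)` with a hypothetical `restartGauge` supplier and continue once five restarts have been observed, independent of which job states happened to be visible in between.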
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zentol/flink 9900
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/6395.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6395
----
commit b8827dc3723558c52ad567bf88f24ae34129ea08
Author: zentol <chesnay@...>
Date: 2018-07-23T14:21:32Z
[FLINK-9900][tests] Harden ZooKeeperHighAvailabilityITCase
----
> Failed to testRestoreBehaviourWithFaultyStateHandles
> (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.5.1, 1.6.0
> Reporter: zhangminglei
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.6.0
>
>
> https://api.travis-ci.org/v3/job/405843617/log.txt
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec
> <<< FAILURE! - in
> org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase
>
> testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
> Time elapsed: 120.036 sec <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 120000
> milliseconds
> at sun.misc.Unsafe.park(Native Method)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> at
> org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244)
> Results :
> Tests in error:
>
> ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244
> » TestTimedOut
> Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)