Github user tillrohrmann commented on a diff in the pull request:
https://github.com/apache/flink/pull/1954#discussion_r67513578
--- Diff:
flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionGraphRestartTest.java
---
@@ -174,71 +140,54 @@ private void validateConstraints(ExecutionGraph eg) {
@Test
public void testRestartAutomatically() throws Exception {
- Instance instance = ExecutionGraphTestUtils.getInstance(
- new SimpleActorGateway(TestingUtils.directExecutionContext()),
- NUM_TASKS);
+ RestartStrategy restartStrategy = new FixedDelayRestartStrategy(1, 1000);
+ Tuple2<ExecutionGraph, Instance> executionGraphInstanceTuple = createExecutionGraph(restartStrategy);
+ ExecutionGraph eg = executionGraphInstanceTuple.f0;
- Scheduler scheduler = new Scheduler(TestingUtils.defaultExecutionContext());
- scheduler.newInstanceAvailable(instance);
-
- JobVertex sender = new JobVertex("Task");
- sender.setInvokableClass(Tasks.NoOpInvokable.class);
- sender.setParallelism(NUM_TASKS);
-
- JobGraph jobGraph = new JobGraph("Pointwise job", sender);
-
- ExecutionGraph eg = new ExecutionGraph(
- TestingUtils.defaultExecutionContext(),
- new JobID(),
- "Test job",
- new Configuration(),
- ExecutionConfigTest.getSerializedConfig(),
- AkkaUtils.getDefaultTimeout(),
- new FixedDelayRestartStrategy(1, 1000));
-
eg.attachJobGraph(jobGraph.getVerticesSortedTopologicallyFromSources());
+ restartAfterFailure(eg, new FiniteDuration(2, TimeUnit.MINUTES), true);
+ }
- assertEquals(JobStatus.CREATED, eg.getState());
+ @Test
+ public void taskShouldFailWhenFailureRateLimitExceeded() throws Exception {
+ FailureRateRestartStrategy restartStrategy = new FailureRateRestartStrategy(2, TimeUnit.SECONDS, 0);
+ FiniteDuration timeout = new FiniteDuration(50, TimeUnit.MILLISECONDS);
+ Tuple2<ExecutionGraph, Instance> executionGraphInstanceTuple = createExecutionGraph(restartStrategy);
+ ExecutionGraph eg = executionGraphInstanceTuple.f0;
+
+ restartAfterFailure(eg, timeout, false);
+ restartAfterFailure(eg, timeout, false);
+ //failure rate limit not exceeded yet, so task is running
+ assertEquals(JobStatus.RUNNING, eg.getState());
+ Thread.sleep(1000); //wait for a second to restart limit rate
- eg.scheduleForExecution(scheduler);
+ restartAfterFailure(eg, timeout, false);
+ restartAfterFailure(eg, timeout, false);
+ makeAFailureAndWait(eg, timeout);
--- End diff --
Can we try to harden this test a little bit? The problem is that on Travis,
concurrent executions (e.g. the restart future) can take quite some time. Thus,
the test can easily run into the 50 millisecond timeout, or the three failures
may not occur within one second, even though the test passes without problems
on your local machine.
I think it would be better to split the test so that the first half and the
second half are covered by separate test cases. In the second test case, we
should increase the failure interval to make sure that we can produce 3
failures within that time interval.
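
To illustrate why the short interval is fragile, here is a minimal standalone
sketch. It assumes the rate limit behaves like a sliding window over failure
timestamps; `FailureRateWindow` and `recordFailureAndCheck` are hypothetical
names made up for this example and are not the `FailureRateRestartStrategy`
from this PR. Timestamps are passed in explicitly so the example does not
depend on real scheduling delays.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window failure counter, for illustration only. */
final class FailureRateWindow {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateWindow(int maxFailuresPerInterval, long intervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    /**
     * Records a failure at nowMillis and returns true if a restart is
     * still allowed, i.e. the limit for the current window is not exceeded.
     */
    boolean recordFailureAndCheck(long nowMillis) {
        // Drop failures that are no longer inside the window.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() >= intervalMillis) {
            failureTimestamps.pollFirst();
        }
        failureTimestamps.addLast(nowMillis);
        return failureTimestamps.size() <= maxFailuresPerInterval;
    }

    public static void main(String[] args) {
        // Assumed timings of three failures on a slow CI machine.
        long[] slowFailures = {0L, 600L, 1_200L};

        // One-second window: the first failure has aged out by the third one,
        // so the limit of 2 is never exceeded and the job keeps restarting.
        FailureRateWindow tight = new FailureRateWindow(2, 1_000L);
        boolean tightAllowsRestart = true;
        for (long t : slowFailures) {
            tightAllowsRestart = tight.recordFailureAndCheck(t);
        }
        System.out.println("1 s window, restart allowed: " + tightAllowsRestart);

        // One-minute window: all three failures are counted together,
        // so the third failure exceeds the limit regardless of the delays.
        FailureRateWindow generous = new FailureRateWindow(2, 60_000L);
        boolean generousAllowsRestart = true;
        for (long t : slowFailures) {
            generousAllowsRestart = generous.recordFailureAndCheck(t);
        }
        System.out.println("60 s window, restart allowed: " + generousAllowsRestart);
    }
}
```

With the one-second window the third failure is still allowed to restart, so
a test that expects the job to fail would break on a slow machine. With the
larger window the outcome no longer depends on how long the restarts take.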