ASF GitHub Bot commented on FLINK-8487:
Github user StephanEwen commented on a diff in the pull request:
@@ -223,6 +224,81 @@
+ * Retry the given operation with the given delay in between successful completions where the
+ * result does not match a given predicate.
+ * @param operation to retry
+ * @param retryDelay delay between retries
+ * @param deadline A deadline that specifies at what point we should stop retrying
+ * @param acceptancePredicate Predicate to test whether the result is acceptable
+ * @param scheduledExecutor executor to be used for the retry operation
+ * @param <T> type of the result
+ * @return Future which retries the given operation a given amount of times and delays the
+ * retry in case the predicate isn't matched
+ */
+ public static <T> CompletableFuture<T> retrySuccessfulWithDelay(
+ final Supplier<CompletableFuture<T>> operation,
+ final Time retryDelay,
--- End diff --
Deadline uses `Duration`, this method uses `Time`.
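To illustrate the reviewer's point, here is a minimal standalone sketch of such a retry helper that uses `java.time.Duration` consistently for both the retry delay and the overall deadline. This is an assumption-laden illustration, not Flink's actual `FutureUtils` implementation; the class name `RetrySketch` and the deadline handling are invented for the example.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class RetrySketch {

    /**
     * Retries the operation with the given delay between successful completions
     * until the predicate accepts the result or the deadline passes.
     * Both time arguments are Durations, avoiding the Time/Duration mismatch.
     */
    public static <T> CompletableFuture<T> retrySuccessfulWithDelay(
            Supplier<CompletableFuture<T>> operation,
            Duration retryDelay,
            Duration deadline,
            Predicate<T> acceptancePredicate,
            ScheduledExecutorService scheduledExecutor) {
        CompletableFuture<T> result = new CompletableFuture<>();
        long deadlineNanos = System.nanoTime() + deadline.toNanos();
        retry(result, operation, retryDelay, deadlineNanos, acceptancePredicate, scheduledExecutor);
        return result;
    }

    private static <T> void retry(
            CompletableFuture<T> result,
            Supplier<CompletableFuture<T>> operation,
            Duration retryDelay,
            long deadlineNanos,
            Predicate<T> acceptancePredicate,
            ScheduledExecutorService scheduledExecutor) {
        operation.get().whenComplete((value, error) -> {
            if (error != null) {
                // This sketch does not retry exceptional completions.
                result.completeExceptionally(error);
            } else if (acceptancePredicate.test(value)) {
                result.complete(value);
            } else if (System.nanoTime() + retryDelay.toNanos() < deadlineNanos) {
                // Result not acceptable yet: schedule another attempt after the delay.
                scheduledExecutor.schedule(
                        () -> retry(result, operation, retryDelay, deadlineNanos,
                                acceptancePredicate, scheduledExecutor),
                        retryDelay.toNanos(), TimeUnit.NANOSECONDS);
            } else {
                result.completeExceptionally(
                        new IllegalStateException("Deadline exceeded without an acceptable result"));
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();
        // The operation "succeeds" every time but only yields an acceptable value
        // (>= 3) on the third call, so two delayed retries are scheduled.
        Integer value = retrySuccessfulWithDelay(
                () -> CompletableFuture.completedFuture(attempts.incrementAndGet()),
                Duration.ofMillis(10),
                Duration.ofSeconds(5),
                v -> v >= 3,
                executor).get();
        System.out.println(value);
        executor.shutdown();
    }
}
```

Using one time type throughout means callers cannot pass a delay and a deadline measured with incompatible abstractions, which is the inconsistency the review flags.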
> State loss after multiple restart attempts
> Key: FLINK-8487
> URL: https://issues.apache.org/jira/browse/FLINK-8487
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing
> Affects Versions: 1.3.2
> Reporter: Fabian Hueske
> Priority: Blocker
> Fix For: 1.5.0, 1.4.3
> A user reported this
> on the email@example.com mailing list and analyzed the situation.
> - A program that reads from Kafka and computes counts in a keyed 15 minute
> tumbling window. StateBackend is RocksDB and checkpointing is enabled.
> .timeWindow(Time.of(window_size, TimeUnit.MINUTES))
> .allowedLateness(Time.of(late_by, TimeUnit.SECONDS))
> .reduce(new ReduceFunction(), new WindowFunction())
> - At some point HDFS went into a safe mode due to NameNode issues
> - The following exception was thrown
> Operation category WRITE is not supported in state standby. Visit
> - The pipeline came back after a few restarts and checkpoint failures, after
> the HDFS issues were resolved.
> - It was evident that operator state was lost. Either the Kafka
> consumer kept advancing its offset between a restart and the next
> checkpoint failure (a minute's worth), or the operator that held partial
> aggregates was lost.
> The user did some in-depth analysis (see the mail
> thread) and might have (according to [~aljoscha]) identified the problem.
> [~stefanrichte...@gmail.com], can you have a look at this issue and check if
> it is relevant?
This message was sent by Atlassian JIRA