ASF GitHub Bot commented on FLINK-8487:
Github user aljoscha closed the pull request at:
> State loss after multiple restart attempts
> Key: FLINK-8487
> URL: https://issues.apache.org/jira/browse/FLINK-8487
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing
> Affects Versions: 1.3.2
> Reporter: Fabian Hueske
> Assignee: Aljoscha Krettek
> Priority: Blocker
> Fix For: 1.3.3, 1.5.0, 1.4.3
> A user [reported this
> on the firstname.lastname@example.org mailing list and analyzed the situation.
> - A program that reads from Kafka and computes counts in a keyed 15 minute
> tumbling window. StateBackend is RocksDB and checkpointing is enabled.
> .timeWindow(Time.of(window_size, TimeUnit.MINUTES))
> .allowedLateness(Time.of(late_by, TimeUnit.SECONDS))
> .reduce(new ReduceFunction(), new WindowFunction())
> - At some point HDFS went into a safe mode due to NameNode issues
> - The following exception was thrown
> Operation category WRITE is not supported in state standby. Visit
> - The pipeline came back after a few restarts and checkpoint failures, after
> the HDFS issues were resolved.
> - It was evident that operator state was lost. Either it was the Kafka
> consumer that kept on advancing it's offset between a start and the next
> checkpoint failure (a minute's worth) or the the operator that had partial
> aggregates was lost.
> The user did some in-depth analysis (see [mail
> and might have (according to [~aljoscha]) identified the problem.
> [~stefanrichte...@gmail.com], can you have a look at this issue and check if
> it is relevant?
This message was sent by Atlassian JIRA