[ https://issues.apache.org/jira/browse/FLINK-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jark Wu updated FLINK-13940:
----------------------------
Fix Version/s: 1.9.2 (was: 1.9.1)
> S3RecoverableWriter causes job to get stuck in recovery
> -------------------------------------------------------
>
> Key: FLINK-13940
> URL: https://issues.apache.org/jira/browse/FLINK-13940
> Project: Flink
> Issue Type: Bug
> Components: Connectors / FileSystem
> Affects Versions: 1.8.0, 1.8.1, 1.9.0
> Reporter: Jimmy Weibel Rasmussen
> Assignee: Kostas Kloudas
> Priority: Major
> Fix For: 1.10.0, 1.9.2
>
>
>
> The cleaning up of tmp files in S3 introduced by this ticket/PR:
> https://issues.apache.org/jira/browse/FLINK-10963
> is preventing the Flink job from recovering under some
> circumstances.
>
> This is what seems to be happening:
> When the job tries to recover, it will call initializeState() on all
> operators, which results in the Bucket.restoreInProgressFile method being
> called.
> This will download the part_tmp file mentioned in the checkpoint that we're
> restoring from, and finally it will call fsWriter.cleanupRecoverableState
> which deletes the part_tmp file in S3.
> Now the open() method is called on all operators. If the open() call fails
> for one of the operators (this might happen if the issue that caused the job
> to fail and restart is still unresolved), the job will fail again and try to
> restart from the same checkpoint as before. This time however, downloading
> the part_tmp file mentioned in the checkpoint fails because it was deleted
> during the last recover attempt.
> The bug is critical because it results in data loss.
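
The sequence described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not Flink code: the in-memory map standing in for S3, the class RecoverySketch, the key "_part_tmp_example" and the helper attemptRestart are all hypothetical names made up for illustration. Only the order of the steps (restore and delete the tmp file first, then open the operators) is taken from the description.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RecoverySketch {

    // Hypothetical in-memory stand-in for the S3 bucket holding the tmp part files.
    static final Map<String, String> s3 = new HashMap<>();

    // Mimics the restore step: fetch the tmp object named in the checkpoint,
    // then delete it (the role played by fsWriter.cleanupRecoverableState).
    static String restoreInProgressFile(String tmpKey) throws FileNotFoundException {
        String data = s3.get(tmpKey);
        if (data == null) {
            throw new FileNotFoundException(tmpKey);
        }
        s3.remove(tmpKey);
        return data;
    }

    // One restart attempt: state is restored first, operators are opened second.
    static String attemptRestart(String tmpKey, boolean sourceReachable) {
        try {
            restoreInProgressFile(tmpKey);      // initializeState() phase
            if (!sourceReachable) {             // open() phase
                throw new IllegalStateException("source unreachable");
            }
            return "RUNNING";
        } catch (FileNotFoundException e) {
            return "FAILED: FileNotFoundException";
        } catch (IllegalStateException e) {
            return "FAILED: " + e.getMessage();
        }
    }

    // Two restarts from the same checkpoint: the source is down for the first
    // attempt and back up for the second.
    static List<String> simulate() {
        s3.clear();
        String tmpKey = "_part_tmp_example";    // hypothetical object key
        s3.put(tmpKey, "bytes");
        List<String> outcomes = new ArrayList<>();
        outcomes.add(attemptRestart(tmpKey, false));
        outcomes.add(attemptRestart(tmpKey, true));
        return outcomes;
    }

    public static void main(String[] args) {
        simulate().forEach(System.out::println);
    }
}
```

In this sketch the first attempt fails in the open() phase, after the tmp object has already been deleted, so the second attempt fails on the missing object even though the source is reachable again.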
>
>
>
> I discovered the bug because I have a Flink job with a RabbitMQ source and a
> StreamingFileSink that writes to S3 (and therefore uses the
> S3RecoverableWriter).
> Occasionally I have RabbitMQ connection issues which cause the job to
> fail and restart. Sometimes the first few restart attempts fail because
> RabbitMQ is unreachable when Flink tries to reconnect.
>
> This is what I was seeing:
> 1. RabbitMQ goes down
> 2. Job fails because of a RabbitMQ ConsumerCancelledException
> 3. Job attempts to restart but fails with a RabbitMQ connection exception
>    (x number of times)
> 4. RabbitMQ is back up
> 5. Job attempts to restart but fails with a FileNotFoundException due to
>    some _part_tmp file missing in S3.
>
> The job will be unable to restart, and the only option is to cancel and
> restart the job (and lose all state).
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)