Jared Stehler created FLINK-9030:
------------------------------------

             Summary: JobManager fails to archive job to FS when TM is lost
                 Key: FLINK-9030
                 URL: https://issues.apache.org/jira/browse/FLINK-9030
             Project: Flink
          Issue Type: Bug
          Components: History Server, JobManager, Mesos
    Affects Versions: 1.4.0
            Reporter: Jared Stehler


We are running Flink on Mesos and are finding that when a job fails because a 
task manager is lost (from an OOM kill), the job isn't archived properly 
into the history server directory on the filesystem. 

When this happens, the job does appear in the finished listing in the job 
manager's in-memory archive view and is accessible through the running job 
manager's REST API, but not through the history server's REST API.
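
To make the discrepancy concrete, here is a rough sketch of how the two job
listings could be compared. It is only an illustration, not part of Flink: the
helper names are hypothetical, and the endpoint path and JSON shape used in
fetch_job_ids are assumptions that may not match a given Flink version.

```python
import json
from urllib.request import urlopen


def fetch_job_ids(base_url, path="/jobs/overview"):
    """Return the set of job IDs listed by a REST endpoint.

    ASSUMPTION: the endpoint answers with JSON shaped like
    {"jobs": [{"jid": "..."}, ...]}. Adjust `path` and the keys
    to whatever the deployed Flink version actually serves.
    """
    with urlopen(base_url.rstrip("/") + path, timeout=10) as resp:
        payload = json.load(resp)
    return {job["jid"] for job in payload.get("jobs", [])}


def missing_from_history(jobmanager_ids, history_ids):
    """Return job IDs the job manager knows about but the history server lacks."""
    return sorted(set(jobmanager_ids) - set(history_ids))
```

Calling missing_from_history with the IDs fetched from the job manager and
from the history server would list the jobs that were never archived.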

This is causing us issues because we use the history server as a system of 
record for canceled or failed jobs in order to determine their previous 
savepoints / externalized checkpoints.
