[ https://issues.apache.org/jira/browse/AIRFLOW-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654004#comment-16654004 ]

Ash Berlin-Taylor commented on AIRFLOW-3211:
--------------------------------------------

(Just a note that you shouldn't need to restart Airflow on deployment to get it 
to pick up new or changed DAGs.)

> Airflow losing track of running GCP Dataproc jobs upon Airflow restart
> ----------------------------------------------------------------------
>
>                 Key: AIRFLOW-3211
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3211
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: gcp
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Julie Chien
>            Assignee: Julie Chien
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.9.0, 1.10.0
>
>
> If Airflow restarts (say, due to deployments, system updates, or regular 
> machine restarts such as the weekly restarts in GCP App Engine) while it's 
> running a job on GCP Dataproc, it'll lose track of that job, mark the task as 
> failed, and eventually retry. However, the job may still be running on 
> Dataproc and may even finish successfully. So when Airflow retries and 
> resubmits the job, the same job runs twice. This can result in issues like 
> delayed workflows, increased costs, and duplicate data. 
>   
>  To reproduce:
> Setup:
>  # Install Airflow.
>  # Set up a GCP project with the Dataproc API enabled.
>  # On the box that's running Airflow, {{pip install google-api-python-client oauth2client}}.
>  # Install this DAG in the Airflow instance: 
> https://github.com/GoogleCloudPlatform/python-docs-samples/blob/b80895ed88ba86fce223df27a48bf481007ca708/composer/workflows/quickstart.py
>  Set up the Airflow variables as instructed at the top of the file (a sketch 
> follows this list).
>  # Start the Airflow scheduler and webserver if they're not running already. 
> Kick off a run of the above DAG through the Airflow UI. Wait for the cluster 
> to spin up and the job to start running on Dataproc.
>  # While the job's running, kill the scheduler and webserver, and then start 
> them back up.
>  # Wait for Airflow to retry the task. Click on the cluster in Dataproc to 
> observe that the job will have been resubmitted, even though the first job is 
> still running without error.
>   
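>  For step 4, a minimal sketch of setting those variables (the names 
> {{gcp_project}}, {{gce_zone}}, and {{gcs_bucket}} are assumed from the linked 
> sample; the values are placeholders):
> {code:python}
> # Hedged sketch: set the Airflow Variables the quickstart DAG reads.
> # Variable names are assumed from the linked sample; values are
> # placeholders for your own project.
> from airflow.models import Variable
> 
> Variable.set("gcp_project", "your-project-id")
> Variable.set("gce_zone", "us-central1-a")
> Variable.set("gcs_bucket", "your-output-bucket")
> {code}
>   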
>  At Etsy, we've customized the Dataproc operators so that after an Airflow 
> restart, the new task picks up where the old one left off; we've been happily 
> using this solution for the past 6 months. I'd like to submit a PR to merge 
> this change upstream.
>   
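>  To make the proposed direction concrete, below is a minimal sketch of the 
> resume-instead-of-resubmit idea (not the actual Etsy patch), using the v1 
> Dataproc REST API via {{google-api-python-client}}. The deterministic job-id 
> scheme, the {{PROJECT}} and {{REGION}} constants, and the helper names are 
> illustrative assumptions:
> {code:python}
> # Hedged sketch of "resume instead of resubmit" (not the actual Etsy
> # patch). PROJECT, REGION, and the helper names are illustrative.
> import time
> 
> from googleapiclient import discovery, errors
> 
> PROJECT = "your-project-id"  # assumption: your GCP project id
> REGION = "global"            # assumption: region the DAG submits to
> 
> 
> def job_id_for(dag_id, task_id, execution_date):
>     # A deterministic id lets a retried task find the job that the
>     # pre-restart attempt already submitted. Dataproc job ids may use
>     # letters, digits, hyphens, and underscores.
>     return "%s--%s--%s" % (
>         dag_id, task_id, execution_date.strftime("%Y%m%dT%H%M%S"))
> 
> 
> def submit_or_resume(dataproc, job_body, job_id):
>     # Submit under a fixed job id; if that id already exists, the job
>     # survived the restart and we poll it rather than rerun it.
>     job_body["reference"] = {"projectId": PROJECT, "jobId": job_id}
>     try:
>         dataproc.projects().regions().jobs().submit(
>             projectId=PROJECT, region=REGION,
>             body={"job": job_body}).execute()
>     except errors.HttpError as e:
>         # Assumption: a duplicate job id is rejected with HTTP 409
>         # (ALREADY_EXISTS); anything else is a real error.
>         if e.resp.status != 409:
>             raise
>     while True:
>         job = dataproc.projects().regions().jobs().get(
>             projectId=PROJECT, region=REGION, jobId=job_id).execute()
>         state = job["status"]["state"]
>         if state == "DONE":
>             return job
>         if state in ("ERROR", "CANCELLED"):
>             raise RuntimeError(
>                 "Dataproc job %s ended in state %s" % (job_id, state))
>         time.sleep(30)
> 
> 
> # Usage: build the client once and pass it to submit_or_resume.
> dataproc = discovery.build("dataproc", "v1")
> {code}
>  On a retry, {{job_id_for}} produces the same id as the first attempt, so 
> {{submit_or_resume}} falls through to polling the already-running job instead 
> of starting a duplicate.
>   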



