[ https://issues.apache.org/jira/browse/AIRFLOW-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jarek Potiuk updated AIRFLOW-3211:
----------------------------------
    Fix Version/s: 2.0.0

> Airflow losing track of running GCP Dataproc jobs upon Airflow scheduler 
> restart
> --------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-3211
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3211
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: gcp
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Julie Chien
>            Assignee: Julie Chien
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.9.0, 1.10.0, 2.0.0
>
>
> If the Airflow scheduler restarts (say, due to a deployment, a system 
> update, or a regular machine restart such as the weekly restarts in GCP App 
> Engine) while it's running a job on GCP Dataproc, it loses track of that 
> job, marks the task as failed, and eventually retries. However, the job may 
> still be running on Dataproc and may even finish successfully. When Airflow 
> retries and resubmits the job, the same job runs twice. This can cause 
> delayed workflows, increased costs, and duplicate data. 
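>
> The root cause is that the Dataproc job id lives only in the memory of the 
> process polling the job. A minimal sketch of that submit-and-poll pattern 
> (illustrative names, not the actual DataProc*Operator code):
> {code:python}
> import time
>
> def submit_and_wait(hook, job):
>     # A fresh job id is generated on every try; it exists only in this
>     # process's memory, so a restart orphans the running job.
>     job_id = hook.submit_job(job)
>     while hook.get_job_state(job_id) not in ('DONE', 'ERROR', 'CANCELLED'):
>         time.sleep(30)  # if this process dies here, job_id is lost
>     return job_id
> {code}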
>   
>  Setup to reproduce:
>  # Set up a GCP project with the Dataproc API enabled.
>  # Install Airflow.
>  # On the machine running Airflow, run {{pip install 
> google-api-python-client oauth2client}}.
>  # Start the Airflow webserver. In the Airflow UI, go to Admin->Connections, 
> edit the {{google_cloud_default}} connection, and fill in the Project Id 
> field with your project ID (or script it as shown below).
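>
> If you prefer to script step 4 instead of using the UI, the connection can 
> be updated through Airflow's metadata DB session. A sketch, assuming the 
> extra-field key used by Airflow 1.10's GCP connection form (verify the key 
> name for your version):
> {code:python}
> from airflow import settings
> from airflow.models import Connection
>
> session = settings.Session()
> conn = (session.query(Connection)
>         .filter(Connection.conn_id == 'google_cloud_default')
>         .one())
> # Key name per the 1.10 GCP connection form; value is a placeholder.
> conn.extra = '{"extra__google_cloud_platform__project": "my-project-id"}'
> session.commit()
> {code}
>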
> To reproduce:
>  # Install this DAG in the Airflow instance: 
> [https://github.com/GoogleCloudPlatform/python-docs-samples/blob/b80895ed88ba86fce223df27a48bf481007ca708/composer/workflows/quickstart.py]. 
> Set up the Airflow variables as instructed at the top of the file (see the 
> snippet after this list).
>  # Start the Airflow scheduler and webserver if they're not running already. 
> Kick off a run of the above DAG through the Airflow UI. Wait for the cluster 
> to spin up and the job to start running on Dataproc.
>  # While the job's running, kill the scheduler. Wait 5 seconds or so, and 
> then start it back up.
>  # Airflow will retry the task. Click on the cluster in Dataproc to observe 
> that the job has been resubmitted, even though the first job is still 
> running and may even have completed without error.
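>
> For step 1, the variables can be set from Python as well as from the UI. 
> The names below are the ones read by quickstart.py at the linked commit; 
> the values are placeholders:
> {code:python}
> from airflow.models import Variable
>
> # Variable names per the comments at the top of quickstart.py.
> Variable.set('gcp_project', 'my-project-id')
> Variable.set('gce_zone', 'us-central1-a')
> Variable.set('gcs_bucket', 'gs://my-bucket')
> {code}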
>   
>  At Etsy, we've customized the Dataproc operators so that after an Airflow 
> restart the new task attempt picks up the job the old attempt submitted, 
> and we have been using this solution successfully for the past six months. 
> I will submit a PR to merge this change upstream. A sketch of the general 
> idea follows.
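>
> One way to make the retry idempotent (a sketch of the general idea, not 
> Etsy's actual patch; the hook helpers here are hypothetical wrappers over 
> the Dataproc jobs.get/jobs.submit API):
> {code:python}
> def execute_idempotent(hook, job, ti):
>     # Derive a deterministic Dataproc job id from the task instance, so a
>     # retried attempt computes the same id as the attempt it replaces.
>     # (Dataproc job ids allow only letters, digits, '-' and '_'.)
>     job_id = '{}--{}--{}'.format(
>         ti.dag_id, ti.task_id,
>         ti.execution_date.strftime('%Y%m%dT%H%M%S'))
>     if hook.get_job(job_id) is None:         # hypothetical jobs.get wrapper
>         hook.submit_job(job, job_id=job_id)  # hypothetical jobs.submit wrapper
>     # Attach to the (possibly pre-existing) job instead of resubmitting.
>     return hook.wait_for_job(job_id)
> {code}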
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
