[
https://issues.apache.org/jira/browse/AIRFLOW-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julie Chien updated AIRFLOW-3211:
---------------------------------
Summary: Airflow losing track of running GCP Dataproc jobs upon Airflow
scheduler restart (was: Airflow losing track of running GCP Dataproc jobs upon
Airflow restart)
> Airflow losing track of running GCP Dataproc jobs upon Airflow scheduler
> restart
> --------------------------------------------------------------------------------
>
> Key: AIRFLOW-3211
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3211
> Project: Apache Airflow
> Issue Type: Improvement
> Components: gcp
> Affects Versions: 1.9.0, 1.10.0
> Reporter: Julie Chien
> Assignee: Julie Chien
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.9.0, 1.10.0
>
>
> If Airflow restarts (say, due to deployments, system updates, or regular
> machine restarts such as the weekly restarts in GCP App Engine) while it's
> running a job on GCP Dataproc, it loses track of that job, marks the task as
> failed, and eventually retries it. However, the job may still be running on
> Dataproc and may even finish successfully. When Airflow retries, it
> resubmits the job, so the same job runs twice. This can result in delayed
> workflows, increased costs, and duplicate data.
>
> Setup to reproduce:
> # Set up a GCP Project with the Dataproc API enabled
> # Install Airflow.
> # In the box that's running Airflow, run
> {{pip install google-api-python-client oauth2client}}
> # Start the Airflow webserver. In the Airflow UI, go to Admin->Connections,
> edit the {{google_cloud_default}} connection, and fill in the Project Id
> field with your project ID. (A quick way to verify this setup is sketched
> after these steps.)
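>
> As a quick sanity check of the setup above (not part of the original
> instructions; the project ID below is a placeholder), a few lines with the
> just-installed clients can confirm that the Dataproc API and credentials
> work:
> {code:python}
> # Verify Dataproc API access using application-default credentials.
> from oauth2client.client import GoogleCredentials
> from googleapiclient import discovery
>
> credentials = GoogleCredentials.get_application_default()
> dataproc = discovery.build('dataproc', 'v1', credentials=credentials)
>
> # An empty response is fine; an HttpError means the API is not enabled
> # or the credentials are not set up correctly.
> clusters = dataproc.projects().regions().clusters().list(
>     projectId='your-project-id', region='global').execute()
> print(clusters.get('clusters', []))
> {code}
>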
> To reproduce:
> # Install this DAG in the Airflow instance (a sketch of its shape follows
> these steps):
> [https://github.com/GoogleCloudPlatform/python-docs-samples/blob/b80895ed88ba86fce223df27a48bf481007ca708/composer/workflows/quickstart.py]
> Set up the Airflow variables as instructed at the top of the file.
> # Start the Airflow scheduler and webserver if they're not running already.
> Kick off a run of the above DAG through the Airflow UI. Wait for the cluster
> to spin up and the job to start running on Dataproc.
> # While the job's running, kill the scheduler. Wait 5 seconds or so, and
> then start it back up.
> # Airflow will retry the task. Click on the cluster in Dataproc and observe
> that the job has been resubmitted, even though the first job is still
> running and may even have completed without error.
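>
> For reference, the quickstart DAG linked above is roughly of the following
> shape (an illustrative sketch, not the file's exact contents; the cluster
> name, zone, bucket, and project are placeholders):
> {code:python}
> # Sketch: create a Dataproc cluster, run a Hadoop job, delete the cluster.
> import datetime
>
> from airflow import models
> from airflow.contrib.operators import dataproc_operator
> from airflow.utils.trigger_rule import TriggerRule
>
> with models.DAG(
>         'dataproc_restart_repro',
>         start_date=datetime.datetime(2018, 1, 1),
>         schedule_interval=None) as dag:
>
>     create_cluster = dataproc_operator.DataprocClusterCreateOperator(
>         task_id='create_cluster',
>         project_id='your-project-id',
>         cluster_name='repro-cluster',
>         num_workers=2,
>         zone='us-central1-a')
>
>     # This is the task that resubmits its job after a scheduler restart.
>     run_job = dataproc_operator.DataProcHadoopOperator(
>         task_id='run_wordcount',
>         main_jar='file:///usr/lib/hadoop-mapreduce/'
>                  'hadoop-mapreduce-examples.jar',
>         arguments=['wordcount', 'gs://pub/shakespeare/rose.txt',
>                    'gs://your-bucket/wordcount-out'],
>         cluster_name='repro-cluster')
>
>     delete_cluster = dataproc_operator.DataprocClusterDeleteOperator(
>         task_id='delete_cluster',
>         project_id='your-project-id',
>         cluster_name='repro-cluster',
>         trigger_rule=TriggerRule.ALL_DONE)
>
>     create_cluster >> run_job >> delete_cluster
> {code}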
>
> At Etsy, we've customized the Dataproc operators so that, after an Airflow
> restart, the retried task picks up the job its previous attempt submitted
> instead of resubmitting it. We have been running this solution successfully
> for the past 6 months, and I will submit a PR to merge the change upstream.
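>
> In outline, the change makes the Dataproc job ID a deterministic function
> of the Airflow task, so a retried task can look up the job its previous
> attempt submitted and attach to it instead of resubmitting. A minimal
> sketch of the idea (a hypothetical helper with made-up names, not the
> actual patch):
> {code:python}
> from googleapiclient import discovery
> from googleapiclient.errors import HttpError
>
> def submit_or_attach(project_id, region, job, task_id, run_id):
>     """Submit a Dataproc job, or attach to one a prior attempt started."""
>     dataproc = discovery.build('dataproc', 'v1')
>     jobs_api = dataproc.projects().regions().jobs()
>
>     # Deterministic ID: the same (task, run) always maps to the same job,
>     # so the job survives scheduler restarts and task retries. Strip
>     # characters Dataproc job IDs disallow.
>     job_id = ('%s-%s' % (task_id, run_id)).replace(':', '-').replace('.', '-')
>
>     try:
>         existing = jobs_api.get(projectId=project_id, region=region,
>                                 jobId=job_id).execute()
>         if existing['status']['state'] not in ('ERROR', 'CANCELLED'):
>             # A live (or already finished) job from a previous attempt.
>             return existing
>     except HttpError as e:
>         if e.resp.status != 404:  # 404 just means "never submitted".
>             raise
>
>     job['reference'] = {'projectId': project_id, 'jobId': job_id}
>     return jobs_api.submit(projectId=project_id, region=region,
>                            body={'job': job}).execute()
> {code}
> The operator can then poll this job to completion as usual; because the ID
> is stable across retries, a restart no longer produces a duplicate
> submission.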
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)