[ https://issues.apache.org/jira/browse/AIRFLOW-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jarek Potiuk updated AIRFLOW-3211:
----------------------------------
    Fix Version/s: 2.0.0

> Airflow losing track of running GCP Dataproc jobs upon Airflow scheduler restart
> --------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-3211
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3211
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: gcp
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Julie Chien
>            Assignee: Julie Chien
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.9.0, 1.10.0, 2.0.0
>
> If the Airflow scheduler restarts (say, due to deployments, system updates, or regular machine restarts such as the weekly restarts in GCP App Engine) while it is running a job on GCP Dataproc, it loses track of that job, marks the task as failed, and eventually retries. However, the job may still be running on Dataproc and may even finish successfully. So when Airflow retries and reruns the job, the same job runs twice. This can result in issues such as delayed workflows, increased costs, and duplicate data.
>
> Setup to reproduce:
> # Set up a GCP project with the Dataproc API enabled.
> # Install Airflow.
> # On the box that is running Airflow, {{pip install google-api-python-client oauth2client}}.
> # Start the Airflow webserver. In the Airflow UI, go to Admin -> Connections, edit the {{google_cloud_default}} connection, and fill in the Project Id field with your project ID.
>
> To reproduce:
> # Install this DAG in the Airflow instance: [https://github.com/GoogleCloudPlatform/python-docs-samples/blob/b80895ed88ba86fce223df27a48bf481007ca708/composer/workflows/quickstart.py] Set up the Airflow variables as instructed at the top of the file.
> # Start the Airflow scheduler and webserver if they are not already running. Kick off a run of the above DAG through the Airflow UI. Wait for the cluster to spin up and the job to start running on Dataproc.
> # While the job is running, kill the scheduler. Wait 5 seconds or so, then start it back up.
> # Airflow will retry the task. Click on the cluster in Dataproc to observe that the job has been resubmitted, even though the first job is still running and may have even completed without error.
>
> At Etsy, we've customized the Dataproc operators to allow the new Airflow task to pick up where the old one left off after an Airflow restart, and we have been successfully using this solution for the past 6 months. I will submit a PR to merge this change upstream.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
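
Editorial note: the "pick up where the old one left off" approach described in the issue could be sketched roughly as follows. This is a minimal illustration, not the actual Etsy patch or the merged PR: it assumes the operator submits each Dataproc job under a job id derived deterministically from the Airflow task instance, so that a retried attempt can find and reattach to a job that is still running instead of resubmitting it. The function names (`deterministic_job_id`, `submit_or_reattach`) are hypothetical; the REST calls use the `google-api-python-client` library that the setup steps install.

```python
import hashlib


def deterministic_job_id(dag_id, task_id, execution_date):
    """Derive a stable Dataproc job id from the Airflow task instance
    (illustrative scheme), so every retry of the same task instance
    maps to the same Dataproc job id."""
    raw = "{}__{}__{}".format(dag_id, task_id, execution_date)
    return "airflow-" + hashlib.md5(raw.encode("utf-8")).hexdigest()


def submit_or_reattach(project_id, region, job_spec, job_id):
    """Submit the job only if no job with this id exists yet; otherwise
    return the job already known to Dataproc so the caller can poll it
    to completion instead of running it a second time."""
    # Imported lazily so the sketch has no hard dependency at import time.
    from googleapiclient import discovery
    from googleapiclient.errors import HttpError

    dataproc = discovery.build("dataproc", "v1")
    jobs = dataproc.projects().regions().jobs()
    try:
        # If a previous task attempt already submitted under this id,
        # jobs.get succeeds and we reattach to the existing job.
        return jobs.get(projectId=project_id, region=region,
                        jobId=job_id).execute()
    except HttpError as err:
        if err.resp.status != 404:
            raise
        # No such job yet: submit it under the deterministic id.
        job_spec["reference"] = {"projectId": project_id, "jobId": job_id}
        return jobs.submit(projectId=project_id, region=region,
                           body={"job": job_spec}).execute()
```

Because the job id is a pure function of the task instance, a scheduler restart changes nothing: the retried task computes the same id, finds the job still running on Dataproc, and waits on it rather than duplicating it.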