[ https://issues.apache.org/jira/browse/AIRFLOW-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julie Chien updated AIRFLOW-3211:
---------------------------------
Description:
If the Airflow scheduler restarts (say, due to deployments, system updates, or
regular machine restarts such as the weekly restarts in GCP App Engine) while
it's running a job on GCP Dataproc, it loses track of that job, marks the task
as failed, and eventually retries. However, the job may still be running on
Dataproc and may even finish successfully. So when Airflow retries and reruns
the task, the same job is submitted twice. This can result in issues like
delayed workflows, increased costs, and duplicate data.
Setup to reproduce:
# Set up a GCP project with the Dataproc API enabled.
# Install Airflow.
# In the box that's running Airflow, {{pip install google-api-python-client oauth2client}}
# Start the Airflow webserver. In the Airflow UI, go to Admin->Connections,
edit the {{google_cloud_default}} connection, and fill in the Project Id field
with your project ID (or set it from code; see the sketch after this list).
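
For reference, here is a minimal sketch of filling in the Project Id without the UI, assuming Airflow 1.10's metadata models; the {{extra__google_cloud_platform__project}} key is how the GCP connection form stores the Project Id field in its JSON extras (treat the key name and the project id below as assumptions/placeholders):
{code:python}
# Sketch: set the Project Id on the default GCP connection from code.
import json

from airflow import settings
from airflow.models import Connection

PROJECT_ID = "my-gcp-project"  # placeholder: your GCP project id

session = settings.Session()
conn = (
    session.query(Connection)
    .filter(Connection.conn_id == "google_cloud_default")
    .one()
)
# The GCP connection keeps form fields as JSON in "extra".
conn.extra = json.dumps({"extra__google_cloud_platform__project": PROJECT_ID})
session.commit()
session.close()
{code}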
To reproduce:
# Install this DAG in the Airflow instance:
[https://github.com/GoogleCloudPlatform/python-docs-samples/blob/b80895ed88ba86fce223df27a48bf481007ca708/composer/workflows/quickstart.py]
Set up the Airflow variables as instructed at the top of the file.
# Start the Airflow scheduler and webserver if they're not running already.
Kick off a run of the above DAG through the Airflow UI. Wait for the cluster to
spin up and the job to start running on Dataproc.
# While the job's running, kill the scheduler. Wait 5 seconds or so, and then
start it back up.
# Airflow will retry the task. Click on the cluster in Dataproc to observe
that the job has been resubmitted, even though the first job is still
running and may have even completed without error (a verification sketch
follows this list).
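
To double-check the duplicate submission outside the UI, a small script against the Dataproc API (using the google-api-python-client and oauth2client installed in the setup) can list the jobs on the cluster; the project id, region, and cluster name below are placeholders for the values your DAG uses:
{code:python}
# Sketch: list the jobs on the quickstart cluster to see the duplicate.
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

PROJECT_ID = "my-gcp-project"        # placeholder
REGION = "global"                    # placeholder: region used by the DAG
CLUSTER_NAME = "quickstart-cluster"  # placeholder: cluster created by the DAG

credentials = GoogleCredentials.get_application_default()
dataproc = build("dataproc", "v1", credentials=credentials)

result = (
    dataproc.projects()
    .regions()
    .jobs()
    .list(projectId=PROJECT_ID, region=REGION, clusterName=CLUSTER_NAME)
    .execute()
)

# After the retry, the same workload shows up twice: the original submission
# (RUNNING or DONE) plus the one resubmitted by the retried task.
for job in result.get("jobs", []):
    print(job["reference"]["jobId"], job["status"]["state"])
{code}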
At Etsy, we've customized the Dataproc operators so that, after an Airflow
restart, the new Airflow task picks up where the old one left off, and we have
been using this solution successfully for the past 6 months. I will submit a PR
to merge this change upstream.
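
The general shape of the workaround is sketched below (a rough sketch only, not necessarily what the PR will look like): pin a deterministic Dataproc job id per task instance, and on execute reattach to an existing job with that id instead of submitting a second copy.
{code:python}
# Sketch of the recovery idea: deterministic job id + reattach before resubmit.
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from oauth2client.client import GoogleCredentials


def submit_or_reattach(project_id, region, job_body, job_id):
    """job_id should be deterministic per task instance (e.g. built from
    dag_id, task_id, and execution_date) so a retried task can find the job
    submitted by its previous attempt."""
    credentials = GoogleCredentials.get_application_default()
    dataproc = build("dataproc", "v1", credentials=credentials)
    jobs = dataproc.projects().regions().jobs()

    try:
        # A previous attempt (before the scheduler restart) may already own this id.
        existing = jobs.get(projectId=project_id, region=region,
                            jobId=job_id).execute()
        print("Reattaching to job %s in state %s"
              % (job_id, existing["status"]["state"]))
        return existing
    except HttpError as err:
        if err.resp.status != 404:
            raise

    # No prior job with this id: submit it, pinning the id so a later retry
    # can find it again and poll it instead of rerunning it.
    job_body.setdefault("reference", {})["jobId"] = job_id
    return jobs.submit(
        projectId=project_id, region=region, body={"job": job_body}
    ).execute()
{code}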
> Airflow losing track of running GCP Dataproc jobs upon Airflow scheduler
> restart
> --------------------------------------------------------------------------------
>
> Key: AIRFLOW-3211
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3211
> Project: Apache Airflow
> Issue Type: Improvement
> Components: gcp
> Affects Versions: 1.9.0, 1.10.0
> Reporter: Julie Chien
> Assignee: Julie Chien
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.9.0, 1.10.0
>