[ 
https://issues.apache.org/jira/browse/AIRFLOW-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Chien updated AIRFLOW-3211:
---------------------------------
    Description: 
If the Airflow scheduler restarts (say, due to deployments, system updates, or 
regular machine restarts such as the weekly restarts in GCP App Engine) while 
it's running a job on GCP Dataproc, it loses track of that job, marks the task 
as failed, and eventually retries it. However, the job may still be running on 
Dataproc and may even finish successfully, so when Airflow retries and 
resubmits it, the same job runs twice. This can result in issues like delayed 
workflows, increased costs, and duplicate data. 
  
 Setup to reproduce:
 # Set up a GCP Project with the Dataproc API enabled
 # Install Airflow.
 # In the box that's running Airflow, {{pip install google-api-python-client oauth2client}}
 # Start the Airflow webserver. In the Airflow UI, go to Admin -> Connections, 
edit the {{google_cloud_default}} connection, and fill in the Project Id field 
with your project ID. (A scripted alternative is sketched just after this list.)
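
The connection setup in the last step can also be done from a script instead of 
the UI. This is only an illustrative sketch, assuming Airflow 1.10's 
{{airflow.models.Connection}} ORM model and the 
{{extra__google_cloud_platform__project}} extras key the GCP hooks read the 
project ID from; {{my-project-id}} is a placeholder.

{code:python}
# Sketch only: create or update the google_cloud_default connection
# programmatically. Assumes Airflow 1.10's ORM models; 'my-project-id' is a
# placeholder for your GCP project ID.
import json

from airflow import models, settings

session = settings.Session()
conn = (session.query(models.Connection)
        .filter(models.Connection.conn_id == 'google_cloud_default')
        .first())
if conn is None:
    conn = models.Connection(conn_id='google_cloud_default',
                             conn_type='google_cloud_platform')
    session.add(conn)
# The GCP hooks read the project ID from this extras field.
conn.extra = json.dumps({'extra__google_cloud_platform__project': 'my-project-id'})
session.commit()
{code}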

To reproduce:
 # Install this DAG in the Airflow instance: 
[https://github.com/GoogleCloudPlatform/python-docs-samples/blob/b80895ed88ba86fce223df27a48bf481007ca708/composer/workflows/quickstart.py]
 Set up the Airflow variables as instructed at the top of the file.
 # Start the Airflow scheduler and webserver if they're not running already. 
Kick off a run of the above DAG through the Airflow UI. Wait for the cluster to 
spin up and the job to start running on Dataproc.
 # While the job's running, kill the scheduler. Wait 5 seconds or so, and then 
start it back up.
 # Airflow will retry the task. Click on the cluster in Dataproc to observe 
that the job has been resubmitted, even though the first job is still running 
and may even have completed without error. (The cluster's jobs can also be 
listed programmatically, as sketched below.)
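
For anyone reproducing this without console access, here is a minimal sketch of 
listing the cluster's jobs with the same google-api-python-client installed in 
the setup steps. The project ID, region, and cluster name are placeholders; use 
whatever the quickstart DAG actually created.

{code:python}
# Sketch only: list the Dataproc jobs attached to the cluster to confirm the
# duplicate submission. Project, region, and cluster name are placeholders.
from googleapiclient.discovery import build

dataproc = build('dataproc', 'v1')
result = dataproc.projects().regions().jobs().list(
    projectId='my-project-id',
    region='global',
    clusterName='my-quickstart-cluster').execute()
for job in result.get('jobs', []):
    print(job['reference']['jobId'], job['status']['state'])
{code}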
  
 At Etsy, we've customized the Dataproc operators so that after an Airflow 
restart the new task attempt picks up the job the previous attempt submitted 
rather than resubmitting it, and we've been running this solution successfully 
for the past 6 months. I will submit a PR to merge this change upstream.
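
To make the idea concrete, here is a rough sketch of the reattach logic. It is 
not the actual Etsy patch or the eventual PR; it assumes the job is submitted 
with a deterministic job ID (derived, say, from the DAG ID, task ID, and 
execution date) so that a retry can find the job its predecessor submitted, and 
all names below are hypothetical.

{code:python}
# Sketch only -- not the actual patch. Reattach to an existing Dataproc job with
# the same deterministic job ID instead of submitting a duplicate, then poll it
# to completion.
import time

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError


def submit_or_reattach(project_id, region, cluster_name, job_id, pyspark_job):
    dataproc = build('dataproc', 'v1')
    jobs = dataproc.projects().regions().jobs()
    try:
        # A previous task attempt may already have submitted this job ID.
        job = jobs.get(projectId=project_id, region=region, jobId=job_id).execute()
    except HttpError as err:
        if err.resp.status != 404:
            raise
        # No earlier submission found, so submit a fresh job.
        job = jobs.submit(projectId=project_id, region=region, body={
            'job': {
                'reference': {'projectId': project_id, 'jobId': job_id},
                'placement': {'clusterName': cluster_name},
                'pysparkJob': pyspark_job,
            },
        }).execute()
    # Poll until the job reaches a terminal state.
    while job['status']['state'] not in ('DONE', 'ERROR', 'CANCELLED'):
        time.sleep(30)
        job = jobs.get(projectId=project_id, region=region, jobId=job_id).execute()
    return job
{code}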
  

> Airflow losing track of running GCP Dataproc jobs upon Airflow scheduler 
> restart
> --------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-3211
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3211
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: gcp
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Julie Chien
>            Assignee: Julie Chien
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.9.0, 1.10.0
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
