[ https://issues.apache.org/jira/browse/AIRFLOW-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037342#comment-17037342 ]

Andrey Klochkov commented on AIRFLOW-3211:
------------------------------------------

The functionality added by this story actually broke the behavior of the 
Dataproc hook and made several 1.10.x releases unusable for Dataproc users. The 
problem is that the hook uses only the task ID part of the Dataproc job ID when 
looking for previous invocations of the job, so if the Dataproc job history 
still contains jobs from any previous DAG run, the hook does not execute the 
job at all. 

A proper way to implement this would be to associate Dataproc jobs with 
particular DAG runs, e.g. by embedding a hash of the DAG run ID in the 
Dataproc job ID. 
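For illustration, a minimal sketch of that idea (the helper name and ID format are my own, not the hook's actual API):

```python
import hashlib

def build_dataproc_job_id(task_id: str, dag_run_id: str, hash_len: int = 8) -> str:
    """Build a Dataproc job ID that is unique per DAG run.

    Embedding a short hash of the run ID means a "previous invocation"
    lookup only matches jobs submitted by the *same* DAG run, so a job
    left over from an earlier run is never mistaken for this run's job.
    """
    run_hash = hashlib.sha256(dag_run_id.encode("utf-8")).hexdigest()[:hash_len]
    # Dataproc job IDs are restricted to letters, digits, hyphens and
    # underscores, so a hex digest suffix is safe to append.
    return f"{task_id}_{run_hash}"
```

Same task and same run always yield the same ID (so the hook can safely re-attach after a scheduler restart), while a new run yields a new ID (so the job is re-executed as expected).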

In any case, this functionality has to be optional. In our experience, users 
expect Dataproc jobs to be re-executed when they re-execute the task, and the 
new behavior creates a lot of confusion.
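A rough sketch of what keeping the behavior opt-in could look like (the client class and the flag name are purely illustrative, not Airflow's real Dataproc hook API):

```python
class SketchDataprocClient:
    """Stand-in for the Dataproc API; records every actual submission."""

    def __init__(self):
        self.submissions = []

    def submit(self, job_id):
        self.submissions.append(job_id)
        return job_id


def submit_job(client, job_id, reattach_on_restart=False):
    """Submit a Dataproc job, reattaching to a prior one only when opted in.

    Default (False): always resubmit, which is what users expect when they
    re-execute a task. True: if this job ID is already in the history,
    resume tracking it instead of submitting a duplicate.
    """
    if reattach_on_restart and job_id in client.submissions:
        return job_id  # resume tracking the earlier submission
    return client.submit(job_id)
```

With the flag off, a re-executed task submits a fresh job; with it on, the restarted scheduler picks the running job back up instead of launching a second copy.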

> Airflow losing track of running GCP Dataproc jobs upon Airflow scheduler 
> restart
> --------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-3211
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3211
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: gcp
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Julie Chien
>            Assignee: Julie Chien
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.10.4
>
>
> If the Airflow scheduler restarts (say, due to deployments, system updates, 
> or regular machine restarts such as the weekly restarts in GCP App Engine) 
> while it's running a job on GCP Dataproc, it will lose track of that job, 
> mark the task as failed, and eventually retry. However, the job may still be 
> running on Dataproc and may even finish successfully. So when Airflow 
> retries and reruns the task, the same job runs twice. This can result in 
> issues like delayed workflows, increased costs, and duplicate data. 
>   
>  Setup to reproduce:
>  # Set up a GCP Project with the Dataproc API enabled
>  # Install Airflow.
>  # In the box that's running Airflow, {{pip install google-api-python-client oauth2client}}
>  # Start the Airflow webserver. In the Airflow UI, Go to Admin->Connections, 
> edit the {{google_cloud_default}} connection, and fill in the Project Id 
> field with your project ID.
> To reproduce:
>  # Install this DAG in the Airflow instance: 
> [https://github.com/GoogleCloudPlatform/python-docs-samples/blob/b80895ed88ba86fce223df27a48bf481007ca708/composer/workflows/quickstart.py]
>  Set up the Airflow variables as instructed at the top of the file.
>  # Start the Airflow scheduler and webserver if they're not running already. 
> Kick off a run of the above DAG through the Airflow UI. Wait for the cluster 
> to spin up and the job to start running on Dataproc.
>  # While the job's running, kill the scheduler. Wait 5 seconds or so, and 
> then start it back up.
>  # Airflow will retry the task. Click on the cluster in Dataproc to observe 
> that the job will have been resubmitted, even though the first job is still 
> running and may have even completed without error.
>   
>  At Etsy, we've customized the Dataproc operators to allow for the new 
> Airflow task to pick up where the old one left off upon Airflow restarts, and 
> have been successfully using our solution for the past 6 months. I will 
> submit a PR to merge this change upstream.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
