[ 
https://issues.apache.org/jira/browse/AIRFLOW-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jessica Laughlin updated AIRFLOW-2009:
--------------------------------------
    Description: 
We have been using the DataFlowOperator to schedule DataFlow jobs.

We found that the DataFlowHook used by the DataFlowOperator uses the passed 
`gcp_conn_id` only to read the results afterwards, not to schedule the DataFlow 
job itself. 

Relevant code 
(https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/gcp_dataflow_hook.py#L158):
        _Dataflow(cmd).wait_for_done()
        _DataflowJob(self.get_conn(), variables['project'],
                     name, self.poll_sleep).wait_for_done()

The first line here, the _Dataflow(cmd) call, should also be using 
self.get_conn(). 

For this reason, our tasks using the DataFlowOperator have actually been using 
the default Google Compute Engine service account (which has DataFlow 
permissions) to schedule DataFlow jobs. We only see a permissions error when 
the second line uses our provided service account (which does not have 
DataFlow permissions). 

I would like to fix this bug, but have to work around it at the moment due to 
time constraints. 
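
One possible shape for a workaround is sketched below. This is only an
illustration, not the Airflow fix: it assumes the job is launched as a child
process and that the launched SDK resolves Application Default Credentials from
the GOOGLE_APPLICATION_CREDENTIALS environment variable. The helper names
build_launch_env and launch, and the key file path, are hypothetical.

```python
import os
import subprocess


def build_launch_env(key_path, base_env=None):
    """Return a copy of the environment that points at a service account key.

    key_path is a hypothetical path to a service account JSON key file.
    The parent environment (os.environ) is copied, never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the launched process uses Application Default Credentials,
    # which consult this variable for the key file location.
    env["GOOGLE_APPLICATION_CREDENTIALS"] = key_path
    return env


def launch(cmd, key_path):
    """Run cmd as a child process with the given key file in its environment.

    Returns (returncode, stdout_bytes, stderr_bytes).
    """
    proc = subprocess.Popen(
        cmd,
        env=build_launch_env(key_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    return proc.returncode, out, err
```

For example, launch(['python', 'pipeline.py'], '/path/to/key.json') would run a
hypothetical pipeline.py with that key file in its environment, so the launch
step and the polling step could be made to use the same service account.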


> DataFlowHook does not use correct service account
> -------------------------------------------------
>
>                 Key: AIRFLOW-2009
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2009
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: Dataflow, hooks
>    Affects Versions: Airflow 2.0
>            Reporter: Jessica Laughlin
>            Priority: Major
>
> We have been using the DataFlowOperator to schedule DataFlow jobs.
> We found that the DataFlowHook used by the DataFlowOperator uses the passed 
> `gcp_conn_id` only to read the results afterwards, not to schedule the 
> DataFlow job itself. 
> Relevant code 
> (https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/gcp_dataflow_hook.py#L158):
>         _Dataflow(cmd).wait_for_done()
>         _DataflowJob(self.get_conn(), variables['project'],
>                      name, self.poll_sleep).wait_for_done()
> The first line here, the _Dataflow(cmd) call, should also be using 
> self.get_conn(). 
> For this reason, our tasks using the DataFlowOperator have actually been 
> using the default Google Compute Engine service account (which has DataFlow 
> permissions) to schedule DataFlow jobs. We only see a permissions error when 
> the second line uses our provided service account (which does not have 
> DataFlow permissions). 
> I would like to fix this bug, but have to work around it at the moment due to 
> time constraints. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
