Jessica Laughlin created AIRFLOW-2009:
-----------------------------------------
Summary: DataFlowHook does not use correct service account
Key: AIRFLOW-2009
URL: https://issues.apache.org/jira/browse/AIRFLOW-2009
Project: Apache Airflow
Issue Type: Bug
Components: Dataflow, hooks
Affects Versions: Airflow 2.0
Reporter: Jessica Laughlin
We have been using the DataFlowOperator to schedule DataFlow jobs.
We found that the DataFlowHook used by the DataFlowOperator does not use the
passed `gcp_conn_id` to schedule the DataFlow job; it uses it only to read the
results afterwards.
Relevant code
(https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/gcp_dataflow_hook.py#L158):

    _Dataflow(cmd).wait_for_done()
    _DataflowJob(self.get_conn(), variables['project'],
                 name, self.poll_sleep).wait_for_done()

The first line here, which launches the job, should also be using the
credentials from self.get_conn().
For this reason, our tasks using the DataFlowOperator have actually been using
the default Google Compute Engine service account (which has DataFlow
permissions) to schedule DataFlow jobs. It is only when our provided service
account (which does not have DataFlow permissions) is used in the second line
that we are seeing a permissions error.
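
To illustrate the behaviour described above, here is a minimal sketch of one way the launching subprocess could be pointed at a specific service account instead of the ambient default. It is not Airflow code: `build_launch_env` and `launch` are hypothetical helpers, and the sketch assumes the connection is backed by a key file on disk. `GOOGLE_APPLICATION_CREDENTIALS` is the environment variable read by Google's Application Default Credentials; whether the launched Dataflow command honours it depends on the SDK in use.

```python
import os
import subprocess
import sys


def build_launch_env(key_path=None, base_env=None):
    """Return the environment for the job-launching subprocess.

    Hypothetical helper. If key_path is given, the child process is told
    to use that service account key; otherwise the environment is passed
    through unchanged and the child falls back to whatever default
    credentials are available on the machine.
    """
    env = dict(os.environ if base_env is None else base_env)
    if key_path:
        env["GOOGLE_APPLICATION_CREDENTIALS"] = key_path
    return env


def launch(cmd, key_path=None):
    """Run cmd with the credentials environment and return its stdout."""
    result = subprocess.run(
        cmd,
        env=build_launch_env(key_path),
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for the real Dataflow launch command: the child just
    # reports which key file it was told to use.
    probe = [
        sys.executable,
        "-c",
        "import os; print(os.environ.get('GOOGLE_APPLICATION_CREDENTIALS', ''))",
    ]
    print(launch(probe, key_path="/path/to/service-account.json"))
```

Without something along these lines, a subprocess started with a plain command inherits the worker's environment, which matches what we observe: the job is scheduled as the default Compute Engine service account.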
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)