[
https://issues.apache.org/jira/browse/AIRFLOW-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553519#comment-16553519
]
Evgeny Podlepaev commented on AIRFLOW-2774:
-------------------------------------------
[~kaxilnaik], I didn't mean testing the Dataflow job itself. My justification
is that selective use of DirectRunner would allow faster end-to-end testing of
the Airflow DAG that contains it. Even the simplest Dataflow job takes minutes
to run in the cloud (in my case, a 100-line file took more than 5 minutes to
process), so when you want to test how different operators integrate on a
subset of data, the Dataflow operator becomes a bottleneck. My current
workaround is a factory method that checks the "runner" option: if
DirectRunner is specified, it creates a BashOperator instead of a
DataFlowPythonOperator to start the job.
> DataFlowPythonOperator needs to support DirectRunner to facilitate local
> testing
> --------------------------------------------------------------------------------
>
> Key: AIRFLOW-2774
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2774
> Project: Apache Airflow
> Issue Type: Improvement
> Components: Dataflow
> Affects Versions: 1.9.0
> Reporter: Evgeny Podlepaev
> Priority: Minor
>
> DataFlowPythonOperator needs to support DirectRunner as a runner option to
> facilitate local testing of the entire pipeline. Right now, if DirectRunner is
> set via job options, the DataFlowHook will wait indefinitely trying to get the
> status of a remote job that does not exist:
> _DataflowJob(self.get_conn(), variables['project'], name,
> variables['region'], self.poll_sleep).wait_for_done()
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)