[ 
https://issues.apache.org/jira/browse/AIRFLOW-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553519#comment-16553519
 ] 

Evgeny Podlepaev commented on AIRFLOW-2774:
-------------------------------------------

[~kaxilnaik],  I didn't mean testing the DataFlow job itself. My justification 
is that selective use of DirectRunner would allow faster end-to-end testing of 
the Airflow dag containing it. Running even the simplest DataFlow job in the 
cloud takes minutes - in my case, a 100 line file would be processed in 5+ 
minutes - so when you want to test the integration of different operators on a 
subset of data the DataFlow operator becomes a bottleneck. My current 
workaround is to use a factory method where I check the "runner" option, and if 
a DirectRunner was specified, I use a BashOperator instead of the 
DataFlowPythonOperator to start the job.

> DataFlowPythonOperator needs to support DirectRunner to facilitate local 
> testing
> --------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-2774
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2774
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: Dataflow
>    Affects Versions: 1.9.0
>            Reporter: Evgeny Podlepaev
>            Priority: Minor
>
> **DataFlowPythonOperator needs to support DirectRunner as a runner option to 
> facilitate local testing of the entire pipeline. Right now if DirectRunner is 
> set via job options, the DataFlowHook will wait infinitely trying to get 
> status of the remote job which does not exist:
> _DataflowJob(self.get_conn(), variables['project'], name,
> variables['region'], self.poll_sleep).wait_for_done()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to