TL;DR Is there any recommended way to lazily load input for Airflow operators?


I could not find a way to do this. While I ran into this limitation while using 
the Databricks operator, it seems other operators may lack such functionality 
as well. Please keep reading for more details.


---


When instantiating a DatabricksSubmitRunOperator 
(https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/databricks_operator.py),
users need to pass the description of the job that will later be executed on 
Databricks.

The job description is only needed at execution time (when the hook is called). 
However, the json parameter must already contain the full job description when 
the operator is constructed. This is a problem if computing the job description 
requires expensive operations (e.g., querying a database): the expensive 
operation will be invoked every single time the DAG file is reprocessed (which 
can happen quite frequently).
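To make the cost concrete, here is a minimal pure-Python sketch of the problem (Airflow itself is not imported; `EagerOperator` and `expensive_job_description` are hypothetical stand-ins for DatabricksSubmitRunOperator and a database query):

```python
parse_count = {"n": 0}

def expensive_job_description():
    # Stand-in for an expensive call (e.g., querying a database).
    parse_count["n"] += 1
    return {"notebook_task": {"notebook_path": "/jobs/etl"}}

class EagerOperator:
    # Mimics the current behavior: json is fixed at construction time.
    def __init__(self, json):
        self.json = json

# The scheduler re-parses the DAG file repeatedly; each parse
# re-instantiates the operator, so the expensive call runs every time.
for _ in range(3):
    op = EagerOperator(json=expensive_job_description())

print(parse_count["n"])  # one expensive call per simulated DAG parse
```

The loop stands in for the scheduler's repeated DAG-file parsing; none of the three expensive calls correspond to an actual task execution.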

It would be good to have a mechanism equivalent to the python_callable 
parameter of the PythonOperator. That way, users could pass a function that 
generates the job description only when the operator is actually executed. 
I discussed this with Andrew Chen (from Databricks), and he agrees it would be 
an interesting feature to add.
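A sketch of what such a mechanism could look like, again as plain Python rather than a real Airflow operator (`LazySubmitRunOperator` and its `json_callable` parameter are hypothetical names, not an existing API):

```python
class LazySubmitRunOperator:
    """Hypothetical operator variant that defers building the job
    description until execute() is called, python_callable-style."""

    def __init__(self, json=None, json_callable=None):
        self.json = json
        self.json_callable = json_callable

    def execute(self, context=None):
        # Resolve the job description only when the task actually runs;
        # a real operator would then hand it to the Databricks hook.
        if self.json_callable is not None:
            return self.json_callable()
        return self.json

calls = []

def build_json():
    calls.append(1)  # stands in for an expensive database query
    return {"spark_jar_task": {"main_class_name": "com.example.Job"}}

op = LazySubmitRunOperator(json_callable=build_json)
print(len(calls))  # 0: nothing expensive happens at DAG-parse time
```

The expensive callable runs only inside execute(), so repeated DAG parses construct the operator cheaply.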


Does this sound reasonable? Is this use case supported in some way that I am 
unaware of?


You can find the issue I created here: 
https://issues.apache.org/jira/projects/AIRFLOW/issues/AIRFLOW-2964
