This is an automated email from the ASF dual-hosted git repository.

potiuk pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow.git


The following commit(s) were added to refs/heads/main by this push:
     new ba20baeafd Replace numpy example with practical exercise demonstrating top-level code (#35097)
ba20baeafd is described below

commit ba20baeafd5e28c164c37a837337b501bf8cde3f
Author: Ryan Hatter <[email protected]>
AuthorDate: Fri Jan 5 12:26:28 2024 -0500

    Replace numpy example with practical exercise demonstrating top-level code (#35097)

    * Replace numpy example with a practical exercise demonstrating top-level code
---
 docs/apache-airflow/best-practices.rst | 74 ++++++++++++++++++++++++----------
 1 file changed, 53 insertions(+), 21 deletions(-)

diff --git a/docs/apache-airflow/best-practices.rst b/docs/apache-airflow/best-practices.rst
index a0c277f4a9..e166c10247 100644
--- a/docs/apache-airflow/best-practices.rst
+++ b/docs/apache-airflow/best-practices.rst
@@ -115,10 +115,10 @@ One of the important factors impacting DAG loading time, that might be overlooke
 that top-level imports might take surprisingly a lot of time and they can generate a lot of overhead
 and this can be easily avoided by converting them to local imports inside Python callables for example.
 
-Consider the example below - the first DAG will parse significantly slower (in the orders of seconds)
-than equivalent DAG where the ``numpy`` module is imported as local import in the callable.
+Consider the two examples below. In the first example, the DAG will take an additional 1000 seconds to parse
+compared to the functionally equivalent DAG in the second example, where ``expensive_api_call`` is executed in the context of its task.
 
-Bad example:
+Not avoiding top-level DAG code:
 
 .. code-block:: python
 
@@ -127,7 +127,13 @@ Bad example:
   from airflow import DAG
   from airflow.decorators import task
 
-  import numpy as np  # <-- THIS IS A VERY BAD IDEA! DON'T DO THAT!
+
+  def expensive_api_call():
+      print("Hello from Airflow!")
+      sleep(1000)
+
+
+  my_expensive_response = expensive_api_call()
 
   with DAG(
       dag_id="example_python_operator",
@@ -138,15 +144,10 @@ Bad example:
   ) as dag:
 
       @task()
-      def print_array():
-          """Print Numpy array."""
-          a = np.arange(15).reshape(3, 5)
-          print(a)
-          return a
-
-      print_array()
+      def print_expensive_api_call():
+          print(my_expensive_response)
 
-Good example:
+Avoiding top-level DAG code:
 
 .. code-block:: python
 
@@ -155,6 +156,12 @@ Good example:
   from airflow import DAG
   from airflow.decorators import task
 
+
+  def expensive_api_call():
+      sleep(1000)
+      return "Hello from Airflow!"
+
+
   with DAG(
       dag_id="example_python_operator",
       schedule=None,
@@ -164,19 +171,44 @@ Good example:
   ) as dag:
 
       @task()
-      def print_array():
-          """Print Numpy array."""
-          import numpy as np  # <- THIS IS HOW NUMPY SHOULD BE IMPORTED IN THIS CASE!
+      def print_expensive_api_call():
+          my_expensive_response = expensive_api_call()
+          print(my_expensive_response)
+
+In the first example, ``expensive_api_call`` is executed each time the DAG file is parsed, which results in suboptimal DAG file processing performance. In the second example, ``expensive_api_call`` is only called when the task is running, so the DAG can be parsed without any performance penalty. To test it out yourself, implement the first DAG and watch for "Hello from Airflow!" in the scheduler logs!
+
+Note that import statements also count as top-level code. So, if you have an import statement that takes a long time or the imported module itself executes code at the top level, that can also impact the performance of the scheduler. The following example illustrates how to handle expensive imports.
+
+.. code-block:: python
+
+  # It's ok to import modules that are not expensive to load at top-level of a DAG file
+  import random
+  import pendulum
+
+  # Expensive imports should be avoided as top level imports, because DAG files are parsed frequently, resulting in top-level code being executed.
+  #
+  # import pandas
+  # import torch
+  # import tensorflow
+  #
+
+  ...
+
+
+  @task()
+  def do_stuff_with_pandas_and_torch():
+      import pandas
+      import torch
+
+      # do some operations using pandas and torch
 
-          a = np.arange(15).reshape(3, 5)
-          print(a)
-          return a
 
-      print_array()
+  @task()
+  def do_stuff_with_tensorflow():
+      import tensorflow
 
-In the Bad example, NumPy is imported each time the DAG file is parsed, which will result in suboptimal performance in the DAG file processing. In the Good example, NumPy is only imported when the task is running.
+      # do some operations using tensorflow
 
-Since it is not always obvious, see the next chapter to check how my code is "top-level" code.
 
 How to check if my code is "top-level" code
 -------------------------------------------
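A quick way to observe the effect this commit describes (a sketch, not part of the patch; the file contents, sleep durations, and helper name are illustrative) is to time how long a Python file takes to import, since importing a module executes exactly its top-level code:

```python
import importlib.util
import tempfile
import textwrap
import time


def time_module_import(path: str) -> float:
    """Return how long importing (i.e. parsing) a Python file takes, in seconds."""
    spec = importlib.util.spec_from_file_location("dag_under_test", path)
    module = importlib.util.module_from_spec(spec)
    start = time.perf_counter()
    spec.loader.exec_module(module)  # executes all top-level code in the file
    return time.perf_counter() - start


# Simulate the "bad" pattern: expensive work at the top level of the file.
slow_dag = tempfile.NamedTemporaryFile("w", suffix=".py", delete=False)
slow_dag.write(textwrap.dedent("""
    import time
    time.sleep(0.5)  # stands in for expensive_api_call()
"""))
slow_dag.close()

# Simulate the "good" pattern: the expensive call only happens inside a function.
fast_dag = tempfile.NamedTemporaryFile("w", suffix=".py", delete=False)
fast_dag.write(textwrap.dedent("""
    import time

    def expensive_api_call():
        time.sleep(0.5)
"""))
fast_dag.close()

slow = time_module_import(slow_dag.name)
fast = time_module_import(fast_dag.name)
print(f"top-level call: {slow:.2f}s, call inside function: {fast:.2f}s")
```

The first file pays the full sleep at import time, the second pays nothing until the function is actually called, which mirrors why the scheduler only suffers when the expensive call sits at the top level of the DAG file.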
