[GitHub] [airflow] kaxil commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

GitBox Thu, 07 Nov 2019 08:45:05 -0800

kaxil commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r343754810


 ##########
 File path: docs/howto/dags-in-production.rst
 ##########
 @@ -0,0 +1,247 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+Getting a DAG ready for production
+==================================
+
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's look at the steps you need to follow to avoid these pitfalls.
+
+Writing a DAG
+^^^^^^^^^^^^^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+------------------
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Don't use INSERT during a task re-run; an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The Python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially for 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+
+Deleting a task
+----------------
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--------------
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or tokens inside them. 
+Always use :ref:`Connections <concepts-connections>` to store data securely in 
the Airflow backend and retrieve it using a unique connection id.
+
+
+.. note::
+
+    Don't write any critical code outside the tasks. The code outside the 
tasks runs every time Airflow parses the DAG, which happens every second by 
default.
+
+    You should also avoid repeating arguments such as connection_id or S3 
paths by using default_args. This helps you to avoid mistakes while passing 
arguments.
+
+
+
+Testing a DAG
+^^^^^^^^^^^^^
+
+Airflow users should treat DAGs as production-level code. DAGs should have 
various tests to ensure that they produce the expected results.
+You can write a wide variety of tests for a DAG. Let's take a look at some of 
them.
+
+DAG Loader Test
+---------------
+
+This test should ensure that your DAG doesn't contain a piece of code that 
raises an error while loading.
+No additional code needs to be written by the user to run this test.
+
+.. code::
+
+ python your-dag-file.py
+
+Running the above command without any error ensures your DAG doesn't contain 
any uninstalled dependencies, syntax errors, etc. 
+
+You can look into :ref:`Testing a DAG <testing>` for details on how to test 
individual operators.
+
+Unit tests
+-----------
+
+Unit tests ensure that there is no incorrect code in your DAG. You can write a 
unit test for your tasks as well as your DAG.
+
+Unit test for loading a DAG
+
+.. code::
+
+ from airflow.models import DagBag
+ import unittest
+
+ class TestHelloWorldDAG(unittest.TestCase):
+ def setUp(self):
+ self.dagbag = DagBag()
+
+ def test_dag_loaded(self):
+ dag = self.dagbag.get_dag(dag_id='hello_world')
+ self.assertDictEqual(self.dagbag.import_errors, {})
+ self.assertIsNotNone(dag)
+ self.assertEqual(len(dag.tasks), 1)
 
 Review comment:
   ```diff
   - def test_dag_loaded(self):
   - dag = self.dagbag.get_dag(dag_id='hello_world')
   - self.assertDictEqual(self.dagbag.import_errors, {})
   - self.assertIsNotNone(dag)
   - self.assertEqual(len(dag.tasks), 1)
   
   + def test_dag_loaded(self):
   +    dag = self.dagbag.get_dag(dag_id='hello_world')
   +    self.assertDictEqual(self.dagbag.import_errors, {})
   +    self.assertIsNotNone(dag)
   +    self.assertEqual(len(dag.tasks), 1)
   ```
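
For reference, the whole test class with consistent indentation might look like the sketch below. The guarded import and the `skipIf` decorator are assumptions added here (they are not part of the PR) so that the file still imports on a machine without Airflow; the `hello_world` DAG id and the expected task count come from the snippet above.

```python
import unittest

try:
    from airflow.models import DagBag
except ImportError:
    # Assumption for this sketch: tolerate a missing Airflow install and
    # skip the test class below instead of failing at import time.
    DagBag = None


@unittest.skipIf(DagBag is None, "Airflow is not installed")
class TestHelloWorldDAG(unittest.TestCase):
    def setUp(self):
        # Parses every DAG file found in the configured DAGs folder.
        self.dagbag = DagBag()

    def test_dag_loaded(self):
        dag = self.dagbag.get_dag(dag_id='hello_world')
        # No DAG file should have failed to import.
        self.assertDictEqual(self.dagbag.import_errors, {})
        # The DAG must exist and contain exactly one task.
        self.assertIsNotNone(dag)
        self.assertEqual(len(dag.tasks), 1)
```

In the `.rst` file the same indentation would sit under the `.. code::` directive, so every line needs the directive's extra leading indent as well.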

