Re: [PR] Improve testing harness to separate DB and non-DB tests [airflow]

via GitHub Tue, 31 Oct 2023 07:16:06 -0700


potiuk commented on code in PR #35160:
URL: https://github.com/apache/airflow/pull/35160#discussion_r1377661892



##########
TESTING.rst:
##########
@@ -72,56 +66,259 @@ fixture. This in turn makes Airflow load test 
configuration from the file
 defaults from ``airflow/config_templates/config.yml``. If you want to add some 
test-only configuration,
 as default for all tests you should add the value to this file.
 
-You can also of course override the values in individual test by patching 
environment variables following
+You can also - of course - override the values in individual test by patching 
environment variables following
 the usual ``AIRFLOW__SECTION__KEY`` pattern or ``conf_vars`` context manager.
 
-.. note:: Previous way of setting the test configuration
+Airflow test types
+------------------
 
-  The test configuration for Airflow before July 2023 was automatically 
generated in a file named
-  ``AIRFLOW_HOME/unittest.cfg``. The template for it was stored in 
"config_templates" next to the yaml file.
-  However writing the file was only done for the first time you run airflow 
and you had to manually
-  maintain the file. It was pretty arcane knowledge, and this generated file 
in {AIRFLOW_HOME}
-  has been overwritten in the Breeze environment with another CI-specific 
file. Using ``unit_tests.cfg``
-  as a single source of the configuration for tests - coming from Airflow 
sources
-  rather than from {AIRFLOW_HOME} is much more convenient and it is 
automatically used by pytest.
+Airflow tests in the CI environment are split into several test types. You can 
narrow down which
+test types you want to use in various ``breeze testing`` sub-commands in three 
ways:
 
-  The unittest.cfg file generated in {AIRFLOW_HOME} will no longer be used and 
can be removed.
+* via specifying the ``--test-type`` when you run a single test type in the ``breeze 
testing tests`` command
+* via specifying a space-separated list of test types via 
``--parallel-test-types`` or
+  ``--exclude-parallel-test-types`` options when you run tests in parallel (in 
several testing commands)
 
+Those test types are defined:
 
-Airflow test types
-------------------
+* ``Always`` - those are tests that should be always executed (always 
sub-folder)
+* ``API`` - Tests for the Airflow API (api, api_connexion, api_experimental 
and api_internal sub-folders)
+* ``CLI`` - Tests for the Airflow CLI (cli folder)
+* ``Core`` - for the core Airflow functionality (core, executors, jobs, 
models, ti_deps, utils sub-folders)
+* ``Operators`` - tests for the operators (operators folder with exception of 
Virtualenv Operator tests and
+  External Python Operator tests that have their own test type). They are 
skipped by the
+``virtualenv_operator`` and ``external_python_operator`` test markers that the 
tests are marked with.
+* ``WWW`` - Tests for the Airflow webserver (www folder)
+* ``Providers`` - Tests for all Providers of Airflow (providers folder)
+* ``PlainAsserts`` - tests that require disabling ``assert-rewrite`` feature 
of Pytest (usually because
+  a buggy/complex implementation of an imported library) (``plain_asserts`` 
marker)
+* ``Other`` - all other tests remaining after the above tests are selected
 
-Airflow tests in the CI environment are split into several test types:
+There are also Virtualenv/ExternalPython operator test types that are excluded 
from ``Operators`` test type
+and run as separate test types. Those are:
 
-* Always - those are tests that should be always executed (always folder)
-* Core - for the core Airflow functionality (core folder)
-* API - Tests for the Airflow API (api and api_connexion folders)
-* CLI - Tests for the Airflow CLI (cli folder)
-* WWW - Tests for the Airflow webserver (www folder)
-* Providers - Tests for all Providers of Airflow (providers folder)
-* Other - all other tests (all other folders that are not part of any of the 
above)
+* ``PythonVenv`` - tests for PythonVirtualenvOperator - selected directly as 
TestPythonVirtualenvOperator
+* ``BranchPythonVenv`` - tests for BranchPythonVirtualenvOperator - selected 
directly as TestBranchPythonVirtualenvOperator
+* ``ExternalPython`` - tests for ExternalPythonOperator - selected directly as 
TestExternalPythonOperator
+* ``BranchExternalPython`` - tests for BranchExternalPythonOperator - selected 
directly as TestBranchExternalPythonOperator
+
+We also have test types that run "all" tests (so they do not look at the folder, 
but at the ``pytest`` markers
+the tests are marked with) with some filters applied:
+
+* ``All-Postgres`` - tests that require Postgres database. They are only run 
when backend is Postgres (``backend("postgres")`` marker)
+* ``All-MySQL`` - tests that require MySQL database. They are only run when 
backend is MySQL (``backend("mysql")`` marker)
+* ``All-Quarantined`` - tests that are flaky and need to be fixed 
(``quarantined`` marker)
+* ``All`` - all tests are run (this is the default)
+
+
+We also have ``Integration`` tests that run with external software started 
via the ``--integration`` flag
+in the ``breeze`` environment - via ``breeze testing 
integration-tests``.
+
+* ``Integration`` - tests that require external integration images running in 
docker-compose
 
 This is done for three reasons:
 
 1. in order to selectively run only subset of the test types for some PRs
-2. in order to allow parallel execution of the tests on Self-Hosted runners
+2. in order to allow efficient parallel test execution of the tests on 
Self-Hosted runners
 
 For case 2. We can utilise memory and CPUs available on both CI and local 
development machines to run
-test in parallel. This way we can decrease the time of running all tests in 
self-hosted runners from
-60 minutes to ~15 minutes.
+tests in parallel, but we cannot use the pytest xdist plugin for that. By 
their nature, those tests
+rely on shared databases and they update/reset/cleanup data in the databases 
while they are executing.
+We therefore split the tests into test types and run each test type in a 
separate container with
+exclusive access to its own database instance; tests within a test type run 
sequentially.
 
-.. note::
 
-  We need to split tests manually into separate suites rather than utilise
-  ``pytest-xdist`` or ``pytest-parallel`` which could be a simpler and much 
more "native" parallelization
-  mechanism. Unfortunately, we cannot utilise those tools because our tests 
are not truly ``unit`` tests that
-  can run in parallel. A lot of our tests rely on shared databases - and they 
update/reset/cleanup the
-  databases while they are executing. They are also exercising features of the 
Database such as locking which
-  further increases cross-dependency between tests. Until we make all our 
tests truly unit tests (and not
-  touching the database or until we isolate all such tests to a separate test 
type, we cannot really rely on
-  frameworks that run tests in parallel. In our solution each of the test 
types is run in parallel with its
-  own database (!) so when we have 8 test types running in parallel, there are 
in fact 8 databases run
-  behind the scenes to support them and each of the test types executes its 
own tests sequentially.
+DB and non-DB tests
+-------------------
+
+There are two kinds of unit tests in Airflow - DB and non-DB tests.
+
+Some of the tests of Airflow (around 7000 of them as of October 2023)
+require a database to connect to in order to run. Those tests store and read 
data from Airflow DB using
+Airflow's core code and it's crucial to run the tests against all real 
databases that Airflow supports in order
+to check if the SQLAlchemy queries are correct and if the database schema is 
correct.
+
+Those tests should be marked with ``@pytest.mark.db_test`` decorator on one of 
the levels:
+
+* test method can be marked with ``@pytest.mark.db_test`` decorator
+* test class can be marked with ``@pytest.mark.db_test`` decorator
+* test module can be marked with ``pytestmark = pytest.mark.db_test`` at the top 
level of the module
+
+Airflow's CI runs different test kinds separately.
+
+The DB tests are run against the multiple databases Airflow supports, 
multiple versions of those
+and multiple Python versions it supports. To save testing time, 
not all combinations are
+tested, but enough of them are tested to detect potential problems.
+
+As of October 2023, Airflow has ~9000 Non-DB tests and around 7000 DB tests.
+
+Airflow non-DB tests
+--------------------
+
+The Non-DB tests are run once for each tested Python version with the 
``none`` database backend (which
+causes any database access to fail). Those tests are run in parallel with the 
``pytest-xdist`` plugin, which
+means that we can efficiently utilise multi-processor machines (including the 
``self-hosted`` runners with
+8 CPUs that we use to run the tests with maximum parallelism).
+
+It's usually straightforward to run those tests in local virtualenv because 
they do not require any
+setup or running database. They also run much faster than DB tests. You can 
run them with ``pytest`` command
+or with ``breeze`` that has all the dependencies needed to run all tests 
automatically installed. Of course
+you can also select just a specific test, folder or module for Pytest to 
collect and run tests from.
+The example below shows how to run all tests, parallelising them with 
``pytest-xdist``
+(by specifying ``tests`` folder):
+
+.. code-block:: bash
+
+    pytest tests --skip-db-tests -n auto
+
+
+The ``--skip-db-tests`` flag will only run tests that are not marked as DB 
tests.
+
+
+You can also run ``breeze`` command to run all the tests (they will run in a 
separate container,
+with the selected Python version and without access to any database). Adding 
``--use-xdist`` flag will run all
+tests in parallel using ``pytest-xdist`` plugin.
+
+We have a dedicated, opinionated ``breeze testing non-db-tests`` command as 
well that runs non-DB tests
+(it is also used in CI to run the non-DB tests). With it you do not have to 
specify extra flags for
+parallel running and you can run all the Non-DB tests
+(or just a subset of them with ``--parallel-test-types`` or 
``--exclude-parallel-test-types``) in parallel:
+
+.. code-block:: bash
+
+    breeze testing non-db-tests
+
+You can pass ``--parallel-test-types`` list of test types to execute or 
``--exclude-parallel-test-types``
+to exclude them from the default set:
+
+.. code-block:: bash
+
+    breeze testing non-db-tests --parallel-test-types "Providers API CLI"
+
+
+.. code-block:: bash
+
+    breeze testing non-db-tests --exclude-parallel-test-types "Providers API CLI"
+
+You can also run the same commands via ``breeze testing tests`` - by adding 
the necessary flags manually:
+
+.. code-block:: bash
+
+    breeze testing tests --skip-db-tests --backend none --use-xdist
+
+Also you can enter interactive shell with ``breeze`` and run tests from there 
if you want to iterate
+with the tests. Source files in ``breeze`` are mounted as volumes so you can 
modify them locally and
+rerun in Breeze as you will (``-n auto`` will parallelize tests using 
``pytest-xdist`` plugin):
+
+.. code-block:: bash
+
+    breeze shell --backend none --python 3.8
+    > pytest tests --skip-db-tests -n auto
+
+
+Airflow DB tests
+----------------
+
+Airflow DB tests require a database to run. It can be any of the databases 
Airflow supports, and the tests can
+be run either using local virtualenv or Breeze.
+
+
+
+By default, the DB tests will use sqlite and the "airflow.db" database created 
and populated in the
+``${AIRFLOW_HOME}`` folder. You do not need to do anything to get the database 
created and initialized,
+but if you need to clean and restart the db, you can run tests with 
``--with-db-init`` flag - then the
+database will be re-initialized. You can also set 
``AIRFLOW__DATABASE__SQL_ALCHEMY_CONN`` environment
+variable to point to supported database (Postgres, MySQL, etc.) and the tests 
will use that database. You
+might need to run ``airflow db reset`` to initialize the database in that case.
+
+The "non-DB" tests are perfectly fine to run when you have a database around, 
but if you want to run just the
+DB tests (as happens in our CI for the ``Database`` runs) you can use the 
``--run-db-tests-only`` flag to filter
+out non-DB tests. You can apply it not only to the whole 
``tests`` directory but to any
+selection of folders, files or tests that ``pytest`` supports.
+
+.. code-block:: bash
+
+    pytest tests/ --run-db-tests-only
+
+You can also run DB tests with ``breeze`` dockerized environment. You can 
choose backend to use with
+``--backend`` flag. The default is ``sqlite`` but you can also use others such 
as ``postgres`` or ``mysql``.
+You can also select backend version and Python version to use. You can specify 
the ``test-type`` to run -
+breeze will list the test types you can run with ``--help`` and provide 
auto-complete for them. Example
+below runs the ``Core`` tests with ``postgres`` backend and ``3.8`` Python 
version:
+
+We have a dedicated, opinionated ``breeze testing db-tests`` command as well 
that runs DB tests
+(it is also used in CI to run the DB tests). With it you do not have to specify 
extra flags for
+parallel running and you can run all the DB tests
+(or just a subset of them with ``--parallel-test-types`` or 
``--exclude-parallel-test-types``) in parallel:
+
+.. code-block:: bash
+
+    breeze testing db-tests --backend postgres
+
+You can pass ``--parallel-test-types`` list of test types to execute or 
``--exclude-parallel-test-types``
+to exclude them from the default set:
+
+.. code-block:: bash
+
+    breeze testing db-tests --parallel-test-types "Providers API CLI"
+
+
+.. code-block:: bash
+
+    breeze testing db-tests --exclude-parallel-test-types "Providers API CLI"
+
+You can also run the same commands via ``breeze testing tests`` - by adding 
the necessary flags manually:
+
+.. code-block:: bash
+
+    breeze testing tests --run-db-tests-only --backend postgres --run-tests-in-parallel
+
+
+Also - if you want to iterate with the tests you can enter interactive shell 
and run the tests iteratively -
+either by package/module/test or by test type - whatever ``pytest`` supports.
+
+.. code-block:: bash
+
+    breeze shell --backend postgres --python 3.8
+    > pytest tests --run-db-tests-only
+
+As explained before, you cannot run DB tests in parallel using 
``pytest-xdist`` plugin, but ``breeze`` has
+support to split all the tests into test-types to run in separate containers 
and with separate databases
+and you can run the tests using ``--run-tests-in-parallel`` flag (which is 
automatically enabled when
+you use ``breeze testing db-tests`` command):
+
+.. code-block:: bash
+
+    breeze testing tests --run-db-tests-only --backend postgres --python 3.8 --run-tests-in-parallel
+
+
+Best practices for writing DB / Non-DB tests
+============================================
+
+Usually when you add new tests you add tests "similar" to the ones that are 
already there. In most cases,
+therefore, you do not have to worry about the test type - it is 
determined automatically by
+whether the test class you add the tests to, or the whole module, is 
marked with the ``db_test`` marker.
+
+You should strive to write "pure" unit tests (i.e. non-DB tests) but sometimes 
it's just better to plug into
+the existing framework of DagRuns, Dags, Connections and Variables and use the 
Database directly rather
+than having to mock the DB access. It's up to you to decide.
+
+However, if you choose to write DB tests you have to make sure you add the 
``db_test`` marker - either to
+the test method, class (with decorator) or whole module (with pytestmark at 
the top level of the module).
+
+TODO: add examples

Review Comment:
   Yep. Will update those and link to the best practices docs when someone's 
"non-db" tests will be failing.
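
For reference, a minimal sketch of what such examples might look like, assuming the marker is registered as `db_test` (the name used in the best-practices section of the diff). The test names and bodies below are hypothetical and only illustrate the three places where the marker can go:

```python
import pytest


def test_pure_logic():
    # Non-DB test: no marker, so it is the kind of test --skip-db-tests keeps.
    assert sorted([3, 1, 2]) == [1, 2, 3]


@pytest.mark.db_test
def test_single_db_test():
    # Hypothetical DB test marked at function level; a real body would read or
    # write Airflow metadata through the database.
    pass


@pytest.mark.db_test
class TestWithDatabase:
    # Class-level marker: every test method in the class counts as a DB test.
    def test_first(self):
        pass

    def test_second(self):
        pass


# Module-level alternative (marks every test in the file, so it is left
# commented out here to keep test_pure_logic a non-DB test):
# pytestmark = pytest.mark.db_test
```

With markers placed like this, `pytest tests --skip-db-tests` would be expected to collect only `test_pure_logic`, and `--run-db-tests-only` the rest.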



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

