potiuk commented on code in PR #35160:
URL: https://github.com/apache/airflow/pull/35160#discussion_r1377661892
##########
TESTING.rst:
##########
@@ -72,56 +66,259 @@ fixture. This in turn makes Airflow load test
configuration from the file
defaults from ``airflow/config_templates/config.yml``. If you want to add some
test-only configuration,
as default for all tests you should add the value to this file.
-You can also of course override the values in individual test by patching
environment variables following
+You can also - of course - override the values in individual tests by patching
environment variables following
the usual ``AIRFLOW__SECTION__KEY`` pattern or ``conf_vars`` context manager.
-.. note:: Previous way of setting the test configuration
+Airflow test types
+------------------
- The test configuration for Airflow before July 2023 was automatically
generated in a file named
- ``AIRFLOW_HOME/unittest.cfg``. The template for it was stored in
"config_templates" next to the yaml file.
- However writing the file was only done for the first time you run airflow
and you had to manually
- maintain the file. It was pretty arcane knowledge, and this generated file
in {AIRFLOW_HOME}
- has been overwritten in the Breeze environment with another CI-specific
file. Using ``unit_tests.cfg``
- as a single source of the configuration for tests - coming from Airflow
sources
- rather than from {AIRFLOW_HOME} is much more convenient and it is
automatically used by pytest.
+Airflow tests in the CI environment are split into several test types. You can
narrow down which
+test types you want to use in various ``breeze testing`` sub-commands in the
following ways:
- The unittest.cfg file generated in {AIRFLOW_HOME} will no longer be used and
can be removed.
+* by specifying ``--test-type`` when you run a single test type with the ``breeze
testing tests`` command
+* by specifying a space-separated list of test types via
``--parallel-test-types`` or
+ ``--exclude-parallel-test-types`` options when you run tests in parallel (in
several testing commands)
+The following test types are defined:
-Airflow test types
-------------------
+* ``Always`` - those are tests that should be always executed (always
sub-folder)
+* ``API`` - Tests for the Airflow API (api, api_connexion, api_experimental
and api_internal sub-folders)
+* ``CLI`` - Tests for the Airflow CLI (cli folder)
+* ``Core`` - for the core Airflow functionality (core, executors, jobs,
models, ti_deps, utils sub-folders)
+* ``Operators`` - tests for the operators (operators folder, with the exception of
Virtualenv Operator tests and
+ External Python Operator tests, which have their own test types). Those are
excluded via the
+ ``virtualenv_operator`` and ``external_python_operator`` test markers that the
tests are marked with.
+* ``WWW`` - Tests for the Airflow webserver (www folder)
+* ``Providers`` - Tests for all Providers of Airflow (providers folder)
+* ``PlainAsserts`` - tests that require disabling the ``assert-rewrite`` feature
of Pytest (usually because of
+ a buggy/complex implementation of an imported library) (``plain_asserts``
marker)
+* ``Other`` - all other tests remaining after the above tests are selected
-Airflow tests in the CI environment are split into several test types:
+There are also Virtualenv/ExternalPython operator test types that are excluded
from the ``Operators`` test type
+and run as separate test types. Those are:
-* Always - those are tests that should be always executed (always folder)
-* Core - for the core Airflow functionality (core folder)
-* API - Tests for the Airflow API (api and api_connexion folders)
-* CLI - Tests for the Airflow CLI (cli folder)
-* WWW - Tests for the Airflow webserver (www folder)
-* Providers - Tests for all Providers of Airflow (providers folder)
-* Other - all other tests (all other folders that are not part of any of the
above)
+* ``PythonVenv`` - tests for PythonVirtualenvOperator - selected directly as
TestPythonVirtualenvOperator
+* ``BranchPythonVenv`` - tests for BranchPythonVirtualenvOperator - selected
directly as TestBranchPythonVirtualenvOperator
+* ``ExternalPython`` - tests for ExternalPythonOperator - selected directly as
TestExternalPythonOperator
+* ``BranchExternalPython`` - tests for BranchExternalPythonOperator - selected
directly as TestBranchExternalPythonOperator
+
+We also have test types that run "all" tests: they do not look at the folder,
but at the ``pytest`` markers
+the tests are marked with, and run with some filters applied.
+
+* ``All-Postgres`` - tests that require Postgres database. They are only run
when backend is Postgres (``backend("postgres")`` marker)
+* ``All-MySQL`` - tests that require MySQL database. They are only run when
backend is MySQL (``backend("mysql")`` marker)
+* ``All-Quarantined`` - tests that are flaky and need to be fixed
(``quarantined`` marker)
+* ``All`` - all tests are run (this is the default)
+
+
+We also have ``Integration`` tests that run against external software
started
+via the ``--integration`` flag in the ``breeze`` environment. They are run via
``breeze testing integration-tests``.
+
+* ``Integration`` - tests that require external integration images running in
docker-compose
This is done for two reasons:
1. in order to selectively run only subset of the test types for some PRs
-2. in order to allow parallel execution of the tests on Self-Hosted runners
+2. in order to allow efficient parallel execution of the tests on
Self-Hosted runners
For case 2. We can utilise memory and CPUs available on both CI and local
development machines to run
-test in parallel. This way we can decrease the time of running all tests in
self-hosted runners from
-60 minutes to ~15 minutes.
+tests in parallel, but we cannot use the ``pytest-xdist`` plugin for that - we need to
split the tests into test
+types and run each test type with its own database instance and a separate
container, so that the tests
+in each type run with exclusive access to their database and each test
within a test type runs sequentially.
+The reason is the nature of those tests - they rely on shared databases and they
update/reset/cleanup data in the
+databases while they are executing.
-.. note::
- We need to split tests manually into separate suites rather than utilise
- ``pytest-xdist`` or ``pytest-parallel`` which could be a simpler and much
more "native" parallelization
- mechanism. Unfortunately, we cannot utilise those tools because our tests
are not truly ``unit`` tests that
- can run in parallel. A lot of our tests rely on shared databases - and they
update/reset/cleanup the
- databases while they are executing. They are also exercising features of the
Database such as locking which
- further increases cross-dependency between tests. Until we make all our
tests truly unit tests (and not
- touching the database or until we isolate all such tests to a separate test
type, we cannot really rely on
- frameworks that run tests in parallel. In our solution each of the test
types is run in parallel with its
- own database (!) so when we have 8 test types running in parallel, there are
in fact 8 databases run
- behind the scenes to support them and each of the test types executes its
own tests sequentially.
+DB and non-DB tests
+-------------------
+
+There are two kinds of unit tests in Airflow - DB and non-DB tests.
+
+Some of the Airflow tests (around 7000 of them as of October 2023)
+require a database to connect to in order to run. Those tests store and read
data from the Airflow DB using
+Airflow's core code, and it's crucial to run them against all the real
databases that Airflow supports in order
+to check that the SQLAlchemy queries are correct and that the database schema is
correct.
+
+Those tests should be marked with the ``@pytest.mark.db_test`` decorator at one of
the following levels:
+
+* a test method can be marked with the ``@pytest.mark.db_test`` decorator
+* a test class can be marked with the ``@pytest.mark.db_test`` decorator
+* a test module can be marked with ``pytestmark = pytest.mark.db_test`` at the top
level of the module
+
+Airflow's CI runs the two kinds of tests separately.
+
+The DB tests are run against the multiple databases Airflow supports,
multiple versions of those
+and the multiple Python versions it supports. In order to save testing time,
not all combinations are
+tested, but enough of them are tested to detect potential problems.
+
+As of October 2023, Airflow has ~9000 Non-DB tests and around 7000 DB tests.
+
+Airflow non-DB tests
+--------------------
+
+The Non-DB tests are run once for each tested Python version with the
``none`` database backend (which
+causes any database access to fail). Those tests are run in parallel with the
``pytest-xdist`` plugin, which
+means that we can efficiently utilise multi-processor machines (including the
``self-hosted`` runners with
+8 CPUs that we have, to run the tests with maximum parallelism).
+
+It's usually straightforward to run those tests in a local virtualenv because
they do not require any
+setup or a running database. They also run much faster than DB tests. You can
run them with the ``pytest`` command
+or with ``breeze``, which has all the dependencies needed to run all tests
automatically installed. Of course
+you can also select just a specific test, folder or module for Pytest to
collect/run tests from.
+The example below shows how to run all tests, parallelising them with
``pytest-xdist``
+(by specifying the ``tests`` folder):
+
+.. code-block:: bash
+
+ pytest tests --skip-db-tests -n auto
+
+
+The ``--skip-db-tests`` flag will only run tests that are not marked as DB
tests.
+
+
+You can also use the ``breeze`` command to run all the tests (they will run in a
separate container,
+with the selected Python version and without access to any database). Adding the
``--use-xdist`` flag will run all
+tests in parallel using the ``pytest-xdist`` plugin.
+
+We also have a dedicated, opinionated ``breeze testing non-db-tests`` command
that runs non-DB tests.
+It is also used in CI to run the non-DB tests. With it you do not have to
specify extra flags for
+parallel running, and you can run all the Non-DB tests
+(or just a subset of them with ``--parallel-test-types`` or
``--exclude-parallel-test-types``) in parallel:
+
+.. code-block:: bash
+
+ breeze testing non-db-tests
+
+You can pass a list of test types to execute with ``--parallel-test-types``, or use
``--exclude-parallel-test-types``
+to exclude them from the default set:
+
+.. code-block:: bash
+
+ breeze testing non-db-tests --parallel-test-types "Providers API CLI"
+
+
+.. code-block:: bash
+
+ breeze testing non-db-tests --exclude-parallel-test-types "Providers API CLI"
+
+You can also run the same commands via ``breeze testing tests`` - by adding
the necessary flags manually:
+
+.. code-block:: bash
+
+ breeze testing tests --skip-db-tests --backend none --use-xdist
+
+You can also enter an interactive shell with ``breeze`` and run tests from there
if you want to iterate
+on the tests. Source files in ``breeze`` are mounted as volumes so you can
modify them locally and
+rerun in Breeze as you wish (``-n auto`` will parallelize tests using the
``pytest-xdist`` plugin):
+
+.. code-block:: bash
+
+ breeze shell --backend none --python 3.8
+ > pytest tests --skip-db-tests -n auto
+
+
+Airflow DB tests
+----------------
+
+Airflow DB tests require a database to run. It can be any of the databases
supported by Airflow, and the tests can
+be run using either a local virtualenv or Breeze.
+
+
+
+By default, the DB tests will use sqlite and the "airflow.db" database created
and populated in the
+``${AIRFLOW_HOME}`` folder. You do not need to do anything to get the database
created and initialized,
+but if you need to clean and restart the db, you can run tests with the
``--with-db-init`` flag - then the
+database will be re-initialized. You can also set the
``AIRFLOW__DATABASE__SQL_ALCHEMY_CONN`` environment
+variable to point to a supported database (Postgres, MySQL, etc.) and the tests
will use that database. You
+might need to run ``airflow db reset`` to initialize the database in that case.
+
+The "non-DB" tests are perfectly fine to run when you have a database around, but
if you want to run just the
+DB tests (as happens in our CI for the ``Database`` runs) you can use the
``--run-db-tests-only`` flag to filter
+out non-DB tests (and obviously you can apply it not only to the whole
``tests`` directory but to any
+selection of folders/files/tests that ``pytest`` supports).
+
+.. code-block:: bash
+
+ pytest tests/ --run-db-tests-only
+
+You can also run DB tests with the ``breeze`` dockerized environment. You can
choose the backend to use with the
+``--backend`` flag. The default is ``sqlite`` but you can also use others such
as ``postgres`` or ``mysql``.
+You can also select the backend version and Python version to use. You can specify
the ``test-type`` to run -
+breeze will list the test types you can run with ``--help`` and provide
auto-complete for them. For example,
+passing ``--test-type Core --backend postgres --python 3.8`` to ``breeze testing tests``
runs the ``Core`` tests with the ``postgres`` backend and Python ``3.8``.
+
+We also have a dedicated, opinionated ``breeze testing db-tests`` command
that runs DB tests.
+It is also used in CI to run the DB tests. With it you do not have to specify
extra flags for
+parallel running, and you can run all the DB tests
+(or just a subset of them with ``--parallel-test-types`` or
``--exclude-parallel-test-types``) in parallel:
+
+.. code-block:: bash
+
+ breeze testing db-tests --backend postgres
+
+You can pass a list of test types to execute with ``--parallel-test-types``, or use
``--exclude-parallel-test-types``
+to exclude them from the default set:
+
+.. code-block:: bash
+
+ breeze testing db-tests --parallel-test-types "Providers API CLI"
+
+
+.. code-block:: bash
+
+ breeze testing db-tests --exclude-parallel-test-types "Providers API CLI"
+
+You can also run the same commands via ``breeze testing tests`` - by adding
the necessary flags manually:
+
+.. code-block:: bash
+
+ breeze testing tests --run-db-tests-only --backend postgres --run-tests-in-parallel
+
+
+Also, if you want to iterate on the tests, you can enter an interactive shell
and run the tests iteratively,
+either by package/module/test or by test type - whatever ``pytest`` supports.
+
+.. code-block:: bash
+
+ breeze shell --backend postgres --python 3.8
+ > pytest tests --run-db-tests-only
+
+As explained before, you cannot run DB tests in parallel using
``pytest-xdist`` plugin, but ``breeze`` has
+support to split all the tests into test-types to run in separate containers
and with separate databases
+and you can run the tests using ``--run-tests-in-parallel`` flag (which is
automatically enabled when
+you use ``breeze testing db-tests`` command):
+
+.. code-block:: bash
+
+ breeze testing tests --run-db-tests-only --backend postgres --python 3.8 --run-tests-in-parallel
+
+
+Best practices for writing DB / Non-DB tests
+============================================
+
+Usually when you add new tests you add tests "similar" to the ones that are
already there. In most cases,
+therefore, you do not have to worry about the test type - it will be
automatically selected for you because
+the test class that you add the tests to, or the whole module, will already be
marked with the ``db_test`` marker.
+
+You should strive to write "pure" unit tests (i.e. non-DB tests), but sometimes
it's just better to plug into
+the existing framework of DagRuns, Dags, Connections and Variables and use the
Database directly rather
+than having to mock the DB access, for example. It's up to you to decide.
+
+However, if you choose to write DB tests you have to make sure you add the
``db_test`` marker - either to
+the test method, class (with decorator) or whole module (with pytestmark at
the top level of the module).
+
+TODO: add examples
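A minimal sketch of the marking levels described in this hunk, assuming the marker is named ``db_test`` as in the best-practices text and is registered in the project's pytest configuration. The test and class names are hypothetical and only illustrate where the marker goes:

```python
import pytest

# Module level (assumption: uncommenting this marks every test in the file):
# pytestmark = pytest.mark.db_test


@pytest.mark.db_test  # function/method level: only this test is a DB test
def test_reads_from_db():
    ...


@pytest.mark.db_test  # class level: applies to every test method in the class
class TestNeedsDb:
    def test_something(self):
        ...


def test_pure_logic():
    # No marker, so this would count as a non-DB test.
    assert 1 + 1 == 2
```

With markers placed like this, the ``--skip-db-tests`` and ``--run-db-tests-only`` flags mentioned in the hunk would be expected to deselect or select ``test_reads_from_db`` and ``TestNeedsDb``, while ``test_pure_logic`` stays a non-DB test.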
Review Comment:
Yep. Will update those and link to the best practices docs when someone's
"non-db" tests are failing.