hterik opened a new issue, #27341:
URL: https://github.com/apache/airflow/issues/27341

   ### Description
   
   When using `KubernetesExecutor` with task workers based on the official Docker 
image, [there is an 
Entrypoint](https://github.com/apache/airflow/blob/main/Dockerfile#L1408) that 
runs a shell script [where **airflow db check** is 
called](https://github.com/apache/airflow/blob/main/scripts/docker/entrypoint_prod.sh#L282).
   
https://github.com/apache/airflow/blob/550b49b418b0c364b6483cda07e5371b2d816261/scripts/docker/entrypoint_prod.sh#L282-L283
   
   As opposed to Celery tasks, Kubernetes tasks go through this whole startup 
process for every task instance. With a high number of task instances, this path 
needs a bit more consideration.
   
   Currently I see two issues:
   * Python interpreter startup is famously slow, especially for large projects 
such as Airflow. Even though I haven't measured this particular aspect to know 
whether it has a significant impact, starting Airflow twice just to make one 
ping is redundant.
   * New PostgreSQL connections are slow and expensive for the database. If the 
db check could be run from within the same process as the task, SQLAlchemy's 
connection pool could reuse the same connection after the check succeeds. This 
could cut the number of connections made almost in half. Correct me if I'm 
wrong about this; I'm not deeply familiar with the SQLAlchemy internals and how 
they are integrated in Airflow. (PgBouncer supposedly helps with this, but 
fewer connections are always better regardless.)
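   The pooling point in the second bullet can be sketched with plain SQLAlchemy 
(this is not Airflow code; SQLite stands in for the metadata database here). 
Counting pool `connect` events shows that a `SELECT 1` ping and a later "real" 
query share one underlying DBAPI connection when they run in the same process:

   ```python
   from sqlalchemy import create_engine, event, text

   # SQLite stands in for the Airflow metadata database.
   engine = create_engine("sqlite:///:memory:")

   connect_count = 0

   @event.listens_for(engine, "connect")
   def _count_connects(dbapi_conn, connection_record):
       # Fires only when the pool opens a brand-new DBAPI connection.
       global connect_count
       connect_count += 1

   # The "db check" ping, equivalent to what `airflow db check` runs:
   with engine.connect() as conn:
       conn.execute(text("SELECT 1"))

   # A later "real" operation: the pool hands back the checked-in
   # connection, so no second connection is opened.
   with engine.connect() as conn:
       conn.execute(text("SELECT 1"))

   print(connect_count)  # 1: only one real connection was opened
   ```

   Running the check in a separate `airflow db check` process, by contrast, 
necessarily opens (and tears down) its own connection.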
   
   ----
   My suggestion is to do something equivalent to just calling `db.check()` 
inside `task_command.task_run()`.
   This could eventually also allow richer code for check retries and logging, 
and would simplify the entrypoint by removing the shell-script check.
   There is a `check_db=False` argument on the `task_run` decorator today. 
Setting it to True, however, also includes db migrations in the check. I'm not 
familiar enough with this to say whether that's a good idea or not.
   
   
https://github.com/apache/airflow/blob/550b49b418b0c364b6483cda07e5371b2d816261/airflow/cli/commands/task_command.py#L312-L313
   
https://github.com/apache/airflow/blob/550b49b418b0c364b6483cda07e5371b2d816261/airflow/utils/cli.py#L98-L102
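   The "richer check retries and logging" idea might look roughly like the 
sketch below. This is purely illustrative, not Airflow code: the function name 
and parameters are made up, and plain SQLAlchemy with SQLite stands in for the 
metadata database. It replaces the entrypoint's shell retry loop with an 
in-process ping:

   ```python
   import logging
   import time

   from sqlalchemy import create_engine, text
   from sqlalchemy.exc import OperationalError

   log = logging.getLogger(__name__)

   def check_db_with_retries(engine, retries=5, delay=1.0):
       """Hypothetical in-process replacement for the entrypoint's shell loop:
       ping the DB before running the task, retrying with logging."""
       for attempt in range(1, retries + 1):
           try:
               with engine.connect() as conn:
                   conn.execute(text("SELECT 1"))
               log.info("DB reachable on attempt %d", attempt)
               return True
           except OperationalError as exc:
               log.warning(
                   "DB check failed (attempt %d/%d): %s", attempt, retries, exc
               )
               time.sleep(delay)
       return False

   # SQLite stands in for the metadata database in this sketch:
   ok = check_db_with_retries(create_engine("sqlite:///:memory:"), retries=3, delay=0.1)
   print(ok)  # True
   ```

   Because this runs in the same process as the task, the connection it opens 
stays in the pool for the task's later queries, as described above.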
   
   ----
   A workaround is to set the envvar `CONNECTION_CHECK_MAX_COUNT=0` in your 
`pod_template_file.yaml`. I don't know how that could impact the stability of 
running tasks when the DB is unreachable at task startup. Ideally that scenario 
should be handled anyway: the check is only a point-in-time sample, and the DB 
could still become unreachable before the next real operation (a race 
condition).
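   For concreteness, the workaround would be set on the worker container in the 
pod template, something like the fragment below (container and pod names are 
illustrative; only the env entry matters):

   ```yaml
   apiVersion: v1
   kind: Pod
   metadata:
     name: airflow-worker-template
   spec:
     containers:
       - name: base
         env:
           - name: CONNECTION_CHECK_MAX_COUNT
             value: "0"  # skip the entrypoint's "airflow db check" loop
   ```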
   
   ### Use case/motivation
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

