hterik opened a new issue, #27341:
URL: https://github.com/apache/airflow/issues/27341

   ### Description
   
   When using `KubernetesExecutor` with task workers based on the official Docker 
image, [there is an 
Entrypoint](https://github.com/apache/airflow/blob/main/Dockerfile#L1408) that 
runs a shell script [where **airflow db check** is 
called](https://github.com/apache/airflow/blob/main/scripts/docker/entrypoint_prod.sh#L282).
   
https://github.com/apache/airflow/blob/550b49b418b0c364b6483cda07e5371b2d816261/scripts/docker/entrypoint_prod.sh#L282-L283
   
   As opposed to Celery tasks, Kubernetes tasks go through this whole startup 
process for every task instance. With a high number of task instances, this path 
needs a bit more consideration.
   
   Currently I see two issues:
   * Python interpreter startup is famously slow, especially for large projects 
such as Airflow. Even though I haven't measured this particular aspect to know 
whether it has a significant impact, starting Airflow twice just to make one 
ping is redundant.
   * New PostgreSQL connections are slow and expensive for the database. If the 
db check could be run from within the same process as the task, SQLAlchemy's 
connection pool could reuse the same connection after the check succeeds. This 
could cut the number of connections made almost in half. Correct me if I'm 
wrong about this; I'm not deeply familiar with the SQLAlchemy internals and how 
they are integrated in Airflow. (PgBouncer supposedly helps with this, but 
fewer connections are always better regardless.)
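   The pooling point in the second bullet can be sketched with plain SQLAlchemy 
(this is not Airflow code; SQLite stands in for the metadata database here). 
Counting pool `connect` events shows that a `SELECT 1` ping and a later "real" 
query share one underlying DBAPI connection when they run in the same process:

   ```python
   from sqlalchemy import create_engine, event, text

   # SQLite stands in for the Airflow metadata database.
   engine = create_engine("sqlite:///:memory:")

   connect_count = 0

   @event.listens_for(engine, "connect")
   def _count_connects(dbapi_conn, connection_record):
       # Fires only when the pool opens a brand-new DBAPI connection.
       global connect_count
       connect_count += 1

   # The "db check" ping, equivalent to what `airflow db check` runs:
   with engine.connect() as conn:
       conn.execute(text("SELECT 1"))

   # A later "real" operation: the pool hands back the checked-in
   # connection, so no second connection is opened.
   with engine.connect() as conn:
       conn.execute(text("SELECT 1"))

   print(connect_count)  # 1: only one real connection was opened
   ```

   Running the check in a separate `airflow db check` process, by contrast, 
necessarily opens (and tears down) its own connection.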
   
   ----
   My suggestion is to do something equivalent to just calling `db.check()` 
inside `task_command.task_run()`.
   This could eventually also allow richer code for check retries and logging, 
and would simplify the entrypoint by removing the shell-script check.
   There is a `check_db=False` argument on the `task_run` decorator today. 
Setting it to True, however, also includes db migrations in the check. I'm not 
familiar enough with this to say whether that's a good idea or not.
   
   
https://github.com/apache/airflow/blob/550b49b418b0c364b6483cda07e5371b2d816261/airflow/cli/commands/task_command.py#L312-L313
   
https://github.com/apache/airflow/blob/550b49b418b0c364b6483cda07e5371b2d816261/airflow/utils/cli.py#L98-L102
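   The "richer check retries and logging" idea might look roughly like the 
sketch below. This is purely illustrative, not Airflow code: the function name 
and parameters are made up, and plain SQLAlchemy with SQLite stands in for the 
metadata database. It replaces the entrypoint's shell retry loop with an 
in-process ping:

   ```python
   import logging
   import time

   from sqlalchemy import create_engine, text
   from sqlalchemy.exc import OperationalError

   log = logging.getLogger(__name__)

   def check_db_with_retries(engine, retries=5, delay=1.0):
       """Hypothetical in-process replacement for the entrypoint's shell loop:
       ping the DB before running the task, retrying with logging."""
       for attempt in range(1, retries + 1):
           try:
               with engine.connect() as conn:
                   conn.execute(text("SELECT 1"))
               log.info("DB reachable on attempt %d", attempt)
               return True
           except OperationalError as exc:
               log.warning(
                   "DB check failed (attempt %d/%d): %s", attempt, retries, exc
               )
               time.sleep(delay)
       return False

   # SQLite stands in for the metadata database in this sketch:
   ok = check_db_with_retries(create_engine("sqlite:///:memory:"), retries=3, delay=0.1)
   print(ok)  # True
   ```

   Because this runs in the same process as the task, the connection it opens 
stays in the pool for the task's later queries, as described above.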
   
   ----
   A workaround is to set the envvar `CONNECTION_CHECK_MAX_COUNT=0` in your 
`pod_template_file.yaml`. I don't know how that could impact the stability of 
running tasks when the DB is unreachable at task startup. Ideally that scenario 
should be handled anyway: the check is only a point-in-time sample, and the DB 
could still become unreachable before the next real operation (a race 
condition).
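   For concreteness, the workaround would be set on the worker container in the 
pod template, something like the fragment below (container and pod names are 
illustrative; only the env entry matters):

   ```yaml
   apiVersion: v1
   kind: Pod
   metadata:
     name: airflow-worker-template
   spec:
     containers:
       - name: base
         env:
           - name: CONNECTION_CHECK_MAX_COUNT
             value: "0"  # skip the entrypoint's "airflow db check" loop
   ```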
   
   ### Use case/motivation
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

