potiuk commented on issue #33688:
URL: https://github.com/apache/airflow/issues/33688#issuecomment-1693268984
I took a quick look, and:
The log lines below come from the same forked process. Between this:
```
[2023-08-23, 18:23:04 UTC] {standard_task_runner.py:85} INFO - Job 1180119:
Subtask fetch_header
```
and this:
```
[2023-08-23, 18:23:35 UTC] {task_command.py:415} INFO - Running
<TaskInstance: mydagname.fetch_header
mydagname__bot_extract_sales__v10678108plzv__2023-08-21T20:36:29.046083
[running]> on host 90c66b7612c1
```
the following things happen:
1) setproctitle -> setting the title of the process
2) setting a few environment variables: _AIRFLOW_PARSING_CONTEXT_DAG_ID,
_AIRFLOW_PARSING_CONTEXT_TASK_ID
3) parsing the command parameters (`airflow tasks run --raw ...`)
4) loading the config file prepared before the process is forked
5) the on_starting() listener is fired
6) dags/tasks are parsed from DAG files or, if you are using pickling
(unlikely nowadays), from the pickled representation
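To make the gap easier to reason about, the steps above can be sketched roughly like this (this is NOT Airflow's actual code, just an illustration of the sequence; names and messages are simplified):

```python
import os


def run_forked_task(dag_id: str, task_id: str) -> None:
    """Rough sketch of what the forked task process does between
    the two log lines above (steps 1-6)."""
    # 1) retitle the process so it is identifiable in `ps` output
    try:
        from setproctitle import setproctitle
        setproctitle(f"airflow task runner: {dag_id} {task_id}")
    except ImportError:
        pass  # setproctitle is optional in this sketch
    # 2) expose the parsing context to DAG files via env vars
    os.environ["_AIRFLOW_PARSING_CONTEXT_DAG_ID"] = dag_id
    os.environ["_AIRFLOW_PARSING_CONTEXT_TASK_ID"] = task_id
    # 3) parse the "airflow tasks run --raw ..." command parameters
    # 4) load the config file prepared before the fork
    # 5) fire the on_starting() listener       <- Hypothesis 1 hangs here
    # 6) parse the DAG file to locate the task <- Hypothesis 2 hangs here
```

Steps 1-4 are cheap local operations, which is why the suspicion falls on 5) and 6).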
Of those actions, I can likely exclude the first 4 (unless your filesystem
is broken), and by the method of deduction, once we exclude the impossible,
whatever remains must be the reason. So it's either 5) or 6).
My best guess @jaetma is one of the following:
Hypothesis 1): your on_starting() listener is hanging on something. It's not
very likely that you already have a listener, but since you were on 2.6,
it's possible.
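For context, a listener hook runs in the forked process before the task starts, so any blocking work in it lands exactly in that 30-second gap. A minimal sketch (purely hypothetical module, not from your setup):

```python
import time


# Hypothetical listener module, registered through an Airflow plugin.
# Airflow fires on_starting() in the forked process before the task
# runs, so every task start pays for whatever happens here.
def on_starting(component=None) -> float:
    started = time.monotonic()
    # A blocking call like this would stall EVERY task start:
    # requests.post("https://metrics.internal.example/start", timeout=30)
    #
    # ...do only fast, non-blocking work in listeners...
    return time.monotonic() - started


print(f"listener took {on_starting():.4f}s")
```

If you do have a listener, timing it like this would quickly confirm or rule out Hypothesis 1.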
Hypothesis 2): parsing your DAGs inside the Celery worker is hanging on
something. The most likely cause is some top-level code that (for example)
makes a networking call that hangs.
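The pattern to look for is something like this (hypothetical DAG snippet; the URL and helper name are illustrative). Everything at module level runs on every parse of the DAG file, including inside the freshly forked task process:

```python
# BAD: executed at parse time, in the scheduler AND in every forked
# task process; a hanging DNS lookup here blocks the fork for ~30s:
# HEADERS = requests.get("https://internal.example/headers", timeout=5).json()


def fetch_header_rows():
    """GOOD: network work deferred until the task actually executes."""
    import requests  # hypothetical dependency, imported lazily as well
    resp = requests.get("https://internal.example/headers", timeout=5)
    resp.raise_for_status()
    return resp.json()
```

Moving the call inside a callable (or a `@task`-decorated function) means it only runs at execute time, not during parsing.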
My educated guess, given the 30 seconds, is that your DNS is
misconfigured/broken, or that networking prevents it from responding
quickly. Hanging on a DNS call is quite a plausible hypothesis: from what I
remember, 30 seconds is a common default DNS resolution timeout. So my best
guess is that somewhere during your migration, the networking in your Docker
Compose environment got broken and your DNS stopped working properly, making
whatever you do at the top level of your DAG (which BTW you [should
not](https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code)
do) slow to respond.
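You can test the DNS theory directly by timing a lookup from inside the Celery worker container; with a broken resolver this will often hang for roughly the resolver timeout before failing:

```python
import socket
import time


def time_dns_lookup(hostname: str) -> float:
    """Measure how long one DNS lookup takes. A broken resolver
    typically hangs for a long, fixed time and then fails."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        pass  # a failure after a long wait is exactly the symptom
    return time.monotonic() - start


# Replace "localhost" with a hostname your top-level DAG code resolves:
print(f"lookup took {time_dns_lookup('localhost'):.2f}s")
```

If the lookup consistently takes ~30 seconds, that would point squarely at the Docker Compose networking/DNS setup.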

Hypothesis 2a)
Another variant of Hypothesis 2): if you are using Airflow Variables at the
top of your DAG code, or any other database access (which BTW you [should
not](https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code)
do), this leads to opening a database connection. Opening a new DB
connection might be problematic from your database server's point of view if
many connections are already open. You would see this on your database
server as a high number of open connections. Postgres does not cope well
with the number of connections Airflow opens, so if you use Postgres and do
not have PgBouncer between Airflow and Postgres, this might be the reason. I
would love it if you could check this, because I have reason to believe we
could open many more connections in Airflow 2.7 (just a suspicion for now).
If my guess is right, you should see a much larger number of connections to
your DB when you run 2.7. It would be great if you could check that
hypothesis.
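One way to check, assuming Postgres and psycopg2 (the database name and connection string below are assumptions, adjust to your setup), is to count connections by state in `pg_stat_activity` while running 2.6 vs 2.7:

```python
# Query to run against the Airflow metadata database (Postgres):
CONNS_BY_STATE_SQL = """
    SELECT state, count(*)
    FROM pg_stat_activity
    WHERE datname = %s
    GROUP BY state
"""


def summarize_connections(rows):
    """Turn (state, count) rows into a dict. A large pile of 'idle'
    connections on 2.7 would support the hypothesis."""
    return {state or "unknown": count for state, count in rows}


# Usage against a live server (connection string is an assumption):
# import psycopg2
# with psycopg2.connect("dbname=airflow user=airflow host=postgres") as conn:
#     with conn.cursor() as cur:
#         cur.execute(CONNS_BY_STATE_SQL, ("airflow",))
#         print(summarize_connections(cur.fetchall()))
```

Comparing the totals between the two Airflow versions under the same workload would confirm or refute the extra-connections suspicion.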