potiuk commented on issue #33688:
URL: https://github.com/apache/airflow/issues/33688#issuecomment-1693268984

   I took a quick look, and:
   
   The logs below are coming from the same forked process. 
   
   Between this:
   
   ```
   [2023-08-23, 18:23:04 UTC] {standard_task_runner.py:85} INFO - Job 1180119: 
Subtask fetch_header
   ```
   
   and this:
   
   ```
   [2023-08-23, 18:23:35 UTC] {task_command.py:415} INFO - Running 
<TaskInstance: mydagname.fetch_header 
mydagname__bot_extract_sales__v10678108plzv__2023-08-21T20:36:29.046083 
[running]> on host 90c66b7612c1
   ```
   
   the following things happen:
   
   1) setproctitle -> setting the title of the process
   2) setting a few environment variables: _AIRFLOW_PARSING_CONTEXT_DAG_ID, _AIRFLOW_PARSING_CONTEXT_TASK_ID
   3) parsing command parameters ("airflow tasks run --raw ....")
   4) loading the config file prepared before the process is forked
   5) the on_starting() listener is fired
   6) dags/tasks are parsed from DAG files, or, if you are using pickling (likely not), from the pickled representation
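   To make step 2) concrete, here is a rough, stdlib-only sketch of what the forked process does early on (the helper name is hypothetical; the real logic lives inside Airflow's task runner):

```python
import os

def prepare_forked_task_process(dag_id: str, task_id: str) -> None:
    """Hypothetical helper illustrating steps 1) and 2) above (a sketch,
    not Airflow's actual code)."""
    # Step 2): expose the parsing context so DAG files can know which
    # dag/task this process is about to run.
    os.environ["_AIRFLOW_PARSING_CONTEXT_DAG_ID"] = dag_id
    os.environ["_AIRFLOW_PARSING_CONTEXT_TASK_ID"] = task_id
    # Step 1) would call setproctitle() here (third-party package), e.g.:
    # setproctitle(f"airflow task supervisor: {dag_id} {task_id}")

prepare_forked_task_process("mydagname", "fetch_header")
print(os.environ["_AIRFLOW_PARSING_CONTEXT_DAG_ID"])
```

   None of these steps does any I/O beyond reading a local file, which is why they are unlikely to account for a 30-second gap.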
   
   Of those actions, I can likely exclude the first 4 (unless your filesystem is broken), and, by the deduction method (if we exclude the impossible, what remains must be the reason), it's either 5) or 6).
   
   My best guess @jaetma is that either:
   
   Hypothesis 1): your on_starting listener is hanging on something. It's not very likely that you already have a listener, but since you were on 2.6, it's possible.
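   One way to check Hypothesis 1) is to time the listener code yourself, outside Airflow. A minimal sketch (`my_listener` is a hypothetical stand-in for whatever your real on_starting listener does):

```python
import time

def time_hook(hook):
    """Diagnostic sketch: call a listener-style hook and report how long
    it took, so a hanging listener shows up immediately."""
    start = time.monotonic()
    hook()
    elapsed = time.monotonic() - start
    print(f"{hook.__name__} took {elapsed:.2f}s")
    return elapsed

def my_listener():
    # Stand-in: paste the body of your actual on_starting listener here.
    pass

elapsed = time_hook(my_listener)
```

   If the timed call takes anywhere near 30 seconds, you have found the culprit.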
   
   Hypothesis 2): parsing your DAGs inside the Celery worker is hanging on something. The most likely reason is that you have some top-level code that (for example) makes a networking call that hangs.
   
   My educated guess, given the 30 seconds, is that your DNS is misconfigured/broken, or networking prevents it from responding quickly.
   
   Hanging on a DNS call is a quite plausible hypothesis. From what I remember, 30 seconds is often the default DNS resolution timeout. So my best guess is that somewhere during your migration the networking in your Docker Compose environment got broken and your DNS is not working properly, thus making whatever you do at the top level of your DAG (which, BTW, you [should not](https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code) do) slow to respond.
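   You can test the DNS theory directly from inside the worker container with a quick stdlib-only check (replace the hostname with whatever your top-level DAG code actually talks to):

```python
import socket
import time

def time_dns(host: str) -> float:
    """Return how many seconds a single DNS resolution of `host` takes."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, None)
    except socket.gaierror:
        pass  # even a failed lookup tells you how long the resolver hung
    return time.monotonic() - start

# Replace "localhost" with a host your top-level DAG code contacts.
elapsed = time_dns("localhost")
print(f"resolved in {elapsed:.3f}s")
```

   A result of roughly 30 seconds would match the gap in your logs and point at the resolver configured in your Docker Compose network.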
   
   
![image](https://github.com/apache/airflow/assets/595491/2b869cee-db97-466b-b3fd-82059dfc79c0)
   
   Hypothesis 2a)
   
   Another variant of Hypothesis 2): if you are using Airflow Variables at the top of your DAG code, or any other database access (which, again, you [should not](https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code) do), this leads to a database connection. Opening a new DB connection might be problematic from your server's point of view if there are already many connections open. You would see this on your database server as a high number of open connections. Postgres does not cope well with the number of connections Airflow opens, so if you use Postgres and do not have PgBouncer between Airflow and Postgres, this might be the reason. I would love it if you could check this, because I have a reason to believe (just a suspicion so far) that we could be opening many more connections in Airflow 2.7. If my guess is right, you should see a much larger number of connections to your DB when you run 2.7. If you could check that hypothesis, that would be great.
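   To check the connection-count theory, you can ask Postgres directly. A sketch (adjust the host, user, and database name to your Docker Compose setup; "airflow" is only the common default):

```shell
# Count open connections to the Airflow metadata database
# (adjust -h/-U/-d to match your docker-compose configuration).
psql -h localhost -U airflow -d airflow \
  -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'airflow';"
```

   Running this while on 2.6 and again after upgrading to 2.7 would show whether the connection count jumps.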
   
   
    
   

