potiuk commented on issue #30593:
URL: https://github.com/apache/airflow/issues/30593#issuecomment-1505054849

   We definitely need the logs. I looked at the change and I can think of only one reason why the metrics could have increased - namely that there is a recurring problem with some database operation related to callbacks, and the retries simply multiply the time it takes to run those operations until they finally fail. This might indicate a consistent problem - either a bug in Airflow or a deadlock/operational issue with your database - but we need more data to take a closer look and diagnose it fully.
   
   Can you tell us how many dag file processors you have, @wookiist?
   
   So I have two kind requests, @wookiist:
   
   1) Could you please increase the log level for the dag file processor to debug (it can be done only for the processor, so it does not impact other components): https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#logging-level (it can be done by adding the `AIRFLOW__LOGGING__LOGGING_LEVEL` env variable and restarting the processor - see the sketch after this list) and share a gist with the logs of the dag file processor. Also, if you could check whether your database logs contain any error messages signalling that some queries fail around the time the dag file processor runs, that could help a lot as well.
   
   2) Can you please take a look at the `[scheduler] max_callbacks_per_loop` configuration option (the default is 20)? If you could lower it down to, say, 5 (if you have 20) and see if that helps - see the example after this list. I have a theory that this parameter might make your queries either run long or run into deadlocks, and that might cause the queries to be retried a few times.
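
   For the debug logging, a minimal example (assuming the standalone dag processor is started with `airflow dag-processor` and configured via environment variables - adapt to your deployment):

   ```bash
   # Enable debug logging only in the environment of the standalone
   # dag file processor, so other components are not affected:
   export AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG
   airflow dag-processor
   ```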
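
   And for the callbacks setting, using the standard `AIRFLOW__{SECTION}__{KEY}` env variable convention (equivalent to editing `[scheduler] max_callbacks_per_loop` in `airflow.cfg`):

   ```bash
   # Lower the number of callbacks fetched during a single loop
   # from the default of 20 down to 5:
   export AIRFLOW__SCHEDULER__MAX_CALLBACKS_PER_LOOP=5
   ```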
   
   Longer explanation (and the theory I have):
   
   The processing time you see is not only parsing time - it also includes running callbacks related to the DAG being processed. When there are callbacks related to a DAG file that cannot be run in the worker, they are executed by the dag file processor after the file is parsed. The way it works is that the main parsing loop starts a processor for each file and feeds it with the callbacks to execute. The change you mentioned does not directly contribute to the time, but it produces callbacks that the file processor executes, and those callbacks do contribute to the time you see. So the increase might not come from "parsing" but from callback processing, and it could be that the db_retry mechanism introduced in the PR you mentioned has a problem where it (accidentally) sends multiple repeated callbacks to the processor - and that increases the processing time.
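
   To illustrate the theory with a simplified sketch (this is not Airflow's actual code - all names here are made up): the reported processing time covers both parsing and callback execution, so repeated callbacks inflate it even though parsing itself is unchanged:

   ```python
   import time

   def parse_dag_file(path: str) -> None:
       # Stand-in for the actual DAG file parsing (hypothetical).
       time.sleep(0.1)

   def run_callback(callback: str) -> None:
       # Stand-in for executing one callback against the database (hypothetical).
       time.sleep(0.05)

   def process_file(path: str, callbacks: list[str]) -> float:
       # The reported "processing time" covers parsing AND callbacks.
       start = time.monotonic()
       parse_dag_file(path)
       for cb in callbacks:
           run_callback(cb)
       return time.monotonic() - start

   # Expected case - one callback, time dominated by parsing: ~0.15s
   print(process_file("dags/example.py", ["on_failure"]))

   # If retries accidentally re-queue the same callback, the reported
   # time grows although parsing itself is unchanged: ~0.35s
   print(process_file("dags/example.py", ["on_failure"] * 5))
   ```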
   
   With the logs, we will be able to see whether that is happening, and likely also the reason for it (the database logs might give more explanation of what's going on).
   
   cc: @mhenc - I think this one is closely related to the standalone file processor case and might need your insights once we get more information
   

