potiuk commented on issue #30593: URL: https://github.com/apache/airflow/issues/30593#issuecomment-1505054849
We definitely need the logs. I looked at the change and I can think of only one reason why your metrics could have increased: there is a recurring problem with some database operation related to callbacks, and retries simply multiply the time it takes to run the operations until they fail. This might indicate a consistent problem, either a bug in Airflow or a deadlock/operational issue with your database, but we need more data to take a closer look and diagnose it fully. Can you tell us how many dag file processors you have @wookiist?

So I have two kind requests @wookiist:

1) Could you please raise the log level of the dag file processor to debug (it can be done for the processor only, so the other components are not affected): https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#logging-level (set the `AIRFLOW__LOGGING__LOGGING_LEVEL` env variable and restart the processor) and share a gist with the dag file processor logs. It would also help a lot if you could check whether your database logs contain any error messages signalling failing queries at times correlated with the dag file processor runs.

2) Could you please take a look at the `[scheduler] max_callbacks_per_loop` configuration option (default is 20), and if you have it at 20, lower it to, say, 5 and see if that helps. I have a theory that this parameter might make your queries either run long or run into deadlocks, and that might cause the queries to be retried a few times. Both settings are sketched in the snippet below.

Longer explanation (and the theory I have): the processing time you see is not only parsing time; it also includes running the callbacks related to the DAG being processed. When you have callbacks related to a DAG file that cannot be run in the worker, they are executed by the DAG file processor after the file is parsed. The way it works is that the main parsing loop starts a processor for each file and feeds it the callbacks to execute. The change you mentioned does not directly contribute to the time, but it produces callbacks that the file processor executes, and those callbacks do contribute to the time you see. So the increase might not come from "parsing" but from callback processing, and it could be that the db_retry mechanism introduced in the PR you mentioned has a problem: it might (accidentally) send multiple repeated callbacks to the processor, which would increase the processing time (a sketch of this failure mode follows the configuration snippet below). With the logs we will be able to see whether that is happening, and likely the reason as well (the database logs might shed more light on what's going on).

cc: @mhenc, I think this one is closely related to the standalone file processor case and might need your insights once we get more information.
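For concreteness, here is a minimal sketch of both settings in env-var form, assuming an env-based deployment (adjust to however you configure your components; only the dag file processor needs the debug level):

```bash
# Apply the debug level only to the dag file processor component,
# so the log volume of the scheduler/webserver is not affected.
AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG

# Lower the callback batch size from the default of 20 to see if it helps.
AIRFLOW__SCHEDULER__MAX_CALLBACKS_PER_LOOP=5
```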
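And to illustrate the theory, a minimal, purely hypothetical Python sketch (not Airflow's actual code; `send_callbacks_then_touch_db` and `run_with_retries` are made-up names) of how a tenacity-style DB-retry wrapper around a non-idempotent operation can end up duplicating callbacks:

```python
# Illustration only: a retry wrapper re-runs ALL side effects of the wrapped
# function, so if enqueueing callbacks happens before the failing DB call,
# the same callbacks are sent to the processor once per attempt.
import tenacity

callback_queue = []  # stands in for the queue feeding the dag file processor


def send_callbacks_then_touch_db(callbacks):
    # Step 1: enqueue callbacks for the processor (a side effect).
    callback_queue.extend(callbacks)
    # Step 2: a DB operation that fails transiently (e.g. a deadlock).
    raise RuntimeError("simulated transient DB error")


@tenacity.retry(stop=tenacity.stop_after_attempt(3), reraise=True)
def run_with_retries(callbacks):
    # The whole function is retried on failure, so step 1's side effect
    # (enqueueing) happens again on every attempt.
    send_callbacks_then_touch_db(callbacks)


try:
    run_with_retries(["on_failure_callback for dag X"])
except RuntimeError:
    pass

print(len(callback_queue))  # prints 3: the callback was queued once per attempt
```

If something like this is happening, the debug logs of the dag file processor should show the same callbacks being executed repeatedly for a single parse.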
