keysersoza opened a new issue, #40054: URL: https://github.com/apache/airflow/issues/40054
### Apache Airflow version

Other Airflow 2 version (please specify below)

### If "Other Airflow 2 version" selected, which one?

2.5.3

### What happened?

Suddenly tasks started failing without any log or error. They are not even starting; they fail as soon as they leave the Queued status. The log shows:

```
*** Log file does not exist: /opt/airflow/logs/dag_id=.../run_id=.../task_id=.../attempt=1.log
*** Fetching from: http://:8793/log/dag_id=.../run_id=.../task_id=.../attempt=1.log
*** Failed to fetch log file from worker. Request URL is missing an 'http://' or 'https://' protocol.
```

Digging a bit into the Scheduler, we found that it logs:

```
dependency 'Task Instance State' FAILED: Task is in the 'failed' state.
```

Something is setting the status of our tasks to Failed before they have even started. From some more investigation, the issue seems related or similar to https://github.com/apache/airflow/issues/16163.

We tried changing `visibility_timeout` in the Celery broker transport settings to 24 hours, but it did not help. The workers process a batch of jobs fine, then they start to crumble, with a lot of these failures for no apparent reason. We also looked into the StatsD metrics and found `celery.task_timeout_error` errors, but we couldn't figure out what these were related to. We really have no idea what could be happening.

We run workers on Kubernetes Pods and the resources are fine; no pods are being evicted. We use Redis as a backend (1 instance) and PostgreSQL as a DB (1 active, 2 passive nodes) together with PgPool and PgBouncer.

Any ideas? We can't currently upgrade, but we are planning an upgrade.

### What you think should happen instead?
Tasks should execute normally.

### How to reproduce

I don't really know how to reproduce this.

### Operating System

Linux

### Versions of Apache Airflow Providers

_No response_

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

- Kubernetes Pods for Workers
- Helm chart for Airflow Core
- Redis
- PostgreSQL, 1 active, 2 passive nodes
- PgPool
- PgBouncer
- 6 replicas per scheduler/webserver/triggerer

### Anything else?

After processing a batch of jobs, the issue starts happening. Task instances go into FAILED status without any apparent reason.

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
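For reference, the `visibility_timeout` change tried in the report, and the `[celery] operation_timeout` setting whose expiry is one plausible source of the `celery.task_timeout_error` StatsD counter, can be set in `airflow.cfg`. A sketch assuming the Airflow 2.5.x configuration layout; the values shown are illustrative, not recommendations:

```ini
[celery_broker_transport_options]
; Seconds the broker waits for a task acknowledgement before redelivering it;
; the 24-hour value tried in the report:
visibility_timeout = 86400

[celery]
; Seconds the executor waits on broker operations such as publishing a task:
operation_timeout = 3.0
```

The same settings can also be supplied via the `AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__VISIBILITY_TIMEOUT` and `AIRFLOW__CELERY__OPERATION_TIMEOUT` environment variables.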

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
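A side note on the malformed log-fetch URL in the report: the empty host in `http://:8793/...` suggests the webserver built the URL from a task instance whose worker hostname was never recorded, which is consistent with the task being marked failed before it ever started on a worker. A minimal sketch of that mechanism; `worker_log_url` is a hypothetical stand-in for illustration, not Airflow's actual code:

```python
# Hypothetical helper: mimics how a webserver could build the worker
# log-fetch URL from the hostname recorded on the task instance.
WORKER_LOG_PORT = 8793  # matches the default [logging] worker_log_server_port


def worker_log_url(hostname: str, relative_path: str) -> str:
    # scheme + recorded worker hostname + log server port + relative log path
    return f"http://{hostname}:{WORKER_LOG_PORT}/log/{relative_path}"


# A task that actually started on a worker has its hostname recorded:
ok = worker_log_url("worker-0.example", "dag_id=d/run_id=r/task_id=t/attempt=1.log")

# A task marked failed before it ever started has an empty hostname, which
# reproduces the malformed "http://:8793/..." URL seen in the report:
bad = worker_log_url("", "dag_id=d/run_id=r/task_id=t/attempt=1.log")

print(ok)   # http://worker-0.example:8793/log/dag_id=d/run_id=r/task_id=t/attempt=1.log
print(bad)  # http://:8793/log/dag_id=d/run_id=r/task_id=t/attempt=1.log
```

Checking the `hostname` column of the affected rows in the `task_instance` metadata table would confirm whether the tasks ever reached a worker.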
