keysersoza opened a new issue, #40054:
URL: https://github.com/apache/airflow/issues/40054

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### If "Other Airflow 2 version" selected, which one?
   
   2.5.3
   
   ### What happened?
   
   Suddenly tasks started failing without any log or error. They are not even 
starting; they fail as soon as they leave the Queued status.
   
   The log shows:
   ```
   *** Log file does not exist: 
/opt/airflow/logs/dag_id=.../run_id=.../task_id=.../attempt=1.log
   *** Fetching from: 
http://:8793/log/dag_id=.../run_id=.../task_id=.../attempt=1.log
   *** Failed to fetch log file from worker. Request URL is missing an 
'http://' or 'https://' protocol.
   ```
   
   Digging a bit into the Scheduler, we found out that the Scheduler logs:
   ```
   dependency 'Task Instance State' FAILED: Task is in the 'failed' state.
   ```
   
   Something is setting the status of our Tasks to Failed before they even 
started.
   From some more investigation, the issue seems related or similar to: 
https://github.com/apache/airflow/issues/16163
   
   We tried changing `visibility_timeout` on the celery broker transport 
settings to 24hrs, but it did not help.
   
   The workers process a batch of jobs fine, then they start crumble with a lot 
of these Failed with apparently no reason.
   
   We also looked into the statsd metrics and we found the following errors: 
`celery.task_timeout_error` but we couldn't figure out what are these related 
to.
   
   We really have no idea of what could be happening. We run Workers on 
Kubernetes Pods and the resources are fine, no pods are being evicted.
   
   We use redis as a backend (1 instance), postgresql as a DB (1 active 2 
passive nodes) together with pgpool and pgbouncer.
   
   Any ideas? We can't currently upgrade but we are planning one.
   
   ### What you think should happen instead?
   
   Tasks should execute normally 
   
   ### How to reproduce
   
   I don't really know how to reproduce this
   
   ### Operating System
   
   Linux
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Kubernetes Pods for Workers
   Helm chart for Airflow Core
   Redis
   Postgresql 1 active 2 passive nodes
   PgPool
   PgBouncer
   6 replicas per scheduler/webserver/triggerer
   
   ### Anything else?
   
   After processing a batch of jobs, the issue will start happening. Task 
instances go in FAILED status without any apparent reason
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to