sa1 opened a new issue, #34405:
URL: https://github.com/apache/airflow/issues/34405

   ### Apache Airflow version
   
   2.7.1
   
   ### What happened
   
   I enabled the experimental OTEL integration, and sometimes the connection to 
the OTEL collector fails. Such connection failures are expected and common. 
However, right now they cause the task to fail, which adds an extra point of 
failure to each task and DAG. Sometimes the failure occurs before the DAG 
has even started, so task-level retries can't help.
   
   The only error message I see in this case is the connection failure.
   ```
   urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', 
port=9999): Max retries exceeded with url: /v1/metrics (Caused by 
NewConnectionError('<urllib3.connection.HTTPConnection object at 
0x7ff41c054430>: Failed to establish a new connection: [Errno 111] Connection 
refused'))
   ```
   
   This is not printed to the Airflow UI, only to the worker logs, so it's not 
obvious why a task/DAG failed.
   
   ### What you think should happen instead
   
   In this situation, Airflow should print a warning and continue with the task.
   
   When any other Python application is auto-instrumented with OTEL, the 
automatic instrumentation behaves as desired: it ignores connection 
failures and only prints a warning message.
   
   Maybe this setting could be configurable, but the desired behaviour should 
be to ignore the exception.
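
   As a rough illustration of the "warn and continue" behaviour described 
above, here is a minimal sketch. It is not Airflow's actual implementation; 
`warn_on_export_failure`, `export_metrics`, and `run_task` are hypothetical 
names, and the simulated exporter simply raises the same kind of connection 
error shown in the log.

```python
import functools
import logging

log = logging.getLogger("metrics_guard")  # hypothetical logger name


def warn_on_export_failure(func):
    """Run func; on any exception, log a warning and return None instead of raising."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # deliberately broad: metrics should never fail the task
            log.warning("Metrics export failed, continuing: %s", exc)
            return None

    return wrapper


@warn_on_export_failure
def export_metrics(payload):
    # Stand-in for a real exporter call; simulates the collector being down.
    raise ConnectionRefusedError(111, "Connection refused")


def run_task():
    export_metrics({"task_duration": 1.5})  # a failure here is logged, not raised
    return "task finished"
```

   With a guard like this, `run_task()` still completes when the collector is 
unreachable, and the only trace of the failure is a warning in the logs.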
   
   ### How to reproduce
   
   Enable the OTEL integration and turn off the collector. Run any DAG/task and 
it will fail.
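
   To confirm the collector is really down before running the DAG, a small 
probe like the one below may help. This is only a sketch: 
`collector_reachable` is a hypothetical helper, and the default host and port 
are taken from the error message above (`localhost:9999`), so adjust them to 
your setup.

```python
import socket


def collector_reachable(host="localhost", port=9999, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError and timeouts
        return False
```

   If `collector_reachable()` returns `False`, the precondition for the 
reproduction is met.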
   
   ### Operating System
   
   Ubuntu 22.04.3 LTS
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==8.6.0
   apache-airflow-providers-celery==3.3.3
   apache-airflow-providers-common-sql==1.7.1
   apache-airflow-providers-ftp==3.5.1
   apache-airflow-providers-http==4.5.1
   apache-airflow-providers-imap==3.3.1
   apache-airflow-providers-openlineage==1.0.2
   apache-airflow-providers-postgres==5.6.0
   apache-airflow-providers-redis==3.3.1
   apache-airflow-providers-slack==8.0.0
   apache-airflow-providers-snowflake==5.0.0
   apache-airflow-providers-sqlite==3.4.3
   apache-airflow-providers-ssh==3.7.2
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Deployment details
   
   Docker-based custom deployment on ECS Fargate.
   Separate Fargate tasks for webserver, worker, scheduler and triggerer.
   An OTEL collector is running as an agent in each task.
   
   ### Anything else
   
   The task fails every time the connection to the OTEL collector fails. Why 
the collector sometimes fails is the subject of a separate investigation; 
it may be related to the size of the data/metrics being sent to 
the collector. I believe those reasons are not relevant to this bug.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

