jonathanjuursema commented on issue #28010:
URL: https://github.com/apache/airflow/issues/28010#issuecomment-1346775177
Of course! While trying to reproduce the issue, the situation has changed.
(I'm not sure why; I didn't commit the last configuration because I couldn't
get it to work, so to reproduce it I've started the process again. I'll make
sure to save the config this time so that we can iterate on it if needed.)
The docker containers for the worker, webserver and scheduler have the
following environment variables set (all config is done via environment
variables):
```
(airflow)printenv | grep AIRFLOW
AIRFLOW__CORE__HOSTNAME_CALLABLE=socket.gethostname
AIRFLOW__CORE__LOAD_EXAMPLES=false
AIRFLOW_INSTALLATION_METHOD=
AIRFLOW_USER_HOME_DIR=/home/airflow
AIRFLOW__SMTP__SMTP_PASSWORD=xxx
AIRFLOW__SMTP__SMTP_HOST=xxx
AIRFLOW_PIP_VERSION=22.3.1
AIRFLOW__SMTP__SMTP_USER=xxx
AIRFLOW__SMTP__SMTP_SSL=false
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=3600
AIRFLOW_HOME=/opt/airflow
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN_CMD=/opt/airflow/airflow_construct_sql_conn_str.sh
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=300
AIRFLOW__SMTP__SMTP_PORT=587
AIRFLOW__SMTP__SMTP_STARTTLS=true
AIRFLOW_UID=50000
AIRFLOW__API__AUTH_BACKENDS=airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session
AIRFLOW__CORE__ENABLE_XCOM_PICKLING=true
AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=retail_celery_config.CELERY_CONFIG
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CORE__FERNET_KEY=xxx
AIRFLOW__SCHEDULER__CATCHUP_BY_DEFAULT=false
AIRFLOW__CELERY__RESULT_BACKEND_CMD=/opt/airflow/airflow_construct_dbsql_conn_str.sh
AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG
AIRFLOW__WEBSERVER__WEB_SERVER_PORT=8080
AIRFLOW_VERSION=2.4.3
AIRFLOW__SMTP__SMTP_MAIL_FROM=xxx
AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=true
```
```
(airflow)printenv | grep CELERY
CELERY_SSL_ACTIVE=true
AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=retail_celery_config.CELERY_CONFIG
AIRFLOW__CELERY__RESULT_BACKEND_CMD=/opt/airflow/airflow_construct_dbsql_conn_str.sh
```
```
(airflow)printenv | grep REDIS
REDIS_BROKER_MASTER_PASSWORD=xxx
REDIS_BROKER_MASTER_NAME=xxx
REDIS_BROKER_URL=sentinel://xxx:26379;sentinel://xxx:26379;sentinel://xxx:26379
```
I've also mounted the following file in
`/opt/airflow/config/retail_celery_config.py`:
```python
from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG
import os

CELERY_CONFIG = {
    **DEFAULT_CELERY_CONFIG,
    'broker_url': '{broker_url}?ssl_cert_reqs=none'.format(
        broker_url=os.getenv('REDIS_BROKER_URL')),
    'broker_transport_options': {
        'password': os.getenv('REDIS_BROKER_MASTER_PASSWORD'),
        'master_name': os.getenv('REDIS_BROKER_MASTER_NAME'),
    },
}
```
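One thing I'm not sure about (so treat this as an assumption, not a confirmed fix): kombu's Redis Sentinel transport takes the connection options for the *sentinel* hop from a separate `sentinel_kwargs` key in `broker_transport_options`, while the `?ssl_cert_reqs=none` query string only applies to the connection to the master. If that's right, a variant like the following sketch might be needed for the worker to speak TLS to the sentinels themselves (the `sentinel_kwargs` contents are my assumption about what gets passed through to `redis.sentinel.Sentinel`):

```python
import os
import ssl

# Hypothetical variant of the transport options above. 'sentinel_kwargs' is
# (assumed to be) forwarded to the Redis connections made to the sentinel
# nodes, so the TLS flags for that hop would go here rather than in the URL.
broker_transport_options = {
    'password': os.getenv('REDIS_BROKER_MASTER_PASSWORD'),
    'master_name': os.getenv('REDIS_BROKER_MASTER_NAME'),
    'sentinel_kwargs': {
        'password': os.getenv('REDIS_BROKER_MASTER_PASSWORD'),
        'ssl': True,
        # Skip certificate validation while testing, mirroring
        # ssl_cert_reqs=none on the broker URL.
        'ssl_cert_reqs': ssl.CERT_NONE,
    },
}
```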
What I now observe is interesting. I _think_ the scheduler is working.
Neither the webserver nor the scheduler is throwing relevant errors, and the
webserver doesn't show the "scheduler hasn't run in xxx minutes" banner. If
there are additional checks I can do, please let me know.
However, the worker still won't start:
```
[2022-12-12 15:46:10,811: ERROR/MainProcess] consumer: Cannot connect to sentinel://xxx:26379//: No master found for 'xxx'.
Will retry using next failover.
[2022-12-12 15:46:10,828: ERROR/MainProcess] consumer: Cannot connect to sentinel://xxx:26379//: No master found for 'xxx'.
Will retry using next failover.
[2022-12-12 15:46:10,840: ERROR/MainProcess] consumer: Cannot connect to sentinel://xxx:26379//: No master found for 'xxx'.
Trying again in 32.00 seconds... (16/100)
```
This indicates that it _does_ fetch the right values (or at least the master
name and sentinel list). Using the reference `redis-cli` I can validate that
the Redis configuration itself works:
```
➜ src ./redis-cli -p 26379 --tls --insecure
127.0.0.1:26379> sentinel get-master-addr-by-name non-existing-master-name
(nil)
127.0.0.1:26379> sentinel get-master-addr-by-name xxx
1) "xx.xx.xx.xx"
2) "7003"
```
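To rule out redis-py (which the worker uses under the hood, unlike `redis-cli`), the same check could be done from Python. This is only a diagnostic sketch under my assumptions: it reuses the environment variables above, splits the semicolon-separated sentinel list myself, and disables certificate validation the same way `ssl_cert_reqs=none` does:

```python
import os
import ssl
from urllib.parse import urlparse


def parse_sentinel_list(broker_url):
    """Turn 'sentinel://a:26379;sentinel://b:26379' into [('a', 26379), ...]."""
    pairs = []
    for part in broker_url.split(';'):
        u = urlparse(part)
        pairs.append((u.hostname, u.port or 26379))
    return pairs


if __name__ == '__main__':
    # Mirrors the redis-cli 'sentinel get-master-addr-by-name' check above.
    from redis.sentinel import Sentinel

    sentinels = parse_sentinel_list(os.environ['REDIS_BROKER_URL'])
    client = Sentinel(
        sentinels,
        sentinel_kwargs={
            'password': os.environ['REDIS_BROKER_MASTER_PASSWORD'],
            'ssl': True,
            'ssl_cert_reqs': ssl.CERT_NONE,  # like --insecure in redis-cli
        },
    )
    print(client.discover_master(os.environ['REDIS_BROKER_MASTER_NAME']))
```

If this prints the same `(ip, port)` tuple that `redis-cli` returns, the Python client stack can reach the sentinels and the problem is more likely in how Celery/kombu builds its connections.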
It should be noted that both the Sentinels and the Redis masters use
certificates from a non-public CA (to make things even worse). So either the
scheduler/webserver honor the `ssl_cert_reqs=none` from above while the
worker doesn't, or the worker actually attempts an SSL connection while the
scheduler/webserver either don't attempt one or don't log failed attempts.
It should also be noted that we use an internally managed Redis/Sentinel
cluster (which I don't control; we're just a user). However, we have various
other applications using Redis in the same cluster, deployed from the same
application machines (effectively using the same firewall/network path), and
those applications work as intended, so my first hunch is that the problem is
not with the Redis/Sentinel cluster itself.
For now I'm ignoring certificate validation, but if I can get this to work
I'd like to mount the CA PEM and specify that in the broker URL.
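For that later step, my understanding (again an assumption, not verified) is that kombu's Redis transport also accepts `ssl_ca_certs` and `ssl_cert_reqs` as query parameters on the broker URL, so once the CA bundle is mounted it could look roughly like this (the mount path is just an example):

```python
import os

# Hypothetical broker URL once validation is re-enabled: reference the mounted
# internal CA bundle and require certificate verification instead of 'none'.
broker_url = (
    '{base}?ssl_cert_reqs=required'
    '&ssl_ca_certs=/opt/airflow/certs/internal-ca.pem'
).format(base=os.getenv('REDIS_BROKER_URL', 'sentinel://xxx:26379'))
```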
Please let me know if you need any additional information or snippets, or if
you have further troubleshooting ideas. The help is greatly appreciated!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]