jonathanjuursema commented on issue #28010:
URL: https://github.com/apache/airflow/issues/28010#issuecomment-1346775177

   Of course! While trying to reproduce the issue, the situation changed. (I'm not sure why. I didn't commit the last configuration because I couldn't get it to work, so to reproduce I've started the process again from scratch. I'll make sure to save the config this time so that we can iterate on it if needed.)
   
   The docker containers for the worker, webserver and scheduler have the 
following environment variables set (all config is done via environment 
variables):
   ```
   (airflow)printenv | grep AIRFLOW
   AIRFLOW__CORE__HOSTNAME_CALLABLE=socket.gethostname
   AIRFLOW__CORE__LOAD_EXAMPLES=false
   AIRFLOW_INSTALLATION_METHOD=
   AIRFLOW_USER_HOME_DIR=/home/airflow
   AIRFLOW__SMTP__SMTP_PASSWORD=xxx
   AIRFLOW__SMTP__SMTP_HOST=xxx
   AIRFLOW_PIP_VERSION=22.3.1
   AIRFLOW__SMTP__SMTP_USER=xxx
   AIRFLOW__SMTP__SMTP_SSL=false
   AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=3600
   AIRFLOW_HOME=/opt/airflow
   AIRFLOW__DATABASE__SQL_ALCHEMY_CONN_CMD=/opt/airflow/airflow_construct_sql_conn_str.sh
   AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=300
   AIRFLOW__SMTP__SMTP_PORT=587
   AIRFLOW__SMTP__SMTP_STARTTLS=true
   AIRFLOW_UID=50000
   AIRFLOW__API__AUTH_BACKENDS=airflow.api.auth.backend.basic_auth, airflow.api.auth.backend.session
   AIRFLOW__CORE__ENABLE_XCOM_PICKLING=true
   AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=retail_celery_config.CELERY_CONFIG
   AIRFLOW__CORE__EXECUTOR=CeleryExecutor
   AIRFLOW__CORE__FERNET_KEY=xxx
   AIRFLOW__SCHEDULER__CATCHUP_BY_DEFAULT=false
   AIRFLOW__CELERY__RESULT_BACKEND_CMD=/opt/airflow/airflow_construct_dbsql_conn_str.sh
   AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG
   AIRFLOW__WEBSERVER__WEB_SERVER_PORT=8080
   AIRFLOW_VERSION=2.4.3
   AIRFLOW__SMTP__SMTP_MAIL_FROM=xxx
   AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=true
   ```
   
   ```
   (airflow)printenv | grep CELERY
   CELERY_SSL_ACTIVE=true
   AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=retail_celery_config.CELERY_CONFIG
   AIRFLOW__CELERY__RESULT_BACKEND_CMD=/opt/airflow/airflow_construct_dbsql_conn_str.sh
   ```
   
   ```
   (airflow)printenv | grep REDIS
   REDIS_BROKER_MASTER_PASSWORD=xxx
   REDIS_BROKER_MASTER_NAME=xxx
   REDIS_BROKER_URL=sentinel://xxx:26379;sentinel://xxx:26379;sentinel://xxx:26379
   ```
   
   I've also mounted the following file at `/opt/airflow/config/retail_celery_config.py`:
   ```python
   from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG
   import os
   
   CELERY_CONFIG = {
       **DEFAULT_CELERY_CONFIG,
       'broker_url': '{broker_url}?ssl_cert_reqs=none'.format(broker_url=os.getenv('REDIS_BROKER_URL')),
       'broker_transport_options': {
           'password': os.getenv('REDIS_BROKER_MASTER_PASSWORD'),
           'master_name': os.getenv('REDIS_BROKER_MASTER_NAME')
       }
   }
   ```
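
   One caveat about the `broker_url` line above: as far as I understand, Celery parses each `;`-separated `sentinel://` URL on its own, so appending `?ssl_cert_reqs=none` once to the joined string would attach it only to the last sentinel. A minimal sketch of a (hypothetical, not from my actual config) helper that attaches the parameter to every entry instead:
   ```python
   def with_ssl_param(broker_url: str, param: str = "ssl_cert_reqs=none") -> str:
       """Append the query parameter to every ';'-separated sentinel URL."""
       return ";".join(f"{url}?{param}" for url in broker_url.split(";"))

   # with_ssl_param("sentinel://a:26379;sentinel://b:26379")
   # -> "sentinel://a:26379?ssl_cert_reqs=none;sentinel://b:26379?ssl_cert_reqs=none"
   ```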
   
   What I now observe is interesting. I _think_ the scheduler is working. Neither the webserver nor the scheduler is throwing relevant errors, and the webserver doesn't show the "scheduler hasn't run in xxx minutes" banner. If there are additional checks I can do, please let me know.
   
   However, the worker still won't start:
   ```
   [2022-12-12 15:46:10,811: ERROR/MainProcess] consumer: Cannot connect to sentinel://xxx:26379//: No master found for 'xxx'.
   Will retry using next failover.

   [2022-12-12 15:46:10,828: ERROR/MainProcess] consumer: Cannot connect to sentinel://xxx:26379//: No master found for 'xxx'.
   Will retry using next failover.

   [2022-12-12 15:46:10,840: ERROR/MainProcess] consumer: Cannot connect to sentinel://xxx:26379//: No master found for 'xxx'.
   Trying again in 32.00 seconds... (16/100)
   
   This indicates that the worker _does_ fetch the right values (or at least the master name and the sentinel list). Using the reference `redis-cli` I can validate that the Sentinel configuration itself works:
   
   ```
   ➜  src ./redis-cli -p 26379 --tls --insecure
   127.0.0.1:26379> sentinel get-master-addr-by-name non-existing-master-name
   (nil)
   127.0.0.1:26379> sentinel get-master-addr-by-name xxx
   1) "xx.xx.xx.xx"
   2) "7003"
   ```
   
   It should be noted that both the Sentinels and the Redis masters use certificates from a non-public CA (to make things even worse). So either the scheduler/webserver accept the `ssl_cert_reqs=none` from above and the worker doesn't, or the worker actually attempts an SSL connection while the scheduler/webserver either don't attempt one or don't log the attempt.
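
   If the per-URL query string turns out to be the problem, Celery also accepts SSL options programmatically via `broker_use_ssl`, which (as far as I understand) is handed to redis-py for the broker connection without needing URL parameters at all. A sketch of the options I'd merge into the `CELERY_CONFIG` dict above (untested assumption on my side that this applies to the Sentinel transport too):
   ```python
   import ssl

   # Sketch only: merge into CELERY_CONFIG in retail_celery_config.py above.
   CELERY_SSL_OPTIONS = {
       'broker_use_ssl': {
           'ssl_cert_reqs': ssl.CERT_NONE,  # disable verification while testing
       },
   }
   ```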
   
   It should also be noted that we use an internally managed Redis/Sentinel cluster (which I don't control; we're just a user). However, we have various other applications using Redis in the same cluster, deployed on the same application machines (effectively using the same firewall/network path), and those applications work as intended, so my first hunch is that the problem is not with the Redis/Sentinel cluster itself.
   
   For now I'm ignoring certificate validation, but if I can get this to work 
I'd like to mount the CA PEM and specify that in the broker URL.
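
   For reference, redis-py's connection URLs also accept `ssl_ca_certs` (alongside `ssl_cert_reqs`) as a query parameter, so once the CA PEM is mounted the URL could be built roughly like this (the mount path is hypothetical):
   ```python
   # Hypothetical mount point for the internal CA bundle.
   ca_path = "/opt/airflow/certs/internal-ca.pem"
   sentinel = "sentinel://xxx:26379"  # placeholder host, as above
   broker_url = f"{sentinel}?ssl_cert_reqs=required&ssl_ca_certs={ca_path}"
   ```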
   
   Please let me know if you need any additional information, snippets or if 
you have further troubleshooting ideas. The help is greatly appreciated!

