rodrigoechaide commented on issue #11379:
URL: https://github.com/apache/airflow/issues/11379#issuecomment-891850328


   Hi @Siddharthk, are you still facing this issue? I am hitting the same problem while running stress tests on Airflow with a DAG that launches 500 parallel tasks. Each task takes an iterator parameter that lets me control its duration, and the task length does not seem to matter: the failure shows up with tasks lasting anywhere from a few seconds to more than 20 minutes. I am using the KubernetesExecutor, and when fetching the pods with kubectl I get this:
   
   ```
   k get pods -n airflow | grep Error
   
   performancetest500tasksinparallel20taskperformancetest500tasksd.04c772dbda6c47b79b017c90b73055af   0/1     Error       0          8m27s
   performancetest500tasksinparallel20taskperformancetest500tasksd.05269b668c8043c7b7ac32c0e06ce2bc   0/1     Error       0          6m31s
   performancetest500tasksinparallel20taskperformancetest500tasksd.0819b03e3fda475abfd3893dc7598ffb   0/1     Error       0          8m56s
   performancetest500tasksinparallel20taskperformancetest500tasksd.09f9fa1367194deabead2b7d6de72c83   0/1     Error       0          8m2s
   performancetest500tasksinparallel20taskperformancetest500tasksd.0c61b81e7dfc4d17846c89d78eefac0c   0/1     Error       0          5m59s
   performancetest500tasksinparallel20taskperformancetest500tasksd.0d0b39ea912a48c898d13b5392c0ee7e   0/1     Error       0          8m41s
   performancetest500tasksinparallel20taskperformancetest500tasksd.0d1e17539b934616a0f72a05b530d88e   0/1     Error       0          8m33s
   performancetest500tasksinparallel20taskperformancetest500tasksd.12e3fd2a030340589e251c987652c61e   0/1     Error       0          9m16s
   performancetest500tasksinparallel20taskperformancetest500tasksd.1312a64638e34ee488d5f8839a29c0e6   0/1     Error       0          7m25s
   performancetest500tasksinparallel20taskperformancetest500tasksd.1508cf02371d4dff8c925a3855a60911   0/1     Error       0          7m31s
   performancetest500tasksinparallel20taskperformancetest500tasksd.1d3c9140a24e42c29fe5def938832759   0/1     Error       0          7m17s
   performancetest500tasksinparallel20taskperformancetest500tasksd.1e5cee28a93b4f62bc1c06d1bb6ed785   0/1     Error       0          8m30s
   performancetest500tasksinparallel20taskperformancetest500tasksd.214e5df400c24764b9104e5e324dc314   0/1     Error       0          8m55s
   performancetest500tasksinparallel20taskperformancetest500tasksd.272b9e6502ce49078c68741731aa8144   0/1     Error       0          7m39s
   performancetest500tasksinparallel20taskperformancetest500tasksd.2840867f20a34a4fae6ad71ff1ef2803   0/1     Error       0          6m3s
   performancetest500tasksinparallel20taskperformancetest500tasksd.2aca869d190d4a17a60653788d73e090   0/1     Error       0          7m22s
   performancetest500tasksinparallel20taskperformancetest500tasksd.2d6f588cba464f2c9aec0f75eff105a5   0/1     Error       0          6m32s
   performancetest500tasksinparallel20taskperformancetest500tasksd.31513adf9a4d4faa910b8eeedf53b960   0/1     Error       0          8m48s
   performancetest500tasksinparallel20taskperformancetest500tasksd.3600857bd1784617b4322ec304924870   0/1     Error       0          8m58s
   performancetest500tasksinparallel20taskperformancetest500tasksd.3659ef7cbcb345e99ba557e6ca6b881d   0/1     Error       0          9m1s
   ```
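   
   For context, the DAG itself is, in simplified form, roughly the sketch below. The real file is /opt/airflow/dags/git/performance_test_500_tasks_in_parallel_2_0.py; the task ids and the sleep-based task body here are simplified stand-ins, with the iterator parameter reduced to an `iterations` count that I tune to stretch or shrink each task's duration:
   
   ```
   # Simplified sketch of the stress-test DAG: 500 independent tasks,
   # each sleeping in a loop so `iterations` controls how long it runs.
   import time
   from datetime import datetime
   
   from airflow import DAG
   from airflow.operators.python import PythonOperator
   
   ITERATIONS = 100  # tune to make tasks last seconds or tens of minutes
   
   def busy_work(iterations: int) -> None:
       for _ in range(iterations):
           time.sleep(0.1)  # stand-in for real work
   
   with DAG(
       dag_id="performance_test_500_tasks_in_parallel_2_0",
       start_date=datetime(2021, 8, 1),
       schedule_interval=None,
       catchup=False,
   ) as dag:
       for i in range(500):
           PythonOperator(
               task_id=f"task_{i}",
               python_callable=busy_work,
               op_kwargs={"iterations": ITERATIONS},
           )
   ```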
   
   And when I check the logs of one of the failed pods, I see this error:
   
   ```
   k logs performancetest500tasksinparallel20taskperformancetest500tasksd.6426f08f727c4f15b2c041ce98f163d5 -n airflow
   [2021-08-03 12:44:59,468] {cli_action_loggers.py:105} WARNING - Failed to log action with (psycopg2.OperationalError) could not translate host name "qa-airflow.carnijjbfa3r.eu-west-1.rds.amazonaws.com" to address: Temporary failure in name resolution
   
   (Background on this error at: http://sqlalche.me/e/13/e3q8)
   [2021-08-03 12:44:59,469] {dagbag.py:496} INFO - Filling up the DagBag from /opt/airflow/dags/git/performance_test_500_tasks_in_parallel_2_0.py
   Traceback (most recent call last):
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
       return fn()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 364, in connect
       return _ConnectionFairy._checkout(self)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
       fairy = _ConnectionRecord.checkout(pool)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
       rec = pool._do_get()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 241, in _do_get
       return self._create_connection()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
       return _ConnectionRecord(self)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
       self.__connect(first_connect_check=True)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
       pool.logger.debug("Error on connect(): %s", e)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
       compat.raise_(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
       raise exception
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
       connection = pool._invoke_creator(self)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
       return dialect.connect(*cargs, **cparams)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 508, in connect
       return self.dbapi.connect(*cargs, **cparams)
     File "/usr/local/lib/python3.9/site-packages/psycopg2/__init__.py", line 122, in connect
       conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
   psycopg2.OperationalError: could not translate host name "qa-airflow.carnijjbfa3r.eu-west-1.rds.amazonaws.com" to address: Temporary failure in name resolution
   
   
   The above exception was the direct cause of the following exception:
   
   Traceback (most recent call last):
     File "/usr/local/bin/airflow", line 8, in <module>
       sys.exit(main())
     File "/usr/local/lib/python3.9/site-packages/airflow/__main__.py", line 40, in main
       args.func(args)
     File "/usr/local/lib/python3.9/site-packages/airflow/cli/cli_parser.py", line 48, in command
       return func(*args, **kwargs)
     File "/usr/local/lib/python3.9/site-packages/airflow/utils/cli.py", line 91, in wrapper
       return f(*args, **kwargs)
     File "/usr/local/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 227, in task_run
       ti.refresh_from_db()
     File "/usr/local/lib/python3.9/site-packages/airflow/utils/session.py", line 70, in wrapper
       return func(*args, session=session, **kwargs)
     File "/usr/local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 625, in refresh_from_db
       ti = qry.first()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3429, in first
       ret = list(self[0:1])
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3203, in __getitem__
       return list(res)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3535, in __iter__
       return self._execute_and_instances(context)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3556, in _execute_and_instances
       conn = self._get_bind_args(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3571, in _get_bind_args
       return fn(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3550, in _connection_from_session
       conn = self.session.connection(**kw)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1142, in connection
       return self._connection_for_bind(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1150, in _connection_for_bind
       return self.transaction._connection_for_bind(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 433, in _connection_for_bind
       conn = bind._contextual_connect()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2302, in _contextual_connect
       self._wrap_pool_connect(self.pool.connect, None),
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2339, in _wrap_pool_connect
       Connection._handle_dbapi_exception_noconnection(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1583, in _handle_dbapi_exception_noconnection
       util.raise_(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
       raise exception
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
       return fn()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 364, in connect
       return _ConnectionFairy._checkout(self)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
       fairy = _ConnectionRecord.checkout(pool)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
       rec = pool._do_get()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 241, in _do_get
       return self._create_connection()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
       return _ConnectionRecord(self)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
       self.__connect(first_connect_check=True)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
       pool.logger.debug("Error on connect(): %s", e)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
       compat.raise_(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
       raise exception
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
       connection = pool._invoke_creator(self)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
       return dialect.connect(*cargs, **cparams)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 508, in connect
       return self.dbapi.connect(*cargs, **cparams)
     File "/usr/local/lib/python3.9/site-packages/psycopg2/__init__.py", line 122, in connect
       conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
   sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name "qa-airflow.carnijjbfa3r.eu-west-1.rds.amazonaws.com" to address: Temporary failure in name resolution
   ```
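   
   The failure happens before any SQL is even sent: psycopg2 cannot resolve the RDS hostname at all, so this looks like cluster DNS struggling under the burst of worker pods rather than a database problem. A quick, Airflow-agnostic way to reproduce just the lookup from inside a worker pod is a snippet like this:
   
   ```
   # Minimal DNS probe: a getaddrinfo() failure with EAI_AGAIN is the
   # same "Temporary failure in name resolution" psycopg2 reports above.
   import socket
   
   host = "qa-airflow.carnijjbfa3r.eu-west-1.rds.amazonaws.com"
   try:
       addrs = socket.getaddrinfo(host, 5432, proto=socket.IPPROTO_TCP)
       print(f"resolved {host} -> {addrs[0][4][0]}")
   except socket.gaierror as exc:
       print(f"DNS lookup failed: {exc}")
   ```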
   
   These are some of the configuration variables of my Airflow cluster:
   
   ```
     AIRFLOW_HOME: "/opt/airflow"
     AIRFLOW__CORE__DAGS_FOLDER: "/opt/airflow/dags/git"
     AIRFLOW__LOGGING__BASE_LOG_FOLDER: "/opt/airflow/logs"
     AIRFLOW__LOGGING__LOGGING_LEVEL: "INFO" # DEBUG, INFO, WARNING, ERROR or CRITICAL.
     AIRFLOW__LOGGING__FAB_LOGGING_LEVEL: "WARNING"
     AIRFLOW__LOGGING__LOG_FILENAME_TEMPLATE: "{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log"
     AIRFLOW__LOGGING__LOG_FORMAT: "%(message)s"
     AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: "60"
     AIRFLOW__CORE__DAG_CONCURRENCY: "500"
     AIRFLOW__CORE__PARALLELISM: "500"
     AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE: "0"
     AIRFLOW__CORE__EXECUTOR: "KubernetesExecutor"
     AIRFLOW__API__AUTH_BACKEND: "airflow.api.auth.backend.default"
     AIRFLOW__CORE__LOAD_EXAMPLES: "False"
     AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG: "1.1"
     AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "True"
     AIRFLOW__KUBERNETES__NAMESPACE: "airflow"
     AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE: "1" 
     AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE: "/opt/airflow/template.yaml"
     AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: "False"
   ```
   
   Besides that config, I set the `default_pool` size to 500 slots so that all 500 tasks can actually run in parallel (one way to do this is sketched below).
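   
   For reference, resizing the pool programmatically would look roughly like the sketch below; this is just an illustration against the Airflow 2.x metadata models, and the UI or the `airflow pools set` CLI command achieves the same thing:
   
   ```
   # Illustrative sketch: set default_pool to 500 slots via the metadata DB.
   from airflow.models import Pool
   from airflow.utils.session import create_session
   
   with create_session() as session:
       pool = session.query(Pool).filter(Pool.pool == Pool.DEFAULT_POOL_NAME).one()
       pool.slots = 500  # lets up to 500 task instances run concurrently
       session.merge(pool)  # committed when the context manager exits
   ```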

