joshzana commented on issue #26101:
URL: https://github.com/apache/airflow/issues/26101#issuecomment-1234552838

   +1 seeing the same thing on 2.3.4, KubernetesExecutor.  Wanted to provide 
more info.
   
   We run 40 DAGs with several thousand total daily tasks, and every day we 
have 5-10 tasks that get stuck in Queued state.
   
   We see the following sequence in the logs:
   First, this error appears the first time the task attempts to run:
   ```
   Invalid executor_config for ["tenant_extraction_acme_recruiting_id_72","discovery_finder_on_29735_agg_month","scheduled__2022-08-31T11:00:00+00:00",1,-1]
   ```
   
   Then, in a repeated loop, we see logs like this:
   ```
   {base_executor.py:211} INFO - task TaskInstanceKey(dag_id='tenant_extraction_acme_recruiting_id_72', task_id='discovery_finder_on_29735_agg_month', run_id='scheduled__2022-08-31T11:00:00+00:00', try_number=1, map_index=-1) is still running
   ```
   for each stuck task, followed shortly by:
   ```
   {base_executor.py:215} ERROR - could not queue task TaskInstanceKey(dag_id='tenant_extraction_acme_recruiting_id_72', task_id='discovery_finder_on_29735_agg_month', run_id='scheduled__2022-08-31T11:00:00+00:00', try_number=1, map_index=-1) (still running after 4 attempts)
   ```
   
   This continues until the task is marked as failed in the airflow UI.
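
For anyone who wants to check their own scheduler logs for the same pattern, here is a rough sketch that counts the "could not queue task" errors per task. It assumes the log lines have the same shape as the ones quoted above; `count_unqueued_tasks` and `KEY_RE` are illustrative names, not part of Airflow.

```python
import re
from collections import Counter

# Matches the TaskInstanceKey(...) repr as it appears in the log lines above.
KEY_RE = re.compile(
    r"TaskInstanceKey\(dag_id='(?P<dag_id>[^']+)',\s*"
    r"task_id='(?P<task_id>[^']+)',\s*"
    r"run_id='(?P<run_id>[^']+)',\s*"
    r"try_number=(?P<try_number>\d+),\s*"
    r"map_index=(?P<map_index>-?\d+)\)"
)


def count_unqueued_tasks(log_text: str) -> Counter:
    """Count 'could not queue task' errors per (dag_id, task_id, run_id)."""
    counts = Counter()
    # Log records may be hard-wrapped, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    # Everything after each marker should start with the task key repr.
    for chunk in flat.split("could not queue task")[1:]:
        match = KEY_RE.match(chunk.strip())
        if match:
            key = (match["dag_id"], match["task_id"], match["run_id"])
            counts[key] += 1
    return counts
```

Feeding it the scheduler log text gives one counter entry per stuck task, which makes it easier to see whether the same few tasks are affected every day.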
   
   
   The executor_config on the invalid tasks looks like this:
   ```
   {'pod_override': {<Encoding.VAR: '__var'>: {'spec': {'containers': [{'name': 
'base', 'resources': {'limits': {}, 'requests': {'memory': '16Gi', 'cpu': 
'1'}}}]}}, <Encoding.TYPE: '__type'>: <DagAttributeTypes.POD: 'k8s.V1Pod'>}}
   ```
   
   
   Which is odd, because for the thousands of unaffected tasks it looks like 
this:
   ```
   {'pod_override': {'api_version': None, 'kind': None, 'metadata': None, 
'spec': {'active_deadline_seconds': None, 'affinity': None, 
'automount_service_account_token': None, 'containers': [{'args': None, 
'command': None, 'env': None, 'env_from': None, 'image': None, 
'image_pull_policy': None, 'lifecycle': None, 'liveness_probe': None, 'name': 
'base', 'ports': None, 'readiness_probe': None, 'resources': {'limits': {}, 
'requests': {'cpu': '1', 'memory': '16Gi'}}, 'security_context': None, 
'startup_probe': None, 'stdin': None, 'stdin_once': None, 
'termination_message_path': None, 'termination_message_policy': None, 'tty': 
None, 'volume_devices': None, 'volume_mounts': None, 'working_dir': None}], 
'dns_config': None, 'dns_policy': None, 'enable_service_links': None, 
'ephemeral_containers': None, 'host_aliases': None, 'host_ipc': None, 
'host_network': None, 'host_pid': None, 'hostname': None, 'image_pull_secrets': 
None, 'init_containers': None, 'node_name': None, 'node_selector': None, 
'os': None, 'overhead': None, 'preemption_policy': None, 'priority': None, 
'priority_class_name': None, 'readiness_gates': None, 'restart_policy': None, 
'runtime_class_name': None, 'scheduler_name': None, 'security_context': None, 
'service_account': None, 'service_account_name': None, 'set_hostname_as_fqdn': 
None, 'share_process_namespace': None, 'subdomain': None, 
'termination_grace_period_seconds': None, 'tolerations': None, 
'topology_spread_constraints': None, 'volumes': None}, 'status': None}}
   ```
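
The difference between the two dumps is that the stuck tasks still carry the `__var` / `__type` wrapper around `pod_override`, while the healthy ones hold the plain pod dict. A minimal sketch of a check for that, assuming the config is available as a Python dict shaped like the dumps above. The `Encoding` class here is a local stand-in that mirrors the enum names in the dump, not an import from Airflow, and `is_still_serialized` is a hypothetical helper.

```python
from enum import Enum


class Encoding(str, Enum):
    """Local stand-in for the enum seen in the dump above (not Airflow's class)."""

    TYPE = "__type"
    VAR = "__var"


def is_still_serialized(executor_config: dict) -> bool:
    """Return True if pod_override is a {__var, __type} wrapper, not a pod dict."""
    pod_override = executor_config.get("pod_override")
    if not isinstance(pod_override, dict):
        return False
    # Normalise keys so plain strings and enum members are treated alike.
    keys = {k.value if isinstance(k, Enum) else k for k in pod_override}
    return {"__var", "__type"} <= keys
```

Run against the two configs above, this should flag the first form and pass the second, which could help narrow down where the wrapper fails to get unwrapped.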


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
