cdibble commented on issue #13542:
URL: https://github.com/apache/airflow/issues/13542#issuecomment-1020844854
>
I am having a similar experience. I'm on `Airflow v2.2.2` with the
`KubernetesExecutor` in AWS EKS. When a DAG run kicks off, tasks are quickly
queued, but only a relatively small fraction of the queued tasks move to a
running state. At the same time, Kubernetes reports `FailedCreatePodSandBox`
events for the tasks that remain queued. My scheduler has plenty of resources,
but I see the `OSError: handle is closed` error in the scheduler logs right
around the time queued tasks stop moving to the running state. Those tasks stay
queued until some of the already-running tasks complete. In my case they do
eventually run and complete, but I can never reach the concurrency I want.
There's plenty of room in the pool, and I am fairly certain I am not limited by
the various concurrency/parallelism settings.
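
For reference, these are the settings I have in mind; the values shown are the Airflow 2.2 defaults from the configuration reference, not my actual config. `worker_pods_creation_batch_size` in particular throttles how many worker pods the `KubernetesExecutor` launches per scheduler loop:

```ini
[core]
# Maximum number of task instances that can run concurrently per scheduler.
parallelism = 32
# Maximum number of tasks allowed to run concurrently within a single DAG.
max_active_tasks_per_dag = 16
# Maximum number of active DAG runs per DAG.
max_active_runs_per_dag = 16

[kubernetes]
# Worker pods created per scheduler loop; a low value here slows how
# quickly queued tasks become running pods.
worker_pods_creation_batch_size = 1
```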
Below are the scheduler logs from when I hit a similar `OSError` to the one
@danmactough reported, along with the `kubectl` events from the period when a
new DAG run was executing and tasks were moving from scheduled to queued to
running.

**scheduler logs**
```
[2022-01-25 05:49:54,253] {kubernetes_executor.py:147} INFO - Event: <redacted_A> had an event of type MODIFIED
[2022-01-25 05:49:54,253] {kubernetes_executor.py:213} INFO - Event: <redacted_A> is Running
[2022-01-25 05:49:54,654] {kubernetes_executor.py:147} INFO - Event: <redacted_B> had an event of type MODIFIED
[2022-01-25 05:49:54,654] {kubernetes_executor.py:213} INFO - Event: <redacted_B> is Running
[2022-01-25 05:49:55,054] {kubernetes_executor.py:147} INFO - Event: <redacted_C> had an event of type MODIFIED
[2022-01-25 05:49:55,054] {kubernetes_executor.py:201} INFO - Event: <redacted_C> Pending
[2022-01-25 05:49:55,653] {kubernetes_executor.py:147} INFO - Event: <redacted_D> had an event of type MODIFIED
[2022-01-25 05:49:55,653] {kubernetes_executor.py:213} INFO - Event: <redacted_D> is Running
[2022-01-25 05:50:06,704] {kubernetes_executor.py:454} INFO - Found 15 queued task instances
[2022-01-25 05:50:45,076] {processor.py:244} WARNING - Killing DAGFileProcessorProcess (PID=12643)
Process ForkProcess-22:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/dag_processing/manager.py", line 277, in _run_processor_manager
    processor_manager.start()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/dag_processing/manager.py", line 510, in start
    return self._run_parsing_loop()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/dag_processing/manager.py", line 570, in _run_parsing_loop
    self._collect_results_from_processor(processor)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/dag_processing/manager.py", line 894, in _collect_results_from_processor
    if processor.result is not None:
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/dag_processing/processor.py", line 322, in result
    if not self.done:
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/dag_processing/processor.py", line 287, in done
    if self._parent_channel.poll():
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 255, in poll
    self._check_closed()
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 136, in _check_closed
    raise OSError("handle is closed")
OSError: handle is closed
[2022-01-25 05:50:45,858] {manager.py:308} WARNING - DagFileProcessorManager (PID=12136) exited with exit code 1 - re-launching
[2022-01-25 05:50:45,861] {manager.py:163} INFO - Launched DagFileProcessorManager with pid: 12873
[2022-01-25 05:50:45,868] {settings.py:52} INFO - Configured default timezone Timezone('UTC')
[2022-01-25 05:50:45,870] {settings.py:462} INFO - Loaded airflow_local_settings from /opt/airflow/config/airflow_local_settings.py .
[2022-01-25 05:50:53,756] {scheduler_job.py:1114} INFO - Resetting orphaned tasks for active dag runs
[2022-01-25 05:51:07,317] {kubernetes_executor.py:454} INFO - Found 15 queued task instances
```
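
For what it's worth, the `OSError` at the bottom of that traceback is just what `multiprocessing` raises when `poll()` is called on a connection whose handle has already been closed. A minimal sketch (the DAG file processor talks to its manager over such a pipe, so this is the failure mode, not the root cause):

```python
from multiprocessing import Pipe

# Polling a multiprocessing connection after its handle has been closed
# raises the same error seen in the scheduler traceback above.
parent_conn, child_conn = Pipe()
parent_conn.close()

try:
    parent_conn.poll()
except OSError as exc:
    error_message = str(exc)

print(error_message)  # handle is closed
```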
**kubectl events**
```
60m  Normal   Created                 pod/<redacted_A>  Created container base
60m  Warning  Failed                  pod/<redacted_B>  Error: context deadline exceeded
60m  Normal   Pulled                  pod/<redacted_C>  Container image "k8s.gcr.io/git-sync/git-sync:v3.3.0" already present on machine
60m  Normal   Pulled                  pod/<redacted_D>  Container image "k8s.gcr.io/git-sync/git-sync:v3.3.0" already present on machine
60m  Warning  FailedCreatePodSandBox  pod/<redacted_E>  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "gefs50etlv4extracts3getatmos2000.88a63f73338a4f38b67f248ded8726dc": operation timeout: context deadline exceeded
60m  Warning  FailedCreatePodSandBox  pod/<redacted_F>  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "<redacted_F>": operation timeout: context deadline exceeded
60m  Warning  FailedCreatePodSandBox  pod/<redacted_G>  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "<redacted_G>": operation timeout: context deadline exceeded
60m  Warning  FailedCreatePodSandBox  pod/<redacted_H>  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "<redacted_H>": operation timeout: context deadline exceeded
60m  Warning  FailedCreatePodSandBox  pod/<redacted_I>  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "<redacted_I>": operation timeout: context deadline exceeded
60m  Warning  FailedCreatePodSandBox  pod/<redacted_J>  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "<redacted_J>": operation timeout: context deadline exceeded
60m  Warning  FailedCreatePodSandBox  pod/<redacted_K>  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "<redacted_K>": operation timeout: context deadline exceeded
```
Happy to provide more info if requested.