cdibble commented on issue #13542:
URL: https://github.com/apache/airflow/issues/13542#issuecomment-1020844854


   I am having a similar experience. I'm on `Airflow v2.2.2` with `KubernetesExecutor` in AWS EKS. I observe that when a DAG run kicks off, tasks are quickly queued, but only a relatively small fraction of the queued tasks are moved to a running state. At the same time, Kubernetes reports `FailedCreatePodSandBox` on the pods for tasks still in the queued state. My scheduler has plenty of resources, but I see the `OSError: handle is closed` error in the scheduler logs right around when queued tasks stop being moved to the running state. Those tasks stay queued until some of the already-running tasks complete; in my case they do eventually run and complete, but I can never reach the concurrency I want. There's plenty of room in the pool, and I am fairly certain I am not limited by the various concurrency/parallelism settings.
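   For context, the settings I mean are along these lines (Airflow 2.2.x option names; the values shown are illustrative defaults, not my actual configuration):
   
   ```ini
   [core]
   # Upper bound on concurrently running task instances across the installation.
   parallelism = 32
   # Per-DAG cap on concurrently running tasks (was dag_concurrency before 2.2).
   max_active_tasks_per_dag = 16
   
   [kubernetes]
   # How many worker pods the KubernetesExecutor creates per scheduler loop.
   worker_pods_creation_batch_size = 1
   ```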
   
   Below are the scheduler logs from when I hit the same `OSError` as @danmactough, as well as the kubectl event logs from the window when a new DAG run was executing and tasks were moving from scheduled to queued to running.
   
   scheduler logs
   ```
   [2022-01-25 05:49:54,253] {kubernetes_executor.py:147} INFO - Event: <redacted_A> had an event of type MODIFIED
   [2022-01-25 05:49:54,253] {kubernetes_executor.py:213} INFO - Event: <redacted_A> is Running
   [2022-01-25 05:49:54,654] {kubernetes_executor.py:147} INFO - Event: <redacted_B> had an event of type MODIFIED
   [2022-01-25 05:49:54,654] {kubernetes_executor.py:213} INFO - Event: <redacted_B> is Running
   [2022-01-25 05:49:55,054] {kubernetes_executor.py:147} INFO - Event: <redacted_C> had an event of type MODIFIED
   [2022-01-25 05:49:55,054] {kubernetes_executor.py:201} INFO - Event: <redacted_C> Pending
   [2022-01-25 05:49:55,653] {kubernetes_executor.py:147} INFO - Event: <redacted_D> had an event of type MODIFIED
   [2022-01-25 05:49:55,653] {kubernetes_executor.py:213} INFO - Event: <redacted_D> is Running
   [2022-01-25 05:50:06,704] {kubernetes_executor.py:454} INFO - Found 15 queued task instances
   [2022-01-25 05:50:45,076] {processor.py:244} WARNING - Killing DAGFileProcessorProcess (PID=12643)
   Process ForkProcess-22:
   Traceback (most recent call last):
     File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
       self.run()
     File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
       self._target(*self._args, **self._kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/dag_processing/manager.py", line 277, in _run_processor_manager
       processor_manager.start()
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/dag_processing/manager.py", line 510, in start
       return self._run_parsing_loop()
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/dag_processing/manager.py", line 570, in _run_parsing_loop
       self._collect_results_from_processor(processor)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/dag_processing/manager.py", line 894, in _collect_results_from_processor
       if processor.result is not None:
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/dag_processing/processor.py", line 322, in result
       if not self.done:
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/dag_processing/processor.py", line 287, in done
       if self._parent_channel.poll():
     File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 255, in poll
       self._check_closed()
     File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 136, in _check_closed
       raise OSError("handle is closed")
   OSError: handle is closed
   [2022-01-25 05:50:45,858] {manager.py:308} WARNING - DagFileProcessorManager (PID=12136) exited with exit code 1 - re-launching
   [2022-01-25 05:50:45,861] {manager.py:163} INFO - Launched DagFileProcessorManager with pid: 12873
   [2022-01-25 05:50:45,868] {settings.py:52} INFO - Configured default timezone Timezone('UTC')
   [2022-01-25 05:50:45,870] {settings.py:462} INFO - Loaded airflow_local_settings from /opt/airflow/config/airflow_local_settings.py .
   [2022-01-25 05:50:53,756] {scheduler_job.py:1114} INFO - Resetting orphaned tasks for active dag runs
   [2022-01-25 05:51:07,317] {kubernetes_executor.py:454} INFO - Found 15 queued task instances
   ```
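   The `OSError` itself is simply what `multiprocessing` raises when a pipe endpoint is polled after it has been closed. A minimal stdlib-only sketch of that failure mode (nothing Airflow-specific; it just reproduces the last two frames of the traceback above):
   
   ```python
   from multiprocessing import Pipe
   
   # The traceback ends in Connection.poll() on a closed handle. Polling a
   # multiprocessing Connection after close() reproduces the same OSError the
   # scheduler hits when it checks a killed DagFileProcessorProcess whose
   # parent channel is already closed.
   parent_conn, child_conn = Pipe()
   parent_conn.close()
   try:
       parent_conn.poll()
   except OSError as err:
       print(err)  # -> handle is closed
   ```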
   
   kubectl events
   ```
   
   60m    Normal    Created                  pod/<redacted_A>    Created container base
   60m    Warning   Failed                   pod/<redacted_B>    Error: context deadline exceeded
   60m    Normal    Pulled                   pod/<redacted_C>    Container image "k8s.gcr.io/git-sync/git-sync:v3.3.0" already present on machine
   60m    Normal    Pulled                   pod/<redacted_D>    Container image "k8s.gcr.io/git-sync/git-sync:v3.3.0" already present on machine
   60m    Warning   FailedCreatePodSandBox   pod/<redacted_E>    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "gefs50etlv4extracts3getatmos2000.88a63f73338a4f38b67f248ded8726dc": operation timeout: context deadline exceeded
   60m    Warning   FailedCreatePodSandBox   pod/<redacted_F>    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "<redacted_F>": operation timeout: context deadline exceeded
   60m    Warning   FailedCreatePodSandBox   pod/<redacted_G>    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "<redacted_G>": operation timeout: context deadline exceeded
   60m    Warning   FailedCreatePodSandBox   pod/<redacted_H>    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "<redacted_H>": operation timeout: context deadline exceeded
   60m    Warning   FailedCreatePodSandBox   pod/<redacted_I>    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "<redacted_I>": operation timeout: context deadline exceeded
   60m    Warning   FailedCreatePodSandBox   pod/<redacted_J>    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "<redacted_J>": operation timeout: context deadline exceeded
   60m    Warning   FailedCreatePodSandBox   pod/<redacted_K>    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "<redacted_K>": operation timeout: context deadline exceeded
   ```
   
   Happy to provide more info if requested.
   

