[ 
https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651649#comment-15651649
 ] 

Kevin Gao commented on AIRFLOW-401:
-----------------------------------

We seem to be running into a similar issue on both versions 1.7.0 and 1.7.1.3. 
I'm wondering whether this behavior is expected and we're simply using the 
airflow scheduler + LocalExecutor incorrectly. From what I can tell, the 
scheduler may have run through the last iteration of its process lifetime, but 
is waiting for its child processes (the local executors running the 
long-running tasks) to complete. As a result, no further tasks can be 
scheduled until the long-running task finishes. My current thinking is that we 
should probably switch to the CeleryExecutor to make the scheduler independent 
of the executors.

Relevant configs:
{code:ini}
[core]
executor = LocalExecutor
parallelism = 32
dag_concurrency = 16
dags_are_paused_at_creation = False
max_active_runs_per_dag = 16

[scheduler]
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
{code}

The scheduler is run using upstart with {{-n 5}}, i.e. for 5 iterations.

Some symptoms:
- The scheduler produces no logs
- The scheduler appears to be blocked on a long-running task
- 31 of the 32 airflow child processes are listed as defunct
- Killing the long-running task allows the scheduler to become "unstuck". The 
scheduler then seems to finish its final iteration and is respawned by 
upstart.
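
To make the "defunct" symptom easier to check, here is a minimal sketch that lists zombie children of a given parent PID by reading Linux's {{/proc/<pid>/stat}}. It is not part of Airflow; the names {{parse_stat}} and {{defunct_children}} are made up for this example, and it assumes a Linux host with {{/proc}} mounted.

```python
import os


def parse_stat(line):
    """Return (pid, comm, state, ppid) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    fields = tail.split()
    return int(pid_text), comm, fields[0], int(fields[1])


def defunct_children(parent_pid, proc_root="/proc"):
    """List PIDs of zombie (state 'Z') processes whose parent is parent_pid."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, _comm, state, ppid = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process went away while we were scanning
        if ppid == parent_pid and state == "Z":
            zombies.append(pid)
    return sorted(zombies)
```

Calling {{defunct_children(4984)}} with the root scheduler PID from the pstree output would return the PIDs shown there in parentheses, assuming those are still unreaped.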

Here is the output from pstree:
{code}
─airflow,4984 usr/local/bin/airflow scheduler -n 5
   ├─(airflow,4990)
   ├─(airflow,4991)
   ├─airflow,4992 usr/local/bin/airflow scheduler -n 5
   │   └─airflow,5086 /usr/local/bin/airflow run dag_name 2016-11-09T01:20:00 
--local -sd DAGS_FOLDER/dag_name.py
   │       └─airflow,5092 /usr/local/bin/airflow run dag_name dag_name 
2016-11-09T01:20:00 --job_id 582112 --raw -sd DAGS_FOLDER/dag_name.py
   │           └─bash,5102 /tmp/airflowtmpOyW_H1/dag_nameRf_OMJ
   │               ├─sudo,5105 -u someuser node /path/to/some_script.js
   │               │   └─node,5107 /path/to/some_script.js
   │               │       ├─{node},5109
   │               │       ├─{node},5110
   │               │       ├─{node},5111
   │               │       ├─{node},5112
   │               │       ├─{node},5113
   │               │       ├─{node},5114
   │               │       ├─{node},5115
   │               │       └─{node},5116
   │               └─sudo,5106 -u someuser tee -a /var/log/some/log/file.log
   │                   └─tee,5108 -a /var/log/some/log/file.log
   ├─(airflow,4993)
   ├─(airflow,4994)
   ├─(airflow,4995)
   ├─(airflow,4996)
   ├─(airflow,4997)
   ├─(airflow,4998)
   ├─(airflow,4999)
   ├─(airflow,5000)
   ├─(airflow,5001)
   ├─(airflow,5002)
   ├─(airflow,5003)
   ├─(airflow,5004)
   ├─(airflow,5005)
   ├─(airflow,5006)
   ├─(airflow,5007)
   ├─(airflow,5008)
   ├─(airflow,5009)
   ├─(airflow,5010)
   ├─(airflow,5011)
   ├─(airflow,5012)
   ├─(airflow,5013)
   ├─(airflow,5014)
   ├─(airflow,5015)
   ├─(airflow,5016)
   ├─(airflow,5017)
   ├─(airflow,5018)
   ├─(airflow,5019)
   ├─(airflow,5020)
   ├─(airflow,5021)
   └─{airflow},5029
{code}

Running strace on process 4992 shows that it is waiting for its child process 
to terminate: {{wait4(5086,}}. Running strace on process 4984, the root 
process, shows that it is also blocked, presumably waiting for the child 
process to change state: {{futex(0x7f7ac5efc000, FUTEX_WAIT, 0, NULL}}.
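
For illustration only, here is a minimal Python sketch of the general mechanism I suspect. It is not Airflow's code, and {{spawn}} and {{run_demo}} are names made up for this example. A parent that blocks waiting on one long-running child leaves its already-exited children unreaped (defunct) until it finally calls wait on them. It assumes a POSIX system where {{os.fork}} is available.

```python
import os
import time


def spawn(seconds):
    """Fork a child that sleeps for `seconds` and then exits with status 0."""
    pid = os.fork()
    if pid == 0:
        time.sleep(seconds)
        os._exit(0)  # leave immediately without running any parent code
    return pid


def run_demo(short_count=3, long_seconds=1.0):
    """Block on one slow child, then reap the fast children afterwards."""
    long_pid = spawn(long_seconds)
    short_pids = [spawn(0) for _ in range(short_count)]

    # Blocking wait on the slow child. While the parent sits here, the fast
    # children have already exited but nobody has collected their status,
    # so tools like ps and pstree report them as defunct.
    os.waitpid(long_pid, 0)

    # Only now are the fast children reaped. WNOHANG makes waitpid return
    # right away; it yields the child's PID when that child has exited.
    reaped = [os.waitpid(pid, os.WNOHANG)[0] for pid in short_pids]
    return short_pids, reaped


if __name__ == "__main__":
    short_pids, reaped = run_demo()
    print("short-lived children:", short_pids)
    print("reaped after the long task finished:", reaped)
```

While {{run_demo}} is blocked, running {{ps}} in another terminal should show the short-lived children as defunct, which is the same picture as the parenthesized entries in the pstree output above.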

Here is more complete strace output from a previous occasion when it was hung 
in this state (the PIDs differ from those in the pstree output above):
{code}
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
futex(0x7f9eb8000ce0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x3bc5840, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x3bc5840, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9ec9141000, FUTEX_WAKE, 1)    = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9ec9141000, FUTEX_WAKE, 1)    = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9ec9141000, FUTEX_WAKE, 1)    = 0
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
--- SIGCHLD (Child exited) @ 0 (0) ---
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x3bc5840, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x3bc5840, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x3bc5840, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL

########################################################
# At this point I manually killed the long running task: sudo kill 25116 #
########################################################

futex(0x7f9ec913f000, FUTEX_WAKE, 1)    = 1
select(7, [6], NULL, NULL, {0, 0})      = 1 (in [6], left {0, 0})
read(6, "\0\0\0c", 4)                   = 4
read(6, "\200\2U!DAG0_NAME"..., 99) = 99
select(7, [6], NULL, NULL, {0, 0})      = 1 (in [6], left {0, 0})
read(6, "\0\0\0h", 4)                   = 4
read(6, "\200\2U&DAG1_NAME"..., 104) = 104
select(7, [6], NULL, NULL, {0, 0})      = 1 (in [6], left {0, 0})
read(6, "\0\0\0V", 4)                   = 4
read(6, "\200\2U\24DAG2_NAME"..., 86) = 86
select(7, [6], NULL, NULL, {0, 0})      = 1 (in [6], left {0, 0})
read(6, "\0\0\0f", 4)                   = 4
read(6, "\200\2U%DAG3_NAME"..., 102) = 102
select(7, [6], NULL, NULL, {0, 0})      = 0 (Timeout)
munmap(0x7f9ec9138000, 32)              = 0
close(9)                                = 0
munmap(0x7f9ec913a000, 32)              = 0
close(8)                                = 0
munmap(0x7f9ec9139000, 32)              = 0
gettimeofday({1478679655, 152452}, NULL) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
gettimeofday({1478679655, 153818}, NULL) = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
write(3, "redacted"..., 40) = 40
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
read(3, "redacted", 5)                = 5
read(3, "redacted"..., 41) = 41
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
write(3, "redacted"..., 43) = 43
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
read(3, "redacted", 5)                = 5
read(3, "redacted"..., 90) = 90
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
write(3, "redacted"..., 420) = 420
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
read(3, "redacted", 5)              = 5
read(3, "redacted"..., 535) = 535
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
write(3, "redacted"..., 195) = 195
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
read(3, "redacted", 5)                = 5
read(3, "redacted"..., 44) = 44
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
write(3, "redacted"..., 41) = 41
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
read(3, "redacted", 5)                = 5
read(3, "redacted"..., 42) = 42
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
wait4(25019, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25019
wait4(25021, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25021
wait4(25030, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25030
wait4(25020, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25020
wait4(25015, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25015
wait4(25017, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25017
wait4(25006, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25006
wait4(25012, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25012
wait4(25033, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25033
wait4(25007, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25007
wait4(25008, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25008
wait4(25009, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25009
wait4(25031, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25031
wait4(25026, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25026
wait4(25013, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25013
wait4(25011, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25011
wait4(25010, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25010
wait4(25022, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25022
wait4(25018, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25018
wait4(25029, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25029
wait4(25025, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25025
wait4(25024, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25024
wait4(25027, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25027
wait4(25034, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25034
wait4(25028, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25028
wait4(25023, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25023
wait4(25014, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25014
wait4(25032, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25032
wait4(25035, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25035
wait4(25037, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25037
wait4(25036, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25036
wait4(25016, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25016
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f9ec8da2cb0}, {0x5570f0, [], 
SA_RESTORER, 0x7f9ec8da2cb0}, 8) = 0
rt_sigaction(SIGALRM, {SIG_DFL, [], SA_RESTORER, 0x7f9ec8da2cb0}, {0x5570f0, 
[], SA_RESTORER, 0x7f9ec8da2cb0}, 8) = 0
rt_sigaction(SIGTERM, {SIG_DFL, [], SA_RESTORER, 0x7f9ec8da2cb0}, {0x5570f0, 
[], SA_RESTORER, 0x7f9ec8da2cb0}, 8) = 0
exit_group(0)                           = ?
{code}

> scheduler gets stuck without a trace
> ------------------------------------
>
>                 Key: AIRFLOW-401
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-401
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor, scheduler
>    Affects Versions: Airflow 1.7.1.3
>            Reporter: Nadeem Ahmed Nazeer
>            Assignee: Bolke de Bruin
>            Priority: Minor
>         Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck.png, 
> scheduler_stuck_7hours.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU 
> usage of scheduler service is at 100%. No jobs get submitted and everything 
> comes to a halt. It looks like it goes into some kind of infinite loop. 
> The only way I could make it run again is by manually restarting the 
> scheduler service. But again, after running some tasks it gets stuck. I've 
> tried with both Celery and Local executors but same issue occurs. I am using 
> the -n 3 parameter while starting scheduler. 
> Scheduler configs,
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
