[ https://issues.apache.org/jira/browse/AIRFLOW-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394620#comment-15394620 ]
Tyrone Hinderson edited comment on AIRFLOW-366 at 7/26/16 9:58 PM:
-------------------------------------------------------------------

+1, I have observed this as well. At some point the scheduler will log `Starting {} scheduler jobs` (around line 754), then go quiet. Meanwhile, a child scheduler process can be observed in a persistent sleep:

{code}
# ps jx
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    5    94     1     1 ?           -1 Rl       0 4409:23 /usr/bin/python /usr/local/bin/airflow scheduler
   94  6776     1     1 ?           -1 S        0    0:00 /usr/bin/python /usr/local/bin/airflow scheduler   <<< this one
 7021  7030  7030  7021 ?         7030 R+       0    0:00 ps jx
{code}

These child processes pop up occasionally, but are typically short-lived. Killing the stuck child process will cause the scheduler to spring back to life, declaring:

{code}
{jobs.py:741} INFO - Done queuing tasks, calling the executor's heartbeat
{jobs.py:744} INFO - Loop took: {} seconds
{code}

I wonder whether this is the very [issue|https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb#.o6nhckkbb] for which num_runs is recommended as a workaround. (A minimal sketch of the suspected fork/logging deadlock, and of the standard queue-based mitigation, follows the quoted issue below.)


> SchedulerJob gets locked up when child processes attempt to log to a single file
> ---------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-366
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-366
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>            Reporter: Greg Neiheisel
>
> After running the scheduler for a while (usually after 1 - 5 hours) it will eventually lock up, and nothing will get scheduled.
> A `SchedulerJob` will end up getting stuck in the `while` loop around line 730 of `airflow/jobs.py`.
> From what I can tell, this is related to logging from within a forked process using Python's multiprocessing module.
> The job will fork off some child processes to process the DAGs, but one (or more) will end up getting stuck and never terminating, leaving the while loop hung. You can `kill -9 PID` the stuck child manually, and the loop will end and the scheduler will go on its way, until it happens again.
> The issue is due to use of the logging module from within the child processes. From what I can tell, logging to a single file from multiple processes is not supported by the multiprocessing module; it is only safe across multiple threads of one process, which coordinate through a lock.
> I think a child process can inherit a logger whose lock happens to be held at the moment of the fork, leaving the child process completely locked up.
> I went in and commented out all the logging statements that could possibly be hit by the child process (jobs.py, models.py), and was able to keep the scheduler alive.
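
To make the suspected failure mode concrete, here is a minimal sketch of the fork/logging deadlock. This is not Airflow code: the file path, timings, and retry count are illustrative, and since the hang is a race it may take several runs to trigger. It also reproduces most readily on the Python 2.7 interpreters Airflow ran on at the time; recent Python 3 releases reinitialize handler locks after a fork, which mitigates exactly this. A background thread holds the file handler's lock almost continuously, and a fork that lands while the lock is held leaves the child's copy of the lock owned by a thread that no longer exists, so the child's first logging call blocks forever, just like the sleeping child scheduler in the `ps jx` output above.

{code}
import logging
import os
import signal
import threading
import time

# A shared file handler; its internal lock is what gets inherited
# (possibly in the locked state) across fork().
logging.basicConfig(filename="/tmp/fork_deadlock.log", level=logging.INFO)
log = logging.getLogger("repro")

def spam():
    # Keep the handler lock busy so a fork is likely to catch it held.
    while True:
        log.info("parent writing")

t = threading.Thread(target=spam)
t.daemon = True
t.start()
time.sleep(0.5)  # let the logging thread get going

for attempt in range(50):  # the hang is a race, so try repeatedly
    pid = os.fork()
    if pid == 0:
        # Child: the logging thread does not exist here, but the handler
        # lock it may have held at fork time was copied in the locked
        # state. If so, this call never returns, and the child sits in
        # `ps` as a permanently sleeping process.
        log.info("child writing")
        os._exit(0)
    time.sleep(1.0)
    reaped, _ = os.waitpid(pid, os.WNOHANG)
    if reaped == 0:
        print("child %d looks hung; sending SIGKILL" % pid)
        os.kill(pid, signal.SIGKILL)  # the manual workaround from above
        os.waitpid(pid, 0)            # reap the killed child
        break
{code}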
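
For completeness, the usual fix for this class of bug is architectural rather than commenting out log statements: child processes never touch the file handler at all, and instead ship their records over a queue to a single writer in the parent, so only one process ever acquires the handler's lock. Below is a minimal sketch of that pattern using `logging.handlers.QueueHandler` and `QueueListener`. Note that these classes are Python 3.2+, so on the Python 2.7 that Airflow supported at the time an equivalent queue-forwarding handler would have to be hand-rolled; the file path and worker count here are illustrative.

{code}
import logging
import logging.handlers
import multiprocessing

def worker(queue):
    # Child: replace any inherited handlers with a QueueHandler, so the
    # child never acquires the file handler's (possibly held) lock.
    root = logging.getLogger()
    root.handlers = [logging.handlers.QueueHandler(queue)]
    root.setLevel(logging.INFO)
    root.info("processing DAGs in %s",
              multiprocessing.current_process().name)

if __name__ == "__main__":
    queue = multiprocessing.Queue(-1)

    # Parent: a single listener thread is the only writer to the file.
    handler = logging.FileHandler("/tmp/scheduler_safe.log")
    handler.setFormatter(logging.Formatter("%(processName)s %(message)s"))
    listener = logging.handlers.QueueListener(queue, handler)
    listener.start()

    procs = [multiprocessing.Process(target=worker, args=(queue,))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    listener.stop()  # flush queued records and stop the writer thread
{code}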