[ https://issues.apache.org/jira/browse/AIRFLOW-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511135#comment-15511135 ]
Tyrone Hinderson commented on AIRFLOW-366:
------------------------------------------

Cool to see someone's working on this. I think a solution could be a big win for the project. In my experience this issue is difficult to reproduce on demand, but it happened pretty reliably whenever a scheduler ran for ~12 hours. As for max_threads, that didn't affect the issue for me (back in July when this issue was filed).

> SchedulerJob gets locked up when child processes attempt to log to
> single file
> -----------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-366
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-366
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>            Reporter: Greg Neiheisel
>            Assignee: Bolke de Bruin
>
> After running the scheduler for a while (usually after 1 - 5 hours) it will
> eventually lock up, and nothing will get scheduled.
> A `SchedulerJob` will end up getting stuck in the `while` loop around line
> 730 of `airflow/jobs.py`.
> From what I can tell, this is related to logging from within a forked
> process using Python's multiprocessing module.
> The job will fork off some child processes to process the DAGs, but one (or
> more) will end up getting stuck and not terminating, resulting in the while
> loop getting hung up. You can `kill -9 PID` the child process manually, and
> the loop will end and the scheduler will go on its way, until it happens
> again.
> The issue is due to usage of the logging module from within the child
> processes. From what I can tell, logging to a file from multiple processes
> is not supported by the multiprocessing module, but it is supported across
> multiple threads, using some sort of locking mechanism.
> I think a child process will somehow inherit a logger that is locked, right
> when it is forked, resulting in the process completely locking up.
> I went in and commented out all the logging statements that could possibly
> be hit by the child process (jobs.py, models.py), and was able to keep the
> scheduler alive.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
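The failure mode described above (a forked child inheriting a logging handler whose internal lock is held at fork time) is commonly worked around by routing child-process records through a multiprocessing.Queue, so that only one process ever touches the shared handler. The following is a minimal sketch of that pattern using the standard library's QueueHandler/QueueListener (Python 3.2+); it is not the fix Airflow itself shipped, and the names `worker` and `main` are illustrative:

```python
import logging
import logging.handlers
import multiprocessing


def worker(queue, n):
    # Each child logs only through a QueueHandler: records cross the
    # process boundary via the queue instead of a shared, lock-protected
    # file handler, so there is no lock to inherit in a bad state.
    logger = logging.getLogger("sketch")
    logger.setLevel(logging.INFO)
    logger.handlers = [logging.handlers.QueueHandler(queue)]
    logger.propagate = False
    logger.info("message from child %d", n)


def main():
    queue = multiprocessing.Queue()
    # Only the parent owns the real handler; the QueueListener drains
    # the queue in a background thread and writes records itself.
    handler = logging.StreamHandler()
    listener = logging.handlers.QueueListener(queue, handler)
    listener.start()
    procs = [multiprocessing.Process(target=worker, args=(queue, i))
             for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    listener.stop()


if __name__ == "__main__":
    main()
```

In this arrangement a fork can no longer strand a child on a held handler lock, because children never acquire one; they only enqueue records.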