[ https://issues.apache.org/jira/browse/AIRFLOW-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071748#comment-16071748 ]
Mike Perry commented on AIRFLOW-366:
------------------------------------
We're still seeing this in Airflow 1.8.1. Any other thoughts on a possible
workaround? We've tried removing all log statements from jobs.py and models.py,
and replacing setup_logging per [~bolke]'s syslog suggestion above.
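
For reference, the syslog-style replacement we tried looks roughly like the
sketch below. The setup_logging name, signature, and format string here are
assumptions for illustration, not the exact patch from this ticket; the idea
is simply to route the root logger to the local syslog socket so multiple
processes stop appending to the same log file:

```python
import logging
from logging.handlers import SysLogHandler

def setup_logging(log_format="airflow: %(levelname)s %(message)s"):
    # Send records to the local syslog daemon instead of a shared log
    # file, so the scheduler's forked children are not all writing to
    # one FileHandler-owned file.
    root = logging.getLogger()
    handler = SysLogHandler(address="/dev/log")  # Linux local socket
    handler.setFormatter(logging.Formatter(log_format))
    root.addHandler(handler)
    root.setLevel(logging.INFO)
```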
> SchedulerJob gets locked up when child processes attempt to log to a
> single file
> -----------------------------------------------------------------------------------
>
> Key: AIRFLOW-366
> URL: https://issues.apache.org/jira/browse/AIRFLOW-366
> Project: Apache Airflow
> Issue Type: Bug
> Components: scheduler
> Reporter: Greg Neiheisel
> Assignee: Bolke de Bruin
>
> After running the scheduler for a while (usually after 1 - 5 hours) it will
> eventually lock up, and nothing will get scheduled.
> A `SchedulerJob` will end up getting stuck in the `while` loop around line
> 730 of `airflow/jobs.py`.
> From what I can tell, this is related to logging from within a forked
> process using Python's multiprocessing module.
> The job will fork off some child processes to process the DAGs, but one (or
> more) will end up getting stuck and not terminating, resulting in the while
> loop getting hung up. You can `kill -9 PID` the stuck child process
> manually, and the loop will end and the scheduler will go on its way, until
> it happens again.
> The issue is due to usage of the logging module from within the child
> processes. From what I can tell, logging to a single file from multiple
> processes is not supported: the logging module's per-handler locks make it
> safe across threads within one process, but not across forked processes
> (a queue-based alternative is sketched below the quoted description).
> I think a child process can inherit a handler lock that was held at the
> moment of the fork, leaving the lock permanently held in the child and the
> process completely locked up (a standalone repro is sketched below).
> I went in and commented out all the logging statements that could possibly be
> hit by the child process (jobs.py, models.py), and was able to keep the
> scheduler alive.
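
To make the suspected mechanism concrete, here is a minimal standalone repro
sketch (not Airflow code; file names are made up for illustration). It
assumes Linux's fork start method and an older interpreter (CPython 3.7 and
later acquire the logging locks around fork, which should mitigate exactly
this), and it may take several runs to actually trigger:

```python
import logging
import multiprocessing
import threading
import time

logging.basicConfig(filename="/tmp/fork_lock_demo.log", level=logging.INFO)
log = logging.getLogger("demo")

def writer():
    # Hammer the FileHandler so its internal lock is frequently held.
    while True:
        log.info("parent writer thread holding the handler lock")

def child():
    # If the fork landed while the lock was held, this call never
    # returns: the child inherited a locked lock and the thread that
    # owned it does not exist in the child to release it.
    log.info("child logging after fork")

if __name__ == "__main__":
    threading.Thread(target=writer, daemon=True).start()
    time.sleep(0.2)  # let the writer thread spin up
    p = multiprocessing.Process(target=child)  # fork start method on Linux
    p.start()
    p.join(timeout=5)
    print("child still alive after 5s (deadlocked):", p.is_alive())
```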
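
As for the "not supported" part: one standard remedy, per the Python logging
cookbook, is to let a single process own the file and ship records to it over
a queue. A sketch of that pattern follows (Python 3.2+ for QueueHandler and
QueueListener; shown as background, not a proposed Airflow patch):

```python
import logging
import logging.handlers
import multiprocessing

def worker(queue):
    # Children never touch the log file: they only enqueue records.
    root = logging.getLogger()
    root.handlers = [logging.handlers.QueueHandler(queue)]
    root.setLevel(logging.INFO)
    root.info("hello from pid %s", multiprocessing.current_process().pid)

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    file_handler = logging.FileHandler("/tmp/queued_demo.log")
    # A single listener thread in the parent does all the file writing.
    listener = logging.handlers.QueueListener(queue, file_handler)
    listener.start()
    procs = [multiprocessing.Process(target=worker, args=(queue,))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    listener.stop()
```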