[ https://issues.apache.org/jira/browse/AIRFLOW-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511135#comment-15511135 ]

Tyrone Hinderson edited comment on AIRFLOW-366 at 9/21/16 8:50 PM:
-------------------------------------------------------------------

Cool to see someone's working on this--I think a solution could be a big win 
for the tech. In my experience this issue is difficult to reproduce on-demand, 
but it happened pretty reliably whenever a scheduler ran for ~12 hours. As for 
`max_threads`, that didn't affect the issue for me (back in July when this 
issue was filed).


> SchedulerJob gets locked up when child processes attempt to log to a single 
> file
> -----------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-366
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-366
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>            Reporter: Greg Neiheisel
>            Assignee: Bolke de Bruin
>
> After running the scheduler for a while (usually after 1 - 5 hours) it will 
> eventually lock up, and nothing will get scheduled.
> A `SchedulerJob` will end up getting stuck in the `while` loop around line 
> 730 of `airflow/jobs.py`.
> From what I can tell, this is related to logging from within a forked 
> process using Python's multiprocessing module.
> The job will fork off some child processes to process the DAGs, but one (or 
> more) will end up getting stuck and not terminating, resulting in the while 
> loop getting hung up.  You can `kill -9 PID` the child process manually, and 
> the loop will end and the scheduler will go on its way, until it happens 
> again.
> The issue is due to usage of the logging module from within the child 
> processes.  From what I can tell, logging to a file from multiple processes 
> is not supported by the multiprocessing module, though it is supported 
> across Python threads via a locking mechanism.
> I think a child process will somehow inherit a logger that is locked right 
> when it is forked, resulting in the process completely locking up (see the 
> sketch after this description).
> I went in and commented out all the logging statements that could possibly be 
> hit by the child process (jobs.py, models.py), and was able to keep the 
> scheduler alive.


