[ https://issues.apache.org/jira/browse/AIRFLOW-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394620#comment-15394620 ]

Tyrone Hinderson edited comment on AIRFLOW-366 at 7/26/16 9:58 PM:
-------------------------------------------------------------------

+1

I have observed this as well. At some point the scheduler will log `Starting {} 
scheduler jobs` (line ~754), then it will go quiet.
Meanwhile, a child scheduler process can repeatedly be observed sleeping:

{code}
# ps jx
  PPID    PID   PGID    SID TTY       TPGID STAT   UID   TIME COMMAND
     5     94      1      1 ?            -1 Rl       0 4409:23 /usr/bin/python /usr/local/bin/airflow scheduler
    94   6776      1      1 ?            -1 S        0   0:00 /usr/bin/python /usr/local/bin/airflow scheduler <<< this one
  7021   7030   7030   7021 ?          7030 R+       0   0:00 ps jx
{code}

These processes pop up occasionally, but are typically short-lived. Killing the 
child process will cause the scheduler to spring back to life, declaring:

{code}
{jobs.py:741} INFO - Done queuing tasks, calling the executor's heartbeat
{jobs.py:744} INFO - Loop took: {} seconds
{code}
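
Until there's a real fix, something like the sketch below could automate that manual kill. This is just a rough illustration, not Airflow code: it assumes psutil is installed, the "looks stuck" check is only the sleeping state plus the command line, and 94 stands in for the main scheduler PID from the `ps jx` output above, so verify the PIDs before running it anywhere important.

{code}
import os
import signal

import psutil  # assumed to be installed; not something Airflow ships with


def kill_stuck_scheduler_children(scheduler_pid):
    """SIGKILL sleeping 'airflow scheduler' forks of the main scheduler process."""
    parent = psutil.Process(scheduler_pid)
    for child in parent.children():
        cmdline = " ".join(child.cmdline())
        if child.status() == psutil.STATUS_SLEEPING and "airflow scheduler" in cmdline:
            # Same effect as the manual `kill -9` that unsticks the loop.
            os.kill(child.pid, signal.SIGKILL)


kill_stuck_scheduler_children(94)  # 94 = main scheduler PID from `ps jx` above
{code}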

I wonder whether this issue is related to the very 
[issue|https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb#.o6nhckkbb]
 for which num_runs is recommended as a workaround?
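
For what it's worth, the forked-logger theory in the description below seems plausible to me. Here is a minimal standalone sketch, not Airflow code, that hangs the same way on Linux (fork start method assumed) whenever the fork lands while another thread is holding the log file handler's lock:

{code}
import logging
import multiprocessing
import threading
import time

# Throwaway log file; it grows quickly while the repro runs.
logging.basicConfig(filename="/tmp/fork_logging_repro.log", level=logging.INFO)
log = logging.getLogger("repro")


def spam():
    # Keep the FileHandler's internal lock busy from a background thread.
    while True:
        log.info("background thread logging")


def child():
    # If the fork happened while the thread above held the handler's lock, this
    # call blocks forever: the child inherited the lock in its locked state and
    # the thread that owned it does not exist in the child.
    log.info("child trying to log")


if __name__ == "__main__":
    spammer = threading.Thread(target=spam)
    spammer.daemon = True
    spammer.start()
    time.sleep(0.1)
    for i in range(50):
        p = multiprocessing.Process(target=child)
        p.start()
        p.join(timeout=5)
        if p.is_alive():
            print("child %d looks hung, just like the stuck scheduler fork" % i)
            p.terminate()
{code}

If that really is the failure mode, anything that avoids sharing a locked file handler across the fork (a separate log file per child process, or re-initialising logging right after the fork) should sidestep it, which would also explain why commenting out the shared logging calls keeps the scheduler alive.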



> SchedulerJob gets locked up when child processes attempt to log to a 
> single file
> -----------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-366
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-366
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>            Reporter: Greg Neiheisel
>
> After running the scheduler for a while (usually after 1 - 5 hours) it will 
> eventually lock up, and nothing will get scheduled.
> A `SchedulerJob` will end up getting stuck in the `while` loop around line 
> 730 of `airflow/jobs.py`.
> From what I can tell, this is related to logging from within a forked process 
> using Python's multiprocessing module.
> The job will fork off some child processes to process the DAGs, but one (or 
> more) will end up getting stuck and not terminating, resulting in the while 
> loop getting hung up.  You can `kill -9 PID` the child process manually, and 
> the loop will end and the scheduler will go on its way, until it happens 
> again.
> The issue is due to usage of the logging module from within the child 
> processes.  From what I can tell, logging to a file from multiple processes 
> is not supported by the multiprocessing module, but it is supported with 
> Python multithreading, which uses some sort of locking mechanism.
> I think a child process will somehow inherit a logger that is locked right 
> when it is forked, resulting in the process completely locking up.
> I went in and commented out all the logging statements that could possibly be 
> hit by the child process (jobs.py, models.py), and was able to keep the 
> scheduler alive.


