[
https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089818#comment-16089818
]
Rick Otten commented on AIRFLOW-401:
------------------------------------
According to the logs, ever since we set it up, it restarts itself regularly.
Before we had a lot of tasks, it was more often than since we added a lot of
tasks. This is what the logs look like during a routine restart (which cleans
up all of the dead children):
{code:plain text}
Jul 15 12:54:38 myserver airflow[4467]: [2017-07-15 12:54:38,253]
{jobs.py:1313} INFO - Exited execute loop
Jul 15 12:54:43 myserver systemd[1]: airflow-scheduler.service: Service
hold-off time over, scheduling restart.
Jul 15 12:54:43 myserver systemd[1]: Stopped Airflow scheduler daemon.
Jul 15 12:54:43 myserver systemd[1]: Started Airflow scheduler daemon.
Jul 15 12:54:43 myserver dd.collector[1540]: WARNING (disk.py:106): Unable to
get disk metrics for /sys/kernel/debug/tracing: [Errno 13] Permission denied:
'/sys/kernel/debug/tracing'
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,364]
{__init__.py:57} INFO - Using executor LocalExecutor
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,450]
{driver.py:120} INFO - Generating grammar tables from
/usr/lib/python3.5/lib2to3/Grammar.txt
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,476]
{driver.py:120} INFO - Generating grammar tables from
/usr/lib/python3.5/lib2to3/PatternGrammar.txt
Jul 15 12:54:44 myserver airflow[20289]: ____________ _____________
Jul 15 12:54:44 myserver airflow[20289]: ____ |__( )_________ __/__
/________ __
Jul 15 12:54:44 myserver airflow[20289]: ____ /| |_ /__ ___/_ /_ __ /_ __
\_ | /| / /
Jul 15 12:54:44 myserver airflow[20289]: ___ ___ | / _ / _ __/ _ / / /_/
/_ |/ |/ /
Jul 15 12:54:44 myserver airflow[20289]: _/_/ |_/_/ /_/ /_/ /_/
\____/____/|__/
Jul 15 12:54:44 myserver airflow[20289]:
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,777]
{jobs.py:1264} INFO - Starting the scheduler
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778]
{jobs.py:1280} INFO - Processing files using up to 4 processes at a time
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778]
{jobs.py:1282} INFO - Running execute loop for -1 seconds
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778]
{jobs.py:1284} INFO - Processing each file at most 100 times
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778]
{jobs.py:1286} INFO - Process each file at most once every 120 seconds
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778]
{jobs.py:1288} INFO - Checking for new files in /home/airflow/airflow/dags
every 300 seconds
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778]
{jobs.py:1291} INFO - Searching for files in /home/airflow/airflow/dags
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,779]
{jobs.py:1294} INFO - There are 20 files in /home/airflow/airflow/dags
Jul 15 12:54:44 myserver airflow[20289]:
/home/airflow/airflow/lib/python3.5/site-packages/airflow/www/app.py:23:
FlaskWTFDeprecationWarning: "flask_wtf.CsrfProtect" has been renamed to
"CSRFProtect" and will be removed in 1.0.
Jul 15 12:54:44 myserver airflow[20289]: csrf = CsrfProtect()
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,994]
{jobs.py:1356} INFO - Resetting state for orphaned tasks
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,008]
{jobs.py:1366} INFO - Resetting backup_clarivoypg 2017-07-14 06:00:00
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,018]
{jobs.py:1366} INFO - Resetting daily_web_response_processor 2017-07-14 09:00:00
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,023]
{jobs.py:1366} INFO - Resetting ping_pgbouncers 2017-07-15 09:00:00
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,027]
{jobs.py:1407} INFO - Heartbeating the process manager
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,032]
{dag_processing.py:627} INFO - Started a process (PID: 20462) to generate tasks
for /home/airflow/airflow/dags/my_daily.py - logging into
/home/airflow/airflow/logs/scheduler/2017-07-15/my_daily.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,036]
{dag_processing.py:627} INFO - Started a process (PID: 20463) to generate tasks
for /home/airflow/airflow/dags/my_hourly.py - logging into
/home/airflow/airflow/logs/scheduler/2017-07-15/my_hourly.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,042]
{dag_processing.py:627} INFO - Started a process (PID: 20464) to generate tasks
for /home/airflow/airflow/dags/my_odd_job.py - logging into
/home/airflow/airflow/logs/scheduler/2017-07-15/my_odd_job.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,047]
{dag_processing.py:627} INFO - Started a process (PID: 20465) to generate tasks
for /home/airflow/airflow/dags/my_other_odd_job.py - logging into
/home/airflow/airflow/logs/scheduler/2017-07-15/my_other_odd_job.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,049]
{jobs.py:1443} INFO - Heartbeating the executor
{code}
> scheduler gets stuck without a trace
> ------------------------------------
>
> Key: AIRFLOW-401
> URL: https://issues.apache.org/jira/browse/AIRFLOW-401
> Project: Apache Airflow
> Issue Type: Bug
> Components: executor, scheduler
> Affects Versions: Airflow 1.7.1.3
> Reporter: Nadeem Ahmed Nazeer
> Assignee: Bolke de Bruin
> Priority: Minor
> Labels: celery, kombu
> Attachments: Dag_code.txt, schduler_cpu100%.png,
> scheduler_stuck_7hours.png, scheduler_stuck.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU
> usage of scheduler service is at 100%. No jobs get submitted and everything
> comes to a halt. Looks it goes into some kind of infinite loop.
> The only way I could make it run again is by manually restarting the
> scheduler service. But again, after running some tasks it gets stuck. I've
> tried with both Celery and Local executors but same issue occurs. I am using
> the -n 3 parameter while starting scheduler.
> Scheduler configs,
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)