[ 
https://issues.apache.org/jira/browse/AIRFLOW-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001844#comment-17001844
 ] 

Matsubara Yuya edited comment on AIRFLOW-5881 at 12/22/19 9:40 AM:
-------------------------------------------------------------------

Using the KubernetesExecutor, I am facing the same issue.

When running more than 4000 tasks inside one DAG, the DAG does not run properly. After triggering the DAG, its state becomes Running, but no tasks are scheduled and the DAG stays in the Running state forever (DAG Runs = 'running', Recent Tasks state = 'none' in the WebUI).

Reverting to 1.10.5 does not fix the issue. 

*The scheduler logs* (kubectl logs airflow-85bfcd8c86-v4lps scheduler) repeatedly show:

[2019-12-22 08:42:42,278] {{scheduler_job.py:210}} WARNING - Killing PID XXXXXX.

*DAG log*:

[2019-12-22 09:09:07,007] {{scheduler_job.py:1507}} INFO - DAG(s) dict_keys(['d20manytest']) retrieved from /root/airflow/dags/d20.py
[2019-12-22 09:09:07,118] {{scheduler_job.py:1212}} INFO - Processing d20manytest
[2019-12-22 09:09:07,133] {{scheduler_job.py:1225}} INFO - Created None
[2019-12-22 09:09:07,149] {{scheduler_job.py:690}} INFO - Examining DAG run <DagRun d20manytest @ 2019-12-20 14:00:00+00:00: scheduled__2019-12-20T14:00:00+00:00, externally triggered: False>
[2019-12-22 09:11:05,188] {{logging_mixin.py:90}} INFO - [2019-12-22 09:11:05,187] {{settings.py:175}} INFO - settings.configure_orm(): Using pool settings. pool_size=0, max_overflow=10, pool_recycle=1800, pid=21046
[2019-12-22 09:11:05,190] {{scheduler_job.py:142}} INFO - Started process (PID=21046) to work on /root/airflow/dags/d20.py
[2019-12-22 09:11:05,192] {{scheduler_job.py:1495}} INFO - Processing file /root/airflow/dags/d20.py for tasks to queue
[2019-12-22 09:11:05,193] {{logging_mixin.py:90}} INFO - [2019-12-22 09:11:05,193] {{dagbag.py:90}} INFO - Filling up the DagBag from /root/airflow/dags/d20.py

Any idea why this would happen?

 


was (Author: yuya):
Using the KubernetesExecutor, I am facing the same issue, too.

When running over 4000 tasks inside one DAG, the DAG does not run properly. 
After triggering the DAG, the DAG state comes into the Running state, but it does 
not schedule tasks and the DAG stays in the Running state eternally (DAG Runs = 
'running', Recent Tasks state = 'none' on the WebUI).

Reverting to 1.10.5 does not fix the issue. 

*The scheduler logs* ( kubectl logs airflow-85bfcd8c86-v4lps scheduler ) show 
repeatedly:

[2019-12-22 08:42:42,278] {{scheduler_job.py:210}} WARNING - Killing PID XXXXXX.

*DAG log* :

[2019-12-22 09:09:07,007] {{scheduler_job.py:1507}} INFO - DAG(s) dict_keys(['d20manytest']) retrieved from /root/airflow/dags/d20.py
[2019-12-22 09:09:07,118] {{scheduler_job.py:1212}} INFO - Processing d20manytest
[2019-12-22 09:09:07,133] {{scheduler_job.py:1225}} INFO - Created None
[2019-12-22 09:09:07,149] {{scheduler_job.py:690}} INFO - Examining DAG run <DagRun d20manytest @ 2019-12-20 14:00:00+00:00: scheduled__2019-12-20T14:00:00+00:00, externally triggered: False>
[2019-12-22 09:11:05,188] {{logging_mixin.py:90}} INFO - [2019-12-22 09:11:05,187] {{settings.py:175}} INFO - settings.configure_orm(): Using pool settings. pool_size=0, max_overflow=10, pool_recycle=1800, pid=21046
[2019-12-22 09:11:05,190] {{scheduler_job.py:142}} INFO - Started process (PID=21046) to work on /root/airflow/dags/d20.py
[2019-12-22 09:11:05,192] {{scheduler_job.py:1495}} INFO - Processing file /root/airflow/dags/d20.py for tasks to queue
[2019-12-22 09:11:05,193] {{logging_mixin.py:90}} INFO - [2019-12-22 09:11:05,193] {{dagbag.py:90}} INFO - Filling up the DagBag from /root/airflow/dags/d20.py

 

> Dag gets stuck in "Scheduled" State when scheduling a large number of tasks
> ---------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5881
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5881
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.10.6
>            Reporter: David Hartig
>            Priority: Critical
>         Attachments: 2 (1).log, airflow.cnf
>
>
> Running with the KubernetesExecutor in an AKS cluster, when we upgraded to 
> version 1.10.6 we noticed that all the DAGs stop making progress: tasks 
> start running and immediately exit with the following message:
> "Instance State FAILED: Task is in the 'scheduled' state which is not a 
> valid state for execution. The task must be cleared in order to be run."
> See the attached log file for the worker. Nothing seems out of the ordinary in 
> the scheduler log. 
> Reverting to 1.10.5 clears the problem.
> Note that at the time of the failure maybe 100 or so tasks are in this state, 
> with 70 coming from one highly parallelized DAG. Clearing the scheduled tasks 
> just makes them reappear shortly thereafter. Marking them "up_for_retry" 
> results in one being executed, but then the system is stuck in the original 
> zombie state. 
> Also attached is a redacted airflow config file. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
