[ 
https://issues.apache.org/jira/browse/OOZIE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cecily Myles updated OOZIE-3721:
--------------------------------
    Description: 
When my cluster is under heavy load, sub-workflows hang in the "RUNNING" 
status. I get this error when working with Hive tables, but I also managed to 
reproduce it by launching the standard pi calculation in many sub-workflows to 
simulate the load.

I launch an Oozie workflow with the following structure:
{code:java}
-- Oozie workflow
------> subworkflow_1
---------- fork_1
---------- fork_2
---------- ...
---------- fork_n
------> subworkflow_2
---------- fork_1
---------- fork_2
---------- ...
---------- fork_n {code}
One of the forks has the status "RUNNING", but if you open this fork, it 
shows the "SUCCESS" status.

Parent workflow:
{code:java}
Job ID : 0061971-240125161152217-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------
Workflow Name : test-subworkflow
App Path      : hdfs://mycluster:8020/user/cecyl/subwf/job
Status        : RUNNING
Run           : 0
User          : cecyl
Group         : -
Created       : 2024-01-25 15:55 GMT
Started       : 2024-01-25 15:55 GMT
Last Modified : 2024-01-30 06:24 GMT
Ended         : -
CoordAction ID: -

Actions
-------------------------------------------------------------------------------------------------------------------------
ID                                                       Status    Ext ID                               Ext Status Err Code
-------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@:start:             OK        -                                    OK         -
-------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork                OK        -                                    OK         -
-------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork7               OK        0067643-240125161152217-oozie-oozi-W SUCCEEDED  -
-------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork9               OK        0067640-240125161152217-oozie-oozi-W SUCCEEDED  -
-------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork10              RUNNING   0067641-240125161152217-oozie-oozi-W RUNNING    -
-------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork5               OK        0067645-240125161152217-oozie-oozi-W SUCCEEDED  -
-------------------------------------------------------------------------------------------------------------------------
 {code}
Running subworkflow:
{code:java}
Job ID : 0067641-240125161152217-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : test-subworkflow
App Path      : hdfs://mycluster:8020/user/cecyl/subwf
Status        : RUNNING
Run           : 0
User          : cecyl
Group         : -
Created       : 2024-01-26 04:20 GMT
Started       : 2024-01-26 04:20 GMT
Last Modified : 2024-01-26 08:23 GMT
Ended         : -
CoordAction ID: 0061971-240125161152217-oozie-oozi-W

Actions
-------------------------------------------------------------------------------------------------------------------------
ID                                                       Status    Ext ID                               Ext Status Err Code
-------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@:start:             OK        -                                    OK         -
-------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork                OK        -                                    OK         -
-------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork21              RUNNING   application_1706187939089_147514     RUNNING    -
-------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork22              RUNNING   application_1706187939089_147519     RUNNING    -
-------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork18              RUNNING   application_1706187939089_147518     RUNNING    -
-------------------------------------------------------------------------------------------------------------------------
{code}
However, the supposedly running application is in state "FINISHED" with final 
state "SUCCEEDED":
{code:java}
Application Report :
        Application-Id : application_1706187939089_147514
        Application-Name : oozie:launcher:T=shell:W=test-subworkflow:A=fork21:ID=0067641-240125161152217-oozie-oozi-W
        Application-Type : Oozie Launcher
        User : cecyl
        Queue : default
        Application Priority : 0
        Start-Time : 1706259786568
        Finish-Time : 1706259853156
        Progress : 100%
        State : FINISHED
        Final-State : SUCCEEDED {code}
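To spot this mismatch without opening every action by hand, something like the 
following sketch could be used. It assumes the text printed by 
`yarn application -status <application id>` has already been captured as a 
string, and that the Oozie action status is known from the listing above. The 
function names and the sample constant are illustrative only; they are not 
part of Oozie or YARN.

```python
# Sketch: flag Oozie actions still marked RUNNING whose YARN launcher has ended.
# All names below are hypothetical helpers, not Oozie or YARN APIs.

# YARN application states after which the application no longer runs.
YARN_TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def parse_application_report(report):
    """Collect 'Key : value' pairs from `yarn application -status` text."""
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # lines without ' : ' (the header, wrapped values) are skipped
            fields[key.strip()] = value.strip()
    return fields


def is_stale(oozie_action_status, report_fields):
    """True when Oozie says RUNNING but YARN reports a terminal state."""
    return (
        oozie_action_status == "RUNNING"
        and report_fields.get("State") in YARN_TERMINAL_STATES
    )


# Abbreviated copy of the application report quoted above.
SAMPLE_REPORT = """\
Application Report :
        Application-Id : application_1706187939089_147514
        Application-Type : Oozie Launcher
        State : FINISHED
        Final-State : SUCCEEDED
"""

if __name__ == "__main__":
    fields = parse_application_report(SAMPLE_REPORT)
    # fork21 is listed as RUNNING in the sub-workflow output above.
    print(fields["Application-Id"], is_stale("RUNNING", fields))
```

For the report quoted above, `is_stale("RUNNING", fields)` returns `True`, 
which is exactly the inconsistency described in this issue.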
The problem began to appear more often after configuring HA. The only 
workaround is to reduce the load and restart the application, but that is not 
an acceptable solution for me.

There is nothing in the job logs or the server logs to indicate that something 
is going wrong. Does anyone have an idea why this behavior can occur?


> Sub-workflows freeze in the "RUNNING" status under high load on the 
> cluster
> --------------------------------------------------------------------------------
>
>                 Key: OOZIE-3721
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3721
>             Project: Oozie
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 5.2.0
>            Reporter: Cecily Myles
>            Priority: Blocker



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
