Cecily Myles created OOZIE-3721:
-----------------------------------

             Summary: Subsidiaries freeze in the status of "RUNNING" during a 
high load on the cluster
                 Key: OOZIE-3721
                 URL: https://issues.apache.org/jira/browse/OOZIE-3721
             Project: Oozie
          Issue Type: Bug
          Components: core
    Affects Versions: 5.2.0
            Reporter: Cecily Myles


When my cluster is under load, subworkflows hang in the "RUNNING" status. I see 
this problem when working with Hive tables, but I also managed to reproduce it 
by launching the usual pi calculation in many subworkflows to imitate the load.

I launch an Oozie workflow with the following structure:
{code:java}
-- Oozie workflow
------> subworkflow_1
---------- fork_1
---------- fork_2
---------- ...
---------- fork_n
------> subworkflow_2
---------- fork_1
---------- fork_2
---------- ...
---------- fork_n {code}
One of the forks has the status "RUNNING", but if you open this fork, it has 
the "SUCCESS" status.

Parent workflow:
{code:java}
Job ID : 0061971-240125161152217-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : test-subworkflow
App Path      : hdfs://mycluster:8020/user/cecyl/subwf/job
Status        : RUNNING
Run           : 0
User          : cecyl
Group         : -
Created       : 2024-01-25 15:55 GMT
Started       : 2024-01-25 15:55 GMT
Last Modified : 2024-01-30 06:24 GMT
Ended         : -
CoordAction ID: -

Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                            Status    Ext ID                                Ext Status  Err Code
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@:start:  OK        -                                     OK          -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork     OK        -                                     OK          -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork7    OK        0067643-240125161152217-oozie-oozi-W  SUCCEEDED   -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork9    OK        0067640-240125161152217-oozie-oozi-W  SUCCEEDED   -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork10   RUNNING   0067641-240125161152217-oozie-oozi-W  RUNNING     -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork5    OK        0067645-240125161152217-oozie-oozi-W  SUCCEEDED   -
------------------------------------------------------------------------------------------------------------------------------------
 {code}
Running subworkflow:
{code:java}
Job ID : 0067641-240125161152217-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : test-subworkflow
App Path      : hdfs://mycluster:8020/user/cecyl/subwf
Status        : RUNNING
Run           : 0
User          : cecyl
Group         : -
Created       : 2024-01-26 04:20 GMT
Started       : 2024-01-26 04:20 GMT
Last Modified : 2024-01-26 08:23 GMT
Ended         : -
CoordAction ID: 0061971-240125161152217-oozie-oozi-W

Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                            Status    Ext ID                            Ext Status  Err Code
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@:start:  OK        -                                 OK          -
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork     OK        -                                 OK          -
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork21   RUNNING   application_1706187939089_147514  RUNNING     -
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork22   RUNNING   application_1706187939089_147519  RUNNING     -
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork18   RUNNING   application_1706187939089_147518  RUNNING     -
------------------------------------------------------------------------------------------------------------------------------------
 {code}
However, the application that Oozie reports as running has the state "FINISHED" 
and the final state "SUCCEEDED" in YARN:
{code:java}
Application Report :
        Application-Id : application_1706187939089_147514
        Application-Name : oozie:launcher:T=shell:W=test-subworkflow:A=fork21:ID=0067641-240125161152217-oozie-oozi-W
        Application-Type : Oozie Launcher
        User : cecyl
        Queue : default
        Application Priority : 0
        Start-Time : 1706259786568
        Finish-Time : 1706259853156
        Progress : 100%
        State : FINISHED
        Final-State : SUCCEEDED {code}
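The mismatch shown above can be expressed as a small check. This is only a minimal sketch to illustrate the condition, not part of Oozie: `OozieAction` and `find_stale_actions` are hypothetical names, and the input values are copied by hand from the CLI output above rather than fetched from Oozie or YARN. Only application `..._147514` has a YARN report in this issue, so the other two applications are left out of the YARN map and are skipped.

```python
from typing import Dict, List, NamedTuple, Tuple


class OozieAction(NamedTuple):
    """One row of the Oozie action table (hypothetical container)."""
    name: str
    status: str
    ext_id: str


# YARN application states after which an application no longer runs.
TERMINAL_YARN_STATES = ("FINISHED", "FAILED", "KILLED")


def find_stale_actions(
    actions: List[OozieAction],
    yarn_states: Dict[str, Tuple[str, str]],
) -> List[str]:
    """Return names of actions that Oozie reports as RUNNING although
    YARN already reports a terminal state for their application."""
    stale = []
    for action in actions:
        if action.status != "RUNNING":
            continue
        report = yarn_states.get(action.ext_id)
        if report is None:
            continue  # no YARN report was collected for this application
        state, _final_state = report
        if state in TERMINAL_YARN_STATES:
            stale.append(action.name)
    return stale


# Values taken from the subworkflow listing and the application report above.
actions = [
    OozieAction("fork21", "RUNNING", "application_1706187939089_147514"),
    OozieAction("fork22", "RUNNING", "application_1706187939089_147519"),
    OozieAction("fork18", "RUNNING", "application_1706187939089_147518"),
]
yarn_states = {"application_1706187939089_147514": ("FINISHED", "SUCCEEDED")}

stale = find_stale_actions(actions, yarn_states)
print(stale)
```

With the data from this issue, the check flags `fork21`: Oozie still shows it as RUNNING while YARN reports the launcher as FINISHED.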
The problem began to appear more often after tuning HA. The workaround is to 
reduce the load and restart the application, but that is not an acceptable 
solution for me.

Nothing in the logs, including the server logs, indicates that anything is 
going wrong. Does anyone have an idea why this behavior might occur?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
