Cecily Myles created OOZIE-3721:
-----------------------------------

             Summary: Subsidiaries freeze in the status of "RUNNING" during a high load on the cluster
                 Key: OOZIE-3721
                 URL: https://issues.apache.org/jira/browse/OOZIE-3721
             Project: Oozie
          Issue Type: Bug
          Components: core
    Affects Versions: 5.2.0
            Reporter: Cecily Myles

When my cluster is under heavy load, subworkflows hang in the "RUNNING" status. I see this error when working with Hive tables, but I also managed to reproduce it by running the usual pi calculation in many subworkflows to simulate the load.

I launch an Oozie workflow with the following structure:
{code:java}
-- Oozie workflow
------> subworkflow_1
---------- fork_1
---------- fork_2
---------- ...
---------- fork_n
------> subworkflow_2
---------- fork_1
---------- fork_2
---------- ...
---------- fork_n
{code}
One of the forks has the status "RUNNING", but if you open that fork, it shows a "SUCCESS" status.

Parent workflow:
{code:java}
Job ID : 0061971-240125161152217-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : test-subworkflow
App Path      : hdfs://mycluster:8020/user/cecyl/subwf/job
Status        : RUNNING
Run           : 0
User          : cecyl
Group         : -
Created       : 2024-01-25 15:55 GMT
Started       : 2024-01-25 15:55 GMT
Last Modified : 2024-01-30 06:24 GMT
Ended         : -
CoordAction ID: -

Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                            Status   Ext ID                                Ext Status  Err Code
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@:start:  OK       -                                     OK          -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork     OK       -                                     OK          -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork7    OK       0067643-240125161152217-oozie-oozi-W  SUCCEEDED   -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork9    OK       0067640-240125161152217-oozie-oozi-W  SUCCEEDED   -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork10   RUNNING  0067641-240125161152217-oozie-oozi-W  RUNNING     -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork5    OK       0067645-240125161152217-oozie-oozi-W  SUCCEEDED   -
------------------------------------------------------------------------------------------------------------------------------------
{code}
Running subworkflow:
{code:java}
Job ID : 0067641-240125161152217-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : test-subworkflow
App Path      : hdfs://mycluster:8020/user/cecyl/subwf
Status        : RUNNING
Run           : 0
User          : cecyl
Group         : -
Created       : 2024-01-26 04:20 GMT
Started       : 2024-01-26 04:20 GMT
Last Modified : 2024-01-26 08:23 GMT
Ended         : -
CoordAction ID: 0061971-240125161152217-oozie-oozi-W

Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                            Status   Ext ID                                Ext Status  Err Code
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@:start:  OK       -                                     OK          -
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork     OK       -                                     OK          -
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork21   RUNNING  application_1706187939089_147514      RUNNING     -
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork22   RUNNING  application_1706187939089_147519      RUNNING     -
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork18   RUNNING  application_1706187939089_147518      RUNNING     -
------------------------------------------------------------------------------------------------------------------------------------
{code}
However, the application that Oozie reports as running is already in the "FINISHED" state with the final state "SUCCEEDED":
{code:java}
Application Report :
    Application-Id : application_1706187939089_147514
    Application-Name : oozie:launcher:T=shell:W=test-subworkflow:A=fork21:ID=0067641-240125161152217-oozie-oozi-W
    Application-Type : Oozie Launcher
    User : cecyl
    Queue : default
    Application Priority : 0
    Start-Time : 1706259786568
    Finish-Time : 1706259853156
    Progress : 100%
    State : FINISHED
    Final-State : SUCCEEDED
{code}
The problem began to appear more often after tuning HA.

The workaround is to reduce the load and restart the application, but that is not an acceptable solution for me.

Nothing in the launcher and server logs indicates that something is going wrong.

Does anyone have an idea why this behavior occurs?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)