[ https://issues.apache.org/jira/browse/OOZIE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cecily Myles updated OOZIE-3721:
--------------------------------
    Description:
When my cluster is under load, subworkflows hang in status "RUNNING". I first hit this error while working with Hive tables, but I was also able to reproduce it by launching the usual pi calculation in many subworkflows to simulate load.

I launch an Oozie workflow with the following structure:
{code:java}
-- Oozie workflow
------> subworkflow_1
---------- fork_1
---------- fork_2
---------- ...
---------- fork_n
------> subworkflow_2
---------- fork_1
---------- fork_2
---------- ...
---------- fork_n {code}
One of the forks has status "RUNNING", but if you open this fork, it has status "SUCCESS".

Parent workflow:
{code:java}
Job ID : 0061971-240125161152217-oozie-oozi-W
----------------------------------------------------------------------------------------------------------------
Workflow Name : test-subworkflow
App Path      : hdfs://mycluster:8020/user/cecyl/subwf/job
Status        : RUNNING
Run           : 0
User          : cecyl
Group         : -
Created       : 2024-01-25 15:55 GMT
Started       : 2024-01-25 15:55 GMT
Last Modified : 2024-01-30 06:24 GMT
Ended         : -
CoordAction ID: -

Actions
----------------------------------------------------------------------------------------------------------------
ID                                            Status   Ext ID                                Ext Status  Err Code
----------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@:start:  OK       -                                     OK          -
----------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork     OK       -                                     OK          -
----------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork7    OK       0067643-240125161152217-oozie-oozi-W  SUCCEEDED   -
----------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork9    OK       0067640-240125161152217-oozie-oozi-W  SUCCEEDED   -
----------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork10   RUNNING  0067641-240125161152217-oozie-oozi-W  RUNNING     -
----------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork5    OK       0067645-240125161152217-oozie-oozi-W  SUCCEEDED   -
----------------------------------------------------------------------------------------------------------------
{code}
Running subworkflow:
{code:java}
Job ID : 0067641-240125161152217-oozie-oozi-W
----------------------------------------------------------------------------------------------------------------
Workflow Name : test-subworkflow
App Path      : hdfs://mycluster:8020/user/cecyl/subwf
Status        : RUNNING
Run           : 0
User          : cecyl
Group         : -
Created       : 2024-01-26 04:20 GMT
Started       : 2024-01-26 04:20 GMT
Last Modified : 2024-01-26 08:23 GMT
Ended         : -
CoordAction ID: 0061971-240125161152217-oozie-oozi-W

Actions
----------------------------------------------------------------------------------------------------------------
ID                                            Status   Ext ID                                Ext Status  Err Code
----------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@:start:  OK       -                                     OK          -
----------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork     OK       -                                     OK          -
----------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork21   RUNNING  application_1706187939089_147514      RUNNING     -
----------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork22   RUNNING  application_1706187939089_147519      RUNNING     -
----------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork18   RUNNING  application_1706187939089_147518      RUNNING     -
----------------------------------------------------------------------------------------------------------------
{code}
However, the application that Oozie reports as running has State "FINISHED" and Final-State "SUCCEEDED":
{code:java}
Application Report :
    Application-Id : application_1706187939089_147514
    Application-Name : oozie:launcher:T=shell:W=test-subworkflow:A=fork21:ID=0067641-240125161152217-oozie-oozi-W
    Application-Type : Oozie Launcher
    User : cecyl
    Queue : default
    Application Priority : 0
    Start-Time : 1706259786568
    Finish-Time : 1706259853156
    Progress : 100%
    State : FINISHED
    Final-State : SUCCEEDED {code}
The problem began to appear more often after tuning HA. The only workaround is to reduce the load and restart the application, but that is not an acceptable solution for me.

There is no sign in the launcher or server logs that anything is going wrong. Does anyone have an idea why this behavior might occur?

> Subsidiaries freeze in the status of "RUNNING" during a high load on the cluster
> --------------------------------------------------------------------------------
>
>                 Key: OOZIE-3721
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3721
>             Project: Oozie
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 5.2.0
>            Reporter: Cecily Myles
>            Priority: Blocker
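For anyone trying to spot the same mismatch on their own cluster, the comparison described in this report (an Oozie action still RUNNING while its YARN application is already FINISHED) can be roughly sketched as below. This is only an illustration: it assumes text in the same shape as the job listing and application report pasted in the description, the regular expression is inferred from those samples, and the function names (`parse_actions`, `parse_yarn_report`, `stale_actions`) are hypothetical helpers, not part of Oozie or YARN.

```python
from __future__ import annotations

import re

# Matches action rows whose Ext ID is a YARN application, e.g.
#   ...-oozie-oozi-W@fork21 RUNNING application_..._147514RUNNING -
# The Ext ID and Ext Status columns may be fused, hence the \s*.
# Assumption: this pattern is inferred from the sample output only.
ACTION_RE = re.compile(
    r"(?P<action>\S+@\S+)\s+(?P<status>[A-Z_]+)\s+"
    r"(?P<app>application_\d+_\d+)\s*(?P<ext_status>[A-Z_]+)"
)


def parse_actions(oozie_info: str) -> list[dict]:
    """Extract YARN-backed actions from a job listing (hypothetical helper)."""
    return [m.groupdict() for m in ACTION_RE.finditer(oozie_info)]


def parse_yarn_report(report: str) -> dict:
    """Turn 'Key : value' lines of an application report into a dict."""
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def stale_actions(oozie_info: str, reports: dict) -> list[tuple]:
    """Return (action, app, final_state) for actions Oozie still shows as
    RUNNING although the YARN report for that application says FINISHED."""
    stale = []
    for action in parse_actions(oozie_info):
        report = reports.get(action["app"])
        if (
            action["status"] == "RUNNING"
            and report is not None
            and report.get("State") == "FINISHED"
        ):
            stale.append(
                (action["action"], action["app"], report.get("Final-State"))
            )
    return stale
```

Feeding it the subworkflow listing and the application report from the description would flag `fork21`. Actions whose Ext ID is another workflow ID rather than a YARN application are deliberately ignored by the pattern.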