[ 
https://issues.apache.org/jira/browse/HBASE-29041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nirdosh Kumar Yadav updated HBASE-29041:
----------------------------------------
    Description: 
In an HBase cluster we encountered a scenario where a RegionServer server crash 
procedure (SCP) waited for more than 3 hours. The incident was triggered by 
temporary network unavailability in the HBase cluster. On debugging, we found 
that the SCP was stuck on its child {{SplitWALProcedure}}, which was waiting for 
the regionserver worker to complete the {{SplitWALRemoteProcedure}}. While 
running, the {{SplitWALRemoteProcedure}} encountered an unknown exception. In 
the logs we can see a *"hdfs.DataStreamer - No ack received"* error while the 
regionserver was connecting to a DataNode. After this error the thread was 
either stuck or had died, as no further related logs exist. Inconsistent 
regions were reported during this period. All procedures were restarted and 
completed after the active HMaster service was bounced.

Related logs:

[HMASTER-4]
2024-12-05 14:55:11,264 INFO [PEWorker-41] procedure2.ProcedureExecutor - 
Initialized subprocedures=[{pid=6003288, ppid=6002575, state=RUNNABLE; 
SplitWALRemoteProcedure 
regionserver-53.regionserver.hbase.hbase33a.hbase.monitoring.aws-esvc1-useast2.aws.sfdc.is%2C60020%2C1730886070174.1733410178680, 
worker=regionserver-15.regionserver.hbase.hbase33a.hbase.monitoring.aws-esvc1-useast2.aws.sfdc.is,60020,1730878238028}]

[RS-15]
2024-12-05 14:55:11,461 DEBUG 
[iority.RWQ.Fifo.read.handler=83,queue=2,port=60020] regionserver.RSRpcServices 
- Executing remote procedure 
class org.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288


[RS-15]
2024-12-05 14:55:54,689 ERROR [split-log-closeStream-pool-0] hdfs.DataStreamer 
- No ack received, took 25002ms (threshold=25000ms). File being written: 
/hbase/data/default/tsdb/c997c5f8dd36481dcd3ebb9b79a35b51/recovered.edits/0000000000539451088-regionserver-53.regionserver.hbase.hbase33a.hbase.monitoring.aws-esvc1-useast2.aws.sfdc.is%2C60020%2C1730886070174.1733410178680.temp, 
block: BP-1745262640-10.60.130.13-1712173738392:blk_1330710217_257120451, 
Write pipeline datanodes: 
[DatanodeInfoWithStorage[10.60.52.107:50010,DS-f2b7ba1a-68b5-433a-9fe8-99315a172098,SSD], 
DatanodeInfoWithStorage[10.60.75.52:50010,DS-93a433be-972f-4457-92ae-dd07288e41b5,SSD]].

[HMASTER-1]
2024-12-05 18:11:42,036 DEBUG [master/hmaster-1:60000:becomeActiveMaster] 
store.ProcedureTree - Procedure Procedure(pid=6003288, ppid=6002575, 
class=org.apache.hadoop.hbase.master.procedure.SplitWALRemoteProcedure) stack 
ids=[3592]

[HMASTER-1]
2024-12-05 18:11:42,214 DEBUG [master/hmaster-1:60000:becomeActiveMaster] 
procedure2.ProcedureExecutor - Loading pid=6003288, ppid=6002575, 
state=RUNNABLE; SplitWALRemoteProcedure 
regionserver-53.regionserver.hbase.hbase33a.hbase.monitoring.aws-esvc1-useast2.aws.sfdc.is%2C60020%2C1730886070174.1733410178680, 
worker=regionserver-15.regionserver.hbase.hbase33a.hbase.monitoring.aws-esvc1-useast2.aws.sfdc.is,60020,1730878238028

[RS-15]
2024-12-05 18:11:42,769 DEBUG 
[iority.RWQ.Fifo.read.handler=79,queue=7,port=60020] regionserver.RSRpcServices 
- Executing remote procedure 
class org.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288

[RS-15]
2024-12-05 18:11:48,247 DEBUG 
[_REPLAY_OPS-regionserver/regionserver-15:60020-192] 
regionserver.RemoteProcedureResultReporter - Successfully complete execution of 
pid=6003288


[HMASTER-1]
2024-12-05 18:11:48,304 INFO [PEWorker-2] procedure2.ProcedureExecutor - 
Finished pid=6002575, ppid=6000775, state=SUCCESS; 
SplitWALProcedure 
regionserver-53.regionserver.hbase.hbase33a.hbase.monitoring.aws-esvc1-useast2.aws.sfdc.is%2C60020%2C1730886070174.1733410178680, 
worker=regionserver-15.regionserver.hbase.hbase33a.hbase.monitoring.aws-esvc1-useast2.aws.sfdc.is,60020,1730878238028 
in 3 hrs, 16 mins, 52.806 sec
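
The issue title proposes setting an uncaught-exception handler on the 
RegionServer executor threads, so that a worker thread dying on an unexpected 
error leaves a trace instead of disappearing silently. The following is only a 
minimal sketch of that idea using plain java.util.concurrent, not HBase's 
actual ExecutorService code or the eventual fix. The class name, 
loggingThreadFactory, runFailingTask, and the "split-wal-sketch" thread prefix 
are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class UncaughtHandlerSketch {
  // Last failure seen by the handler. A real server would log it and
  // possibly report the failure to the master; this sketch only records it.
  static final AtomicReference<Throwable> LAST_FAILURE = new AtomicReference<>();
  static final CountDownLatch HANDLED = new CountDownLatch(1);

  /** Hypothetical factory: names threads and installs an uncaught-exception handler. */
  static ThreadFactory loggingThreadFactory(String prefix) {
    AtomicInteger seq = new AtomicInteger();
    return runnable -> {
      Thread t = new Thread(runnable, prefix + "-" + seq.incrementAndGet());
      t.setDaemon(true);
      t.setUncaughtExceptionHandler((thread, err) -> {
        // Without a handler like this, the only trace is the JVM default
        // stack dump on stderr, which may never reach the service logs.
        System.err.println("Thread " + thread.getName() + " died: " + err);
        LAST_FAILURE.set(err);
        HANDLED.countDown();
      });
      return t;
    };
  }

  /** Runs one task that throws and reports whether the handler saw it. */
  static boolean runFailingTask() throws InterruptedException {
    ExecutorService pool =
        Executors.newFixedThreadPool(1, loggingThreadFactory("split-wal-sketch"));
    // execute(), not submit(): submit() captures the error in the returned
    // Future, so the uncaught-exception handler would never run.
    pool.execute(() -> {
      throw new IllegalStateException("simulated failure in worker task");
    });
    boolean handled = HANDLED.await(5, TimeUnit.SECONDS);
    pool.shutdown();
    return handled;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("handler invoked: " + runFailingTask());
  }
}
```

Note that a handler of this kind only makes the failure visible. Whether the 
stuck remote procedure is then retried or reported back to the master is a 
separate design decision that this sketch does not address.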



> Set UncaughtException Handler for RegionServer ExecutorService
> --------------------------------------------------------------
>
>                 Key: HBASE-29041
>                 URL: https://issues.apache.org/jira/browse/HBASE-29041
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 3.0.0, 2.6.1, 2.5.10
>            Reporter: Nirdosh Kumar Yadav
>            Priority: Minor
>
> In an HBase cluster we encountered a scenario where a RegionServer server 
> crash procedure (SCP) waited for more than 3 hours. The incident was 
> triggered by temporary network unavailability in the HBase cluster. On 
> debugging, we found that the SCP was stuck on its child 
> {{SplitWALProcedure}}, which was waiting for the regionserver worker to 
> complete the {{SplitWALRemoteProcedure}}. While running, the 
> {{SplitWALRemoteProcedure}} encountered an unknown exception. In the logs we 
> can see a *"hdfs.DataStreamer - No ack received"* error while the 
> regionserver was connecting to a DataNode. After this error the thread was 
> either stuck or had died, as no further related logs exist. Inconsistent 
> regions were reported during this period. All procedures were restarted and 
> completed after the active HMaster service was bounced.
> Related logs:
> [HMASTER-4]
> 2024-12-05 14:55:11,264 INFO [PEWorker-41] procedure2.ProcedureExecutor - 
> Initialized subprocedures=[{pid=6003288, ppid=6002575, state=RUNNABLE; 
> SplitWALRemoteProcedure 
> regionserver-53.regionserver.hbase.hbase33a.hbase.monitoring.aws-esvc1-useast2.aws.sfdc.is%2C60020%2C1730886070174.1733410178680, 
> worker=regionserver-15.regionserver.hbase.hbase33a.hbase.monitoring.aws-esvc1-useast2.aws.sfdc.is,60020,1730878238028}]
> [RS-15]
> 2024-12-05 14:55:11,461 DEBUG 
> [iority.RWQ.Fifo.read.handler=83,queue=2,port=60020] 
> regionserver.RSRpcServices - Executing remote procedure 
> class org.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288
> [RS-15]
> 2024-12-05 14:55:54,689 ERROR [split-log-closeStream-pool-0] 
> hdfs.DataStreamer - No ack received, took 25002ms (threshold=25000ms). File 
> being written: 
> /hbase/data/default/tsdb/c997c5f8dd36481dcd3ebb9b79a35b51/recovered.edits/0000000000539451088-regionserver-53.regionserver.hbase.hbase33a.hbase.monitoring.aws-esvc1-useast2.aws.sfdc.is%2C60020%2C1730886070174.1733410178680.temp, 
> block: BP-1745262640-10.60.130.13-1712173738392:blk_1330710217_257120451, 
> Write pipeline datanodes: 
> [DatanodeInfoWithStorage[10.60.52.107:50010,DS-f2b7ba1a-68b5-433a-9fe8-99315a172098,SSD], 
> DatanodeInfoWithStorage[10.60.75.52:50010,DS-93a433be-972f-4457-92ae-dd07288e41b5,SSD]].
> [HMASTER-1]
> 2024-12-05 18:11:42,036 DEBUG [master/hmaster-1:60000:becomeActiveMaster] 
> store.ProcedureTree - Procedure Procedure(pid=6003288, ppid=6002575, 
> class=org.apache.hadoop.hbase.master.procedure.SplitWALRemoteProcedure) stack 
> ids=[3592]
> [HMASTER-1]
> 2024-12-05 18:11:42,214 DEBUG [master/hmaster-1:60000:becomeActiveMaster] 
> procedure2.ProcedureExecutor - Loading pid=6003288, ppid=6002575, 
> state=RUNNABLE; SplitWALRemoteProcedure 
> regionserver-53.regionserver.hbase.hbase33a.hbase.monitoring.aws-esvc1-useast2.aws.sfdc.is%2C60020%2C1730886070174.1733410178680, 
> worker=regionserver-15.regionserver.hbase.hbase33a.hbase.monitoring.aws-esvc1-useast2.aws.sfdc.is,60020,1730878238028
> [RS-15]
> 2024-12-05 18:11:42,769 DEBUG 
> [iority.RWQ.Fifo.read.handler=79,queue=7,port=60020] 
> regionserver.RSRpcServices - Executing remote procedure 
> class org.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288
> [RS-15]
> 2024-12-05 18:11:48,247 DEBUG 
> [_REPLAY_OPS-regionserver/regionserver-15:60020-192] 
> regionserver.RemoteProcedureResultReporter - Successfully complete execution 
> of pid=6003288
> [HMASTER-1]
> 2024-12-05 18:11:48,304 INFO [PEWorker-2] procedure2.ProcedureExecutor - 
> Finished pid=6002575, ppid=6000775, state=SUCCESS; 
> SplitWALProcedure 
> regionserver-53.regionserver.hbase.hbase33a.hbase.monitoring.aws-esvc1-useast2.aws.sfdc.is%2C60020%2C1730886070174.1733410178680, 
> worker=regionserver-15.regionserver.hbase.hbase33a.hbase.monitoring.aws-esvc1-useast2.aws.sfdc.is,60020,1730878238028 
> in 3 hrs, 16 mins, 52.806 sec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
