[ 
https://issues.apache.org/jira/browse/HBASE-29041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nihal Jain resolved HBASE-29041.
--------------------------------
    Resolution: Fixed

Pushed to branch=2.5+. Thanks [~nirdosh.yadav] for reporting, [~dishavaidh] 
for providing a fix, and [~pankajkumar] and [~zhangduo] for helping with reviews.

> Set UncaughtException Handler for RegionServer ExecutorService
> --------------------------------------------------------------
>
>                 Key: HBASE-29041
>                 URL: https://issues.apache.org/jira/browse/HBASE-29041
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 3.0.0, 2.6.1, 2.5.10
>            Reporter: Nirdosh Kumar Yadav
>            Assignee: Disha Vaidh
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.3, 2.5.12
>
>
> In an HBase cluster, we encountered a scenario where the RegionServer crash 
> procedure (SCP) took more than 3 hours to complete. The incident was 
> triggered by temporary network unavailability in the HBase cluster. Upon 
> debugging, we found that the SCP was stuck on its child SplitWALProcedure, 
> which was waiting for the RegionServer worker to complete the SplitWALRemote 
> procedure.
> While executing, the SplitWALRemote procedure encountered an unknown 
> exception. The logs show the error message "hdfs.DataStreamer - No ack 
> received," indicating a problem with the RegionServer's connection to the 
> DataNode. After this error, the thread appears to have either hung or 
> terminated, since no further related logs were found.
> During this period, inconsistent regions were reported. The worker RS was 
> healthy and did not report any sign of aborting or being unhealthy. All 
> procedures were restarted and completed successfully after the active HMaster 
> service was bounced.
> We might need to set an UncaughtExceptionHandler 
> [here|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/executor/ExecutorService.java#L255]
> Related logs:
> [HMASTER-4]
> 2024-12-05 14:55:11,264 INFO [PEWorker-41] procedure2.ProcedureExecutor - 
> Initialized subprocedures=[
> {pid=6003288, ppid=6002575, state=RUNNABLE; 
> SplitWALRemoteProcedureregionserver-53.regionserver.hbase.hbasexxxxxx. 
> %2Cxxxxx%2C1730886070174.1733410178680, 
> worker=regionserver-15.regionserver.xxxxx,1730878238028}
> ]
> [RS-15]
> 2024-12-05 14:55:11,461 DEBUG 
> [iority.RWQ.Fifo.read.handler=83,queue=2,port=xxxx] 
> regionserver.RSRpcServices - Executing remote procedure 
> classorg.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288
> [RS-15]
> 2024-12-05 14:55:54,689 ERROR [split-log-closeStream-pool-0] 
> hdfs.DataStreamer - No ack received, took 25002ms (threshold=25000ms). File 
> being written: 
> /hbase/data/default/tsdb/c997c5f8dd36481dcd3ebb9b79a35b51/recovered.edits/0000000000539451088-regionserver-53.regionserver.hbase.hbasexxa.xxxxx.is%2Cxxxxx%2C1730886070174.1733410178680.temp,
>  block: BP-1745262640-10.60.130.13-1712173738392:blk_1330710217_257120451, 
> Write pipeline datanodes: 
> [DatanodeInfoWithStorage[10.60.52.107:xxxxx,DS-f2b7ba1a-68b5-433a-9fe8-99315a172098,SSD],
>  
> DatanodeInfoWithStorage[10.60.75.52:xxxxxx,DS-93a433be-972f-4457-92ae-dd07288e41b5,SSD]].
> [HMASTER-1]
> 2024-12-05 18:11:42,036 DEBUG [master/hmaster-1:xxxxx:becomeActiveMaster] 
> store.ProcedureTree -Procedure Procedure(pid=6003288, ppid=6002575, 
> class=org.apache.hadoop.hbase.master.procedure.SplitWALRemoteProcedure) stack 
> ids=[3592]
> [HMASTER-1]
> 2024-12-05 18:11:42,214 DEBUG [master/hmaster-1:xxxxx:becomeActiveMaster] 
> procedure2.ProcedureExecutor - Loading pid=6003288, ppid=6002575, 
> state=RUNNABLE; SplitWALRemoteProcedure 
> regionserver-53.regionserver.hbase.xxxxxx.is%2C60020%2C1730886070174.1733410178680,
>  
> worker=regionserver-15.regionserver.hbase.hbasexxa.xxxxx.is,xxxxx,1730878238028
> [RS-15]
> 2024-12-05 18:11:42,769 DEBUG 
> [iority.RWQ.Fifo.read.handler=79,queue=7,port=60020] 
> regionserver.RSRpcServices - Executing remote procedure 
> classorg.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288
> [RS-15]
> 2024-12-05 18:11:48,247 DEBUG 
> [_REPLAY_OPS-regionserver/regionserver-15:xxxxx-192] 
> regionserver.RemoteProcedureResultReporter - Successfully complete execution 
> of pid=6003288
> [HMASTER-1]
> 2024-12-05 18:11:48,304 INFO [PEWorker-2] procedure2.ProcedureExecutor - 
> Finished pid=6002575, ppid=6000775, state=SUCCESS; 
> SplitWALProcedureregionserver-53.regionserver.hbase.hbasexxa.xxxxxxx.is%2Cxxxxx%2C1730886070174.1733410178680,
>  
> worker=regionserver-15.regionserver.hbase.hbasexxa.xxxxxx,xxxxxx,1730878238028
>  in 3 hrs, 16 mins, 52.806 sec
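
For illustration only, the idea proposed in the description (setting an
uncaught exception handler on the executor's worker threads) can be sketched
with plain java.util.concurrent classes. This is not the HBase patch: the
class name UncaughtHandlerSketch, the newPool helper, the thread name prefix
and the handler body are all hypothetical. It assumes tasks are handed to the
pool with execute(); a task passed to submit() is wrapped in a Future that
captures the exception, so the handler would not fire.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not HBase code: a pool whose worker threads report
// uncaught exceptions instead of dying silently.
public class UncaughtHandlerSketch {
  // Last throwable seen by the handler, kept so a caller can inspect it.
  static final AtomicReference<Throwable> lastUncaught = new AtomicReference<>();
  // Released once the handler has run at least once.
  static final CountDownLatch handled = new CountDownLatch(1);

  static ThreadPoolExecutor newPool(String namePrefix) {
    AtomicInteger seq = new AtomicInteger();
    ThreadFactory factory = r -> {
      Thread t = new Thread(r, namePrefix + "-" + seq.incrementAndGet());
      t.setDaemon(true);
      // Called when a task run via execute() throws out of the worker thread.
      t.setUncaughtExceptionHandler((thread, e) -> {
        System.err.println("Thread " + thread.getName()
            + " exited with uncaught exception: " + e);
        lastUncaught.set(e);
        handled.countDown();
      });
      return t;
    };
    return new ThreadPoolExecutor(1, 1, 60, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>(), factory);
  }

  public static void main(String[] args) throws InterruptedException {
    ThreadPoolExecutor pool = newPool("sketch-worker");
    // Simulate a task that fails with an unexpected runtime exception.
    pool.execute(() -> {
      throw new IllegalStateException("simulated failure");
    });
    boolean seen = handled.await(5, TimeUnit.SECONDS);
    System.out.println("handler invoked: " + seen);
    pool.shutdown();
  }
}
```

With a handler like this, a worker thread that dies on an unexpected exception
at least leaves a log line naming the thread and the cause, which is the kind
of evidence that was missing in the incident described above.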



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
