[ https://issues.apache.org/jira/browse/HBASE-29041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nihal Jain resolved HBASE-29041.
--------------------------------
    Resolution: Fixed

Pushed to branch-2.5+. Thanks [~nirdosh.yadav] for reporting, [~dishavaidh] for providing a fix and [~pankajkumar] and [~zhangduo] for helping with reviews.

> Set UncaughtException Handler for RegionServer ExecutorService
> --------------------------------------------------------------
>
>                 Key: HBASE-29041
>                 URL: https://issues.apache.org/jira/browse/HBASE-29041
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 3.0.0, 2.6.1, 2.5.10
>            Reporter: Nirdosh Kumar Yadav
>            Assignee: Disha Vaidh
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.3, 2.5.12
>
>
> In an HBase cluster, we encountered a scenario where the RegionServer crash
> procedure (SCP) took more than 3 hours to complete. The incident was
> triggered by temporary network unavailability in the HBase cluster. Upon
> debugging, we found that the SCP was stuck on its child SplitWALProcedure,
> which was waiting for the RegionServer worker to complete the
> SplitWALRemote procedure.
> While executing, the SplitWALRemote procedure encountered an unknown
> exception. The logs show an error message: "hdfs.DataStreamer - No ack
> received," indicating an issue with the RegionServer's connection to the
> DataNode. After this error, the thread either became stuck or terminated;
> we cannot tell which, because no related logs were available.
> During this period, inconsistent regions were reported. The worker RS was
> healthy and did not report any sign of aborting or being unhealthy. All
> procedures were restarted and completed successfully after the Active
> HMaster service was bounced.
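The failure mode described above, a pool thread that dies without leaving any log, is what an uncaught exception handler on the executor's threads is meant to surface. The following is a minimal, self-contained Java sketch of that idea using only `java.util.concurrent`. It is not the HBase patch: the class name `UncaughtHandlerSketch`, its helper methods, and the `demo-worker` thread name are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration; not code from HBase.
final class UncaughtHandlerSketch {

  /** Builds a thread factory whose threads report uncaught Throwables to the given handler. */
  static ThreadFactory factoryWithHandler(String namePrefix,
      Thread.UncaughtExceptionHandler handler) {
    AtomicInteger seq = new AtomicInteger();
    return runnable -> {
      Thread t = new Thread(runnable, namePrefix + "-" + seq.getAndIncrement());
      t.setDaemon(true);
      // Without this, a Throwable escaping the task goes to the default handler,
      // which may not write to the application log.
      t.setUncaughtExceptionHandler(handler);
      return t;
    };
  }

  /**
   * Runs one failing task via execute() and returns the Throwable the handler saw,
   * or null if the handler was not called within the timeout.
   */
  static Throwable runFailingTask() {
    AtomicReference<Throwable> seen = new AtomicReference<>();
    CountDownLatch handled = new CountDownLatch(1);
    Thread.UncaughtExceptionHandler handler = (thread, error) -> {
      System.err.println("Thread " + thread.getName() + " died: " + error);
      seen.set(error);
      handled.countDown();
    };
    ExecutorService pool =
        Executors.newFixedThreadPool(1, factoryWithHandler("demo-worker", handler));
    try {
      // execute() lets the exception escape the worker thread, so the handler runs.
      pool.execute(() -> {
        throw new IllegalStateException("simulated failure");
      });
      handled.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      pool.shutdownNow();
    }
    return seen.get();
  }

  public static void main(String[] args) {
    System.out.println("handler saw: " + runFailingTask());
  }
}
```

Two caveats on this sketch. The handler is invoked only for tasks started with `execute()`; with `submit()`, the exception is captured in the returned `Future` and the handler is never called. A handler also only reports a thread that has terminated; it does nothing for a thread that is blocked, which is the other possibility the description leaves open.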
> We might need to setUncaughtExceptionHandler
> [here|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/executor/ExecutorService.java#L255]
> Related logs:
> [HMASTER-4]
> 2024-12-05 14:55:11,264 INFO [PEWorker-41] procedure2.ProcedureExecutor - Initialized subprocedures=[{pid=6003288, ppid=6002575, state=RUNNABLE; SplitWALRemoteProcedure regionserver-53.regionserver.hbase.hbasexxxxxx.%2Cxxxxx%2C1730886070174.1733410178680, worker=regionserver-15.regionserver.xxxxx,1730878238028}]
> [RS-15]
> 2024-12-05 14:55:11,461 DEBUG [iority.RWQ.Fifo.read.handler=83,queue=2,port=xxxx] regionserver.RSRpcServices - Executing remote procedure class org.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288
> [RS-15]
> 2024-12-05 14:55:54,689 ERROR [split-log-closeStream-pool-0] hdfs.DataStreamer - No ack received, took 25002ms (threshold=25000ms). File being written: /hbase/data/default/tsdb/c997c5f8dd36481dcd3ebb9b79a35b51/recovered.edits/0000000000539451088-regionserver-53.regionserver.hbase.hbasexxa.xxxxx.is%2Cxxxxx%2C1730886070174.1733410178680.temp, block: BP-1745262640-10.60.130.13-1712173738392:blk_1330710217_257120451, Write pipeline datanodes: [DatanodeInfoWithStorage[10.60.52.107:xxxxx,DS-f2b7ba1a-68b5-433a-9fe8-99315a172098,SSD], DatanodeInfoWithStorage[10.60.75.52:xxxxxx,DS-93a433be-972f-4457-92ae-dd07288e41b5,SSD]].
> [HMASTER-1]
> 2024-12-05 18:11:42,036 DEBUG [master/hmaster-1:xxxxx:becomeActiveMaster] store.ProcedureTree - Procedure Procedure(pid=6003288, ppid=6002575, class=org.apache.hadoop.hbase.master.procedure.SplitWALRemoteProcedure) stack ids=[3592]
> [HMASTER-1]
> 2024-12-05 18:11:42,214 DEBUG [master/hmaster-1:xxxxx:becomeActiveMaster] procedure2.ProcedureExecutor - Loading pid=6003288, ppid=6002575, state=RUNNABLE; SplitWALRemoteProcedure regionserver-53.regionserver.hbase.xxxxxx.is%2C60020%2C1730886070174.1733410178680, worker=regionserver-15.regionserver.hbase.hbasexxa.xxxxx.is,xxxxx,1730878238028
> [RS-15]
> 2024-12-05 18:11:42,769 DEBUG [iority.RWQ.Fifo.read.handler=79,queue=7,port=60020] regionserver.RSRpcServices - Executing remote procedure class org.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288
> [RS-15]
> 2024-12-05 18:11:48,247 DEBUG [_REPLAY_OPS-regionserver/regionserver-15:xxxxx-192] regionserver.RemoteProcedureResultReporter - Successfully complete execution of pid=6003288
> [HMASTER-1]
> 2024-12-05 18:11:48,304 INFO [PEWorker-2] procedure2.ProcedureExecutor - Finished pid=6002575, ppid=6000775, state=SUCCESS; SplitWALProcedure regionserver-53.regionserver.hbase.hbasexxa.xxxxxxx.is%2Cxxxxx%2C1730886070174.1733410178680, worker=regionserver-15.regionserver.hbase.hbasexxa.xxxxxx,xxxxxx,1730878238028 in 3 hrs, 16 mins, 52.806 sec

--
This message was sent by Atlassian Jira
(v8.20.10#820010)