[
https://issues.apache.org/jira/browse/HBASE-21316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
justice updated HBASE-21316:
----------------------------
Labels: easyfix (was: )
Attachment: 0001-add-catch-for-ArrayIndexOutOfBoundsException-when-ch.patch
Tags: HBASE-21316,WALSplitter
Status: Patch Available (was: Open)
> All RegionServer Down when RS_LOG_REPLAY_OPS
> --------------------------------------------
>
> Key: HBASE-21316
> URL: https://issues.apache.org/jira/browse/HBASE-21316
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 2.0.0
> Reporter: justice
> Priority: Major
> Labels: easyfix
> Attachments:
> 0001-add-catch-for-ArrayIndexOutOfBoundsException-when-ch.patch, log.tgz
>
>
> 1. One RegionServer died for an unknown reason. Its log follows:
> {code:java}
> 2018-10-14 20:31:47,423 INFO [main-SendThread(11.3.20.101:2181)] zookeeper.ClientCnxn: Socket connection established to 11.3.20.101/11.3.20.101:2181, initiating session
> 2018-10-14 20:31:47,433 INFO [main-SendThread(11.3.20.101:2181)] zookeeper.ClientCnxn: Session establishment complete on server 11.3.20.101/11.3.20.101:2181, sessionid = 0x6500073f944a8e79, negotiated timeout = 30000
> 2018-10-14日 Sunday 21:03:05 CST Starting regionserver on 11-3-19-199.JD.LOCAL
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> {code}
> 2. The Master received the ZooKeeper node-deleted event and started a ServerCrashProcedure:
> {code:java}
> 2018-10-14 20:31:47,437 INFO [main-EventThread] master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [11-3-19-199.jd.local,16020,1539492869470]
> 2018-10-14 20:31:47,539 INFO [PEWorker-1] procedure.ServerCrashProcedure: Start pid=25053, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=11-3-19-199.jd.local,16020,1539492869470, splitWal=true, meta=false
> 2018-10-14 20:31:47,550 INFO [PEWorker-1] master.SplitLogManager: Started splitting 63 logs in [hdfs://11-3-18-67.JD.LOCAL:9000/hbase/WALs/11-3-19-199.jd.local,16020,1539492869470-splitting] for [11-3-19-199.jd.local,16020,1539492869470]
> ...
> 2018-10-14 20:31:48,592 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Task /hbase/splitWAL/WALs%2F11-3-19-199.jd.local%2C16020%2C1539492869470-splitting%2F11-3-19-199.jd.local%252C16020%252C1539492869470.1539520250598 acquired by 11-3-18-71.jd.local,16020,1539492869409
> {code}
> 3. A live RegionServer acquired the split task in its SplitLogWorker, hit an error, and stopped:
> {code:java}
> 2018-10-14 20:31:48,602 INFO [SplitLogWorker-11-3-18-71:16020] coordination.ZkSplitLogWorkerCoordination: worker 11-3-18-71.jd.local,16020,1539492869409 acquired task /hbase/splitWAL/WALs%2F11-3-19-199.jd.local%2C16020%2C1539492869470-splitting%2F11-3-19-199.jd.local%252C16020%252C1539492869470.1539520250598
>
> ...
> 2018-10-14 21:03:26,219 ERROR [RS_LOG_REPLAY_OPS-regionserver/11-3-18-71:16020-1] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
> java.lang.ArrayIndexOutOfBoundsException: 8811
> at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
> at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
> at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
> at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
> at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:102)
> at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:107)
> at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:296)
> at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:194)
> at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:99)
> at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70)
> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-10-14 21:03:26,227 ERROR [RS_LOG_REPLAY_OPS-regionserver/11-3-18-71:16020-1] regionserver.HRegionServer: ***** ABORTING region server 11-3-18-71.jd.local,16020,1539522186368: Caught throwable while processing event RS_LOG_REPLAY *****
> ....
> 2018-10-14 20:31:48,780 INFO [RS_LOG_REPLAY_OPS-regionserver/11-3-18-71:16020-0] regionserver.HRegionServer: ***** STOPPING region server '11-3-18-71.jd.local,16020,1539492869409' *****
> {code}
> 4. The other live RegionServers then died one by one, until every RegionServer was down.
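
The attachment name suggests the patch catches the ArrayIndexOutOfBoundsException raised while checking a WAL entry. The sketch below is not that patch and does not use HBase classes. It only illustrates the general idea under stated assumptions: a split task that treats a malformed entry as a task failure instead of letting an unchecked exception escape and abort the whole server. The class SplitTaskSketch, the Status enum, the methods familyLength and splitEntries, and the simplified byte layout are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in HBase.
class SplitTaskSketch {
    enum Status { DONE, ERR }

    // Assumed toy layout (NOT the real KeyValue format):
    // [2-byte row length][row bytes][1-byte family length][family bytes]
    static int familyLength(byte[] buf, int offset) {
        int rowLen = ((buf[offset] & 0xff) << 8) | (buf[offset + 1] & 0xff);
        // A corrupt row length pushes this index past the end of the buffer,
        // which raises ArrayIndexOutOfBoundsException.
        return buf[offset + 2 + rowLen];
    }

    // Walks the entries and reports ERR for the task when an entry is
    // malformed, rather than propagating the unchecked exception.
    static Status splitEntries(List<byte[]> entries) {
        for (byte[] entry : entries) {
            try {
                familyLength(entry, 0);
            } catch (ArrayIndexOutOfBoundsException e) {
                return Status.ERR;
            }
        }
        return Status.DONE;
    }

    public static void main(String[] args) {
        byte[] good = {0, 1, 'r', 1, 'f'};   // row length 1, family length 1
        byte[] bad = {0x7f, 0x7f, 'r'};      // row length far beyond the buffer
        System.out.println(splitEntries(Arrays.asList(good, good)));
        System.out.println(splitEntries(Arrays.asList(good, bad)));
    }
}
```

Whether a failed task should be retried, skipped, or the file moved aside is a separate policy question that this sketch does not address.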
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)