[
https://issues.apache.org/jira/browse/HBASE-10922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962566#comment-13962566
]
stack commented on HBASE-10922:
-------------------------------
The stuck log first shows up here:
122816 2014-04-06 19:50:22,096 INFO [M_LOG_REPLAY_OPS-e1120:36020-9] master.SplitLogManager: started splitting 17 logs in [hdfs://e1120:35802/hbase/WALs/e1320.halxg.cloudera.com,36020,1396838053131-splitting]
The task is acquired by e1121.halxg.cloudera.com,36020,1396838094934
Over on e1121, I see the task acquired and the lease successfully obtained, and then nothing else. The logs are silent thereafter on events related to the splitting of 1396838053131.1396838130980.
Over on master we wait and then resubmit:
124174 2014-04-06 19:52:23,291 INFO [e1120.halxg.cloudera.com,36020,1396838229159.splitLogManagerTimeoutMonitor] master.SplitLogManager: resubmitting task /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
The log gets assigned to the master (combined master + regionserver).
We then crash; the master restarts at "Sun Apr 6 19:52:52 PDT 2014 Starting master on e1120.halxg.cloudera.com".
When the master comes back up, it finds the orphaned log-splitting task and requeues it. It complains about the znode already existing. The task then spends tens of seconds not being acquired, and then gets resubmitted. Is it not being split fast enough? Is splitting going really slowly? It gets resubmitted a few times.
Eventually we get this:
2014-04-06 20:25:54,708 INFO [e1120.halxg.cloudera.com,36020,1396839174186.splitLogManagerTimeoutMonitor] master.SplitLogManager: Skipping resubmissions of task /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980 because threshold 3 reached
Should this be WARN? Or ERROR?
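For context, here is a rough sketch of the kind of bookkeeping presumably behind that message (the threshold presumably being hbase.splitlog.max.resubmit, default 3). The class and field names are illustrative, not the actual SplitLogManager code; the point is just where a WARN would make the give-up stand out:
{code}
// Illustrative sketch only -- not the real SplitLogManager internals.
// A per-task resubmit counter is checked against a maximum (presumably
// hbase.splitlog.max.resubmit, default 3); once the threshold is hit the
// manager stops handing the task to other workers.
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ResubmitSketch {
  private static final Log LOG = LogFactory.getLog(ResubmitSketch.class);
  private static final int MAX_RESUBMIT = 3;   // assumed default

  static class Task {
    final String znodePath;                    // split-task znode
    final AtomicInteger resubmits = new AtomicInteger(0);
    Task(String znodePath) { this.znodePath = znodePath; }
  }

  /** @return true if the task was handed out again, false if we gave up. */
  boolean maybeResubmit(Task task) {
    if (task.resubmits.get() >= MAX_RESUBMIT) {
      // The message in question -- WARN (or ERROR) would make it stand out,
      // since from here on only the current owner can finish the task.
      LOG.warn("Skipping resubmissions of task " + task.znodePath
          + " because threshold " + MAX_RESUBMIT + " reached");
      return false;
    }
    task.resubmits.incrementAndGet();
    LOG.info("resubmitting task " + task.znodePath);
    return true;
  }
}
{code}
If that is roughly what happens, then once the threshold is hit the task stays parked on whoever currently owns it.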
It seems like we keep going though, still trying to split. Maybe the message above means we won't retry against another server anymore? We seem to get stuck waiting on the one server to return.
Is the IOE we're talking about above this one:
{code}
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): org.apache.hadoop.hbase.NotServingRegionException: Region d900a9b5ab41c96cf9ad0fb5f039fa8b is not online on e1418.halxg.cloudera.com,36020,1396841264248
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2454)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2436)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.replay(RSRpcServices.java:1295)
    at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20056)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2002)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:162)
    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
    at java.lang.Thread.run(Thread.java:724)
    at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1499)
    at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
    at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1726)
    at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replay(AdminProtos.java:20926)
    at org.apache.hadoop.hbase.regionserver.wal.WALEditsReplaySink$ReplayServerCallable.replayToServer(WALEditsReplaySink.java:215)
    ... 10 more
{code}
... where the replay gets a NSRE?
We keep retrying... I see one of these:
{code}
2014-04-06 20:39:30,582 WARN [RS_LOG_REPLAY_OPS-e1120:36020-1] regionserver.SplitLogWorker: BADVERSION failed to assert ownership for /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:407)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.attemptToOwnTask(SplitLogWorker.java:393)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:460)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$OutputSink.finishWriting(HLogSplitter.java:954)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogReplayOutputSink.finishWritingAndClose(HLogSplitter.java:1683)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:360)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:224)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:143)
    at org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.process(HLogSplitterHandler.java:79)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
2014-04-06 20:39:30,583 WARN [RS_LOG_REPLAY_OPS-e1120:36020-1] regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
2014-04-06 20:39:30,583 WARN [RS_LOG_REPLAY_OPS-e1120:36020-1] regionserver.SplitLogWorker: log splitting of WALs/e1320.halxg.cloudera.com,36020,1396838053131-splitting/e1320.halxg.cloudera.com%2C36020%2C1396838053131.1396838130980 failed, returning error
{code}
and this ...
{code}
2014-04-06 20:39:30,590 WARN [RS_LOG_REPLAY_OPS-e1120:36020-1] handler.HLogSplitterHandler: transisition task /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980 to ERR e1120.halxg.cloudera.com,36020,1396839174186 failed because of version mismatch
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:407)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:878)
    at org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.endTask(HLogSplitterHandler.java:122)
    at org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.process(HLogSplitterHandler.java:91)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
...
2014-04-06 20:39:30,590 INFO [RS_LOG_REPLAY_OPS-e1120:36020-1] handler.HLogSplitterHandler: worker e1120.halxg.cloudera.com,36020,1396839174186 done with task /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980 in 2788095ms
{code}
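An aside on the BADVERSION above: attemptToOwnTask heartbeats by writing to the task znode with an expected version, so ZooKeeper rejects the write if anything else (here presumably the master resubmitting the task) has touched the znode in between. A minimal, self-contained sketch of that version-checked setData pattern, with a made-up znode path:
{code}
// Minimal illustration of ZooKeeper's version-checked setData, the mechanism
// behind the BADVERSION above. The znode path and payload are made up.
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class OwnTaskSketch {
  /** @return true if we still own the task, false if someone else changed it. */
  static boolean heartbeatTask(ZooKeeper zk, String taskZnode, byte[] ownerData)
      throws KeeperException, InterruptedException {
    Stat stat = zk.exists(taskZnode, false);
    if (stat == null) {
      return false;                               // task znode is gone
    }
    try {
      // Conditional write: succeeds only if the znode still has this version.
      zk.setData(taskZnode, ownerData, stat.getVersion());
      return true;
    } catch (KeeperException.BadVersionException e) {
      // Another actor (e.g. the master resubmitting the task) updated the
      // znode first -- this is exactly the BADVERSION reported in the log.
      return false;
    }
  }
}
{code}
The later transition to ERR apparently goes through the same kind of conditional write, which is why it fails with the same BadVersionException.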
That 2788095 ms entry looks like an old, long-running task, because we are still trying against e1321. The server is stuck waiting on 4 tasks to complete, including this particular log file. It eventually fails after 40-odd minutes on e1321:
2014-04-06 21:08:53,346 INFO [main-EventThread] master.SplitLogManager: task /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980 entered state: ERR e1321.halxg.cloudera.com,36020,1396841215304
Then we are able to split it on the master.
On e1321 I see this:
{code}
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): org.apache.hadoop.hbase.NotServingRegionException: Region d900a9b5ab41c96cf9ad0fb5f039fa8b is not online on e1418.halxg.cloudera.com,36020,1396841264248
{code}
... just before we error out on log splitting. Maybe this is the cause of our erroring out? Are we retrying this region over and over? d900a9b5ab41c96cf9ad0fb5f039fa8b is online, but I can see folks are confused about where it is.
Does DLR update its region location if it fails to find the region at the presumed location?
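If it does not, something like the following is what I would expect it to do. This is purely an illustrative sketch of the desired behaviour, not the actual WALEditsReplaySink code; locateRegion() and replayEdits() are hypothetical helpers:
{code}
// Hypothetical sketch of the behaviour being asked about: on an NSRE, refresh
// the cached region location before retrying, instead of replaying against
// the stale server over and over. locateRegion() and replayEdits() are
// made-up helpers standing in for the real lookup and replay calls.
import java.io.IOException;

import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.NotServingRegionException;

public abstract class ReplayRelocateSketch {
  abstract HRegionLocation locateRegion(boolean reload) throws IOException;
  abstract void replayEdits(HRegionLocation loc) throws IOException;

  void replayWithRelocation(int maxRetries) throws IOException {
    HRegionLocation loc = locateRegion(false);    // cached location first
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        replayEdits(loc);
        return;                                   // replay succeeded
      } catch (NotServingRegionException nsre) {
        // Region is not where we thought it was: force a fresh lookup and
        // retry, rather than hammering the same (wrong) regionserver.
        loc = locateRegion(true);
      }
    }
    throw new IOException("replay failed after " + (maxRetries + 1) + " attempts");
  }
}
{code}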
> Log splitting status should always be closed
> --------------------------------------------
>
> Key: HBASE-10922
> URL: https://issues.apache.org/jira/browse/HBASE-10922
> Project: HBase
> Issue Type: Bug
> Components: wal
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Minor
> Attachments: log-splitting_hang.png, master-log-grep.txt,
> master.log.gz
>
>
> With distributed log replay enabled by default, I ran into an issue where
> log splitting hasn't completed after 13 hours. It seems to hang somewhere.