[ https://issues.apache.org/jira/browse/HBASE-10922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962566#comment-13962566 ]

stack commented on HBASE-10922:
-------------------------------

The first mention of the stuck log is this:

122816 2014-04-06 19:50:22,096 INFO  [M_LOG_REPLAY_OPS-e1120:36020-9] master.SplitLogManager: started splitting 17 logs in [hdfs://e1120:35802/hbase/WALs/e1320.halxg.cloudera.com,36020,1396838053131-splitting]

The task is acquired by e1121.halxg.cloudera.com,36020,1396838094934

Over on e1121, I see the task acquired, the lease successfully obtained and 
then nothing else.  We are silent hereafter on events related to the splitting 
of 1396838053131.1396838130980

Over on the master we wait and then resubmit:

124174 2014-04-06 19:52:23,291 INFO  [e1120.halxg.cloudera.com,36020,1396838229159.splitLogManagerTimeoutMonitor] master.SplitLogManager: resubmitting task /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980

The log gets assigned to the master (combined master + regionserver).
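
For context on what the timeout monitor is doing here: it resubmits a task when the owning worker has stopped heartbeating it for longer than the task timeout. A rough sketch of that check under my assumptions (the exact config key, something like hbase.splitlog.manager.timeout, may differ); this is not the real SplitLogManager loop:

{code}
// Illustrative sketch of the timeout-monitor decision, not SplitLogManager itself.
// Assumes each task records the wall-clock time of its last worker heartbeat.
class TimeoutMonitorSketch {
  private final long taskTimeoutMs; // roughly the ~2 minute gap between submit and resubmit above

  TimeoutMonitorSketch(long taskTimeoutMs) {
    this.taskTimeoutMs = taskTimeoutMs;
  }

  /** A task with no heartbeat for longer than the timeout is a candidate for resubmission. */
  boolean isExpired(long lastHeartbeatMs, long nowMs) {
    return (nowMs - lastHeartbeatMs) > taskTimeoutMs;
  }
}
{code}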

We then crash; at Sun Apr  6 19:52:52 PDT 2014 we see "Starting master on e1120.halxg.cloudera.com".

When the master comes back up, it finds the orphaned log splitting and requeues the task. It complains about the znode already existing. The task then spends tens of seconds not being acquired before it gets resubmitted. Is it not being split fast enough? Is splitting going really slowly? It gets resubmitted a few times.
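
The "znode already existing" complaint is what I'd expect on requeue of an orphaned task: the task znode survived the crash, so the create is best-effort. A minimal sketch of that pattern against the plain ZooKeeper API (not the actual SplitLogManager code):

{code}
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;

// Illustrative only: enqueue a split task, tolerating a znode left over from before the crash.
class RequeueSketch {
  static void enqueueTask(ZooKeeper zk, String taskZnode, byte[] taskData, List<ACL> acl)
      throws KeeperException, InterruptedException {
    try {
      zk.create(taskZnode, taskData, acl, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      // Orphaned task from before the crash; the existing znode gets picked up as-is.
    }
  }
}
{code}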

Eventually we get this:

2014-04-06 20:25:54,708 INFO  [e1120.halxg.cloudera.com,36020,1396839174186.splitLogManagerTimeoutMonitor] master.SplitLogManager: Skipping resubmissions of task /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980 because threshold 3 reached

Should this be WARN?  Or ERROR?

It seems like we keep going though, still trying to split. Maybe the above means we won't retry against another server anymore? We seem to get stuck waiting on the one server to return.
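
For reference, I believe the cap in the "threshold 3 reached" message is the resubmit limit (hbase.splitlog.max.resubmit, default 3, if I have the key right), after which the timeout monitor stops handing the task to other workers. A rough sketch of that decision, not the actual SplitLogManager code:

{code}
// Illustrative sketch only: the resubmit cap the timeout monitor enforces.
class ResubmitPolicy {
  private final int maxResubmits; // e.g. conf.getInt("hbase.splitlog.max.resubmit", 3), if that is the key

  ResubmitPolicy(int maxResubmits) {
    this.maxResubmits = maxResubmits;
  }

  /** Returns true if a timed-out task should be handed to another worker. */
  boolean shouldResubmit(int resubmitsSoFar) {
    // Once the cap is hit we log "Skipping resubmissions ... because threshold N reached"
    // and the task stays with whichever worker currently owns it.
    return resubmitsSoFar < maxResubmits;
  }
}
{code}

Which would explain why we end up stuck waiting on the one server.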

Is the IOE we're talking about above this one:

{code}
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): org.apache.hadoop.hbase.NotServingRegionException: Region d900a9b5ab41c96cf9ad0fb5f039fa8b is not online on e1418.halxg.cloudera.com,36020,1396841264248
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2454)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2436)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.replay(RSRpcServices.java:1295)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20056)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2002)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:162)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
        at java.lang.Thread.run(Thread.java:724)

        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1499)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1726)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replay(AdminProtos.java:20926)
        at org.apache.hadoop.hbase.regionserver.wal.WALEditsReplaySink$ReplayServerCallable.replayToServer(WALEditsReplaySink.java:215)
        ... 10 more
{code}

... where the replay gets a NSRE?

We keep retrying... I see one of these:

{code}
2014-04-06 20:39:30,582 WARN  [RS_LOG_REPLAY_OPS-e1120:36020-1] regionserver.SplitLogWorker: BADVERSION failed to assert ownership for /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:407)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.attemptToOwnTask(SplitLogWorker.java:393)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:460)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$OutputSink.finishWriting(HLogSplitter.java:954)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogReplayOutputSink.finishWritingAndClose(HLogSplitter.java:1683)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:360)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:224)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:143)
        at org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.process(HLogSplitterHandler.java:79)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
2014-04-06 20:39:30,583 WARN  [RS_LOG_REPLAY_OPS-e1120:36020-1] regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
2014-04-06 20:39:30,583 WARN  [RS_LOG_REPLAY_OPS-e1120:36020-1] regionserver.SplitLogWorker: log splitting of WALs/e1320.halxg.cloudera.com,36020,1396838053131-splitting/e1320.halxg.cloudera.com%2C36020%2C1396838053131.1396838130980 failed, returning error
{code}

and this.....

{code}
2014-04-06 20:39:30,590 WARN  [RS_LOG_REPLAY_OPS-e1120:36020-1] handler.HLogSplitterHandler: transisition task /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980 to ERR e1120.halxg.cloudera.com,36020,1396839174186 failed because of version mismatch
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:407)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:878)
        at org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.endTask(HLogSplitterHandler.java:122)
        at org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.process(HLogSplitterHandler.java:91)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
{code}

...

{code}
2014-04-06 20:39:30,590 INFO  [RS_LOG_REPLAY_OPS-e1120:36020-1] handler.HLogSplitterHandler: worker e1120.halxg.cloudera.com,36020,1396839174186 done with task /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980 in 2788095ms
{code}

This seems like an old, still-running task, because we are still trying against e1321.
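
For context on the BADVERSION: a worker asserts ownership of a task by writing the task znode with the version it last saw, so a stale worker (one the master has since resubmitted away from) fails exactly like this. A minimal sketch of that pattern against the plain ZooKeeper API, not the actual attemptToOwnTask code:

{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch of the ownership-heartbeat pattern, not HBase's SplitLogWorker.
class TaskOwnershipSketch {
  /**
   * Re-assert ownership of a split task znode. Returns the new znode version, or -1 if
   * someone else (e.g. the master resubmitting the task) has modified the znode since we
   * last read it, which surfaces as a BadVersionException like the one above.
   */
  static int heartbeatTask(ZooKeeper zk, String taskZnode, byte[] ownerData, int lastSeenVersion)
      throws KeeperException, InterruptedException {
    try {
      return zk.setData(taskZnode, ownerData, lastSeenVersion).getVersion();
    } catch (KeeperException.BadVersionException e) {
      // Version moved on: this worker no longer owns the task and should give up on it.
      return -1;
    }
  }
}
{code}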

The server is stuck waiting on 4 tasks to complete, including this particular log file.

It eventually fails after 40-odd minutes on e1321:

2014-04-06 21:08:53,346 INFO  [main-EventThread] master.SplitLogManager: task /hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980 entered state: ERR e1321.halxg.cloudera.com,36020,1396841215304

Then we are able to split it on the master.

On e1321 I see this:

Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): org.apache.hadoop.hbase.NotServingRegionException: Region d900a9b5ab41c96cf9ad0fb5f039fa8b is not online on e1418.halxg.cloudera.com,36020,1396841264248

.... just before we error out on log splitting. Maybe this is the cause of our erroring out? Are we retrying this region over and over?

d900a9b5ab41c96cf9ad0fb5f039fa8b is online, but I can see folks are confused about where it is.

Does DLR update its location if it fails to find the region in the presumed location?
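
What I'd expect here (whether or not the replay sink actually does it) is that an NSRE is treated as a stale location: drop the cached entry and re-look the region up before the next attempt, rather than hammering the same server. A hypothetical sketch only; RegionLocationCache and ReplayCall are made up for illustration and are not HBase client classes:

{code}
import java.io.IOException;

import org.apache.hadoop.hbase.NotServingRegionException;

// Hypothetical illustration of NSRE handling during replay; the interfaces are invented.
class ReplayRetrySketch {
  interface RegionLocationCache {
    /** Returns the server currently hosting the region; reload=true forces a fresh lookup. */
    String locate(String encodedRegionName, boolean reload) throws IOException;
  }

  interface ReplayCall {
    void replayTo(String serverName) throws IOException;
  }

  static void replayWithRelocation(RegionLocationCache cache, ReplayCall call,
      String encodedRegionName, int maxRetries) throws IOException {
    String server = cache.locate(encodedRegionName, false);
    for (int attempt = 0; ; attempt++) {
      try {
        call.replayTo(server);
        return;
      } catch (NotServingRegionException nsre) {
        if (attempt >= maxRetries) throw nsre;
        // The region moved: refresh the cached location instead of retrying the same
        // (wrong) server over and over, which is what the logs above look like.
        server = cache.locate(encodedRegionName, true);
      }
    }
  }
}
{code}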

> Log splitting status should always be closed
> --------------------------------------------
>
>                 Key: HBASE-10922
>                 URL: https://issues.apache.org/jira/browse/HBASE-10922
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>            Priority: Minor
>         Attachments: log-splitting_hang.png, master-log-grep.txt, 
> master.log.gz
>
>
> With distributed log replay enabled by default, I ran into an issue that log 
> splitting hasn't completed after 13 hours. It seems to hang somewhere.


