[ https://issues.apache.org/jira/browse/HBASE-10922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962244#comment-13962244 ]

stack commented on HBASE-10922:
-------------------------------

The log is eventually split about 95 minutes later:

{code}
2014-04-06 21:28:12,180 INFO  [RS_LOG_REPLAY_OPS-e1120:36020-1] 
handler.HLogSplitterHandler: successfully transitioned task 
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
 to final state DONE e1120.halxg.cloudera.com,36020,1396839174186
2014-04-06 21:28:12,180 INFO  [RS_LOG_REPLAY_OPS-e1120:36020-1] 
handler.HLogSplitterHandler: worker 
e1120.halxg.cloudera.com,36020,1396839174186 done with task 
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
 in 1891ms
2014-04-06 21:28:12,181 INFO  [main-EventThread] master.SplitLogManager: task 
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
 entered state: DONE e1120.halxg.cloudera.com,36020,1396839174186
2014-04-06 21:28:12,193 INFO  [main-EventThread] wal.HLogSplitter: Archived 
processed log 
hdfs://e1120:35802/hbase/WALs/e1320.halxg.cloudera.com,36020,1396838053131-splitting/e1320.halxg.cloudera.com%2C36020%2C1396838053131.1396838130980
 to 
hdfs://e1120:35802/hbase/oldWALs/e1320.halxg.cloudera.com%2C36020%2C1396838053131.1396838130980
2014-04-06 21:28:12,194 INFO  [main-EventThread] master.SplitLogManager: Done 
splitting 
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
{code}
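
For context, the DONE transition in the log above is just a rewrite of the task 
znode's data. Here is a minimal sketch of the idea using the raw ZooKeeper 
client (class and method names are hypothetical, and real HBase serializes a 
SplitLogTask rather than a plain string):

{code}
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class MarkTaskDone {
  // Hypothetical sketch: a worker finishes a split and flips the task znode
  // to DONE. A plain string stands in for HBase's serialized SplitLogTask.
  public static void markDone(ZooKeeper zk, String taskZnode, String workerName)
      throws KeeperException, InterruptedException {
    byte[] done = ("DONE " + workerName).getBytes(StandardCharsets.UTF_8);
    // Version -1 means unconditional; HBase passes the version it read when
    // it acquired the task, so a lost-ownership race surfaces as
    // KeeperException.BadVersionException instead of a silent overwrite.
    zk.setData(taskZnode, done, -1);
  }
}
{code}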

TODO: Why is the task not removed from the UI list of tasks?

The master is restarted around 19:52. We had been having trouble with this log 
before the restart.

Here is some history on this log:

{code}
2014-04-06 19:50:22,222 DEBUG [main-EventThread] master.SplitLogManager: put up 
splitlog task at znode 
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
...
2014-04-06 19:50:22,280 INFO  [main-EventThread] master.SplitLogManager: task 
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
 acquired by e1121.halxg.cloudera.com,36020,1396838094934
...

{code}

The task is acquired by a machine and it starts to work, but then nothing more...

{code}
2014-04-06 19:50:22,305 INFO  [RS_LOG_REPLAY_OPS-e1121:36020-0] 
util.FSHDFSUtils: recoverLease=true, attempt=0 on 
file=hdfs://e1120:35802/hbase/WALs/e1320.halxg.cloudera.com,36020,1396838053131-splitting/e1320.halxg.cloudera.com%2C36020%2C1396838053131.1396838130980
 after 1ms
{code}

... I don't see the "Processed N edits ..." for this file.  I see it for others.
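
The recoverLease=true line above is HDFS lease recovery succeeding on the 
first attempt. A minimal sketch of what that loop looks like, assuming the 
stock DistributedFileSystem API (FSHDFSUtils' real retry policy is more 
involved):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RecoverWalLease {
  // Illustrative sketch: a WAL left open by a dead regionserver cannot be
  // read to its true end until the NameNode revokes the old writer's lease,
  // so the splitter loops on recoverLease before opening the file.
  public static void recover(Configuration conf, Path wal) throws Exception {
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    int attempt = 0;
    // recoverLease returns true once the file is closed and safe to read.
    while (!dfs.recoverLease(wal)) {
      attempt++;
      System.out.println("recoverLease=false, attempt=" + attempt
          + " on file=" + wal);
      Thread.sleep(1000L);  // fixed pause for illustration; HBase backs off
    }
  }
}
{code}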

We then 'resubmit' the task on the master:

{code}
2014-04-06 19:52:23,291 INFO  
[e1120.halxg.cloudera.com,36020,1396838229159.splitLogManagerTimeoutMonitor] 
master.SplitLogManager: resubmitting task 
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
{code}
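
The resubmit is driven by the splitLogManagerTimeoutMonitor thread named in 
the log. A hedged sketch of the mechanism (the Task fields and the threshold 
are illustrative, not HBase's actual values):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TimeoutMonitorSketch {
  // Hypothetical Task bookkeeping; SplitLogManager keeps richer state.
  static final class Task {
    volatile long lastUpdate;  // last heartbeat seen on the task znode
    volatile int resubmits;    // times the task was handed back out
  }

  static final long TIMEOUT_MS = 120_000L;  // illustrative threshold

  final Map<String, Task> tasks = new ConcurrentHashMap<>();

  // Periodic chore: any task whose worker has gone quiet is resubmitted,
  // i.e. its znode is set back to UNASSIGNED so another worker can grab it.
  void chore() {
    long now = System.currentTimeMillis();
    for (Map.Entry<String, Task> e : tasks.entrySet()) {
      Task t = e.getValue();
      if (now - t.lastUpdate > TIMEOUT_MS) {
        t.resubmits++;
        System.out.println("resubmitting task " + e.getKey());
        t.lastUpdate = now;  // znode rewrite elided; reset on success
      }
    }
  }
}
{code}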

The master dies and comes back. We find the outstanding 'orphan' task and 
requeue a splitting task. We are waiting on it to complete:

{code}
2014-04-06 19:53:03,579 DEBUG [M_LOG_REPLAY_OPS-e1120:36020-2] 
master.SplitLogManager: Previously orphan task 
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
 is now being waited upon
{code}
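
The orphan adoption above boils down to the new master scanning the split-log 
znode for children it never created itself. A minimal sketch with the raw 
ZooKeeper client (names hypothetical; the real coordination lives in 
SplitLogManager):

{code}
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class OrphanScanSketch {
  // Hypothetical sketch: on startup the new master lists the children of
  // the split-log znode; anything it did not create is an orphan left by
  // the dead master, and it starts waiting on those tasks too.
  public static List<String> adoptOrphans(ZooKeeper zk)
      throws KeeperException, InterruptedException {
    List<String> children = zk.getChildren("/hbase/splitWAL", false);
    for (String child : children) {
      System.out.println("Previously orphan task " + child
          + " is now being waited upon");
    }
    return children;
  }
}
{code}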

Then, weirdly, we 'find' the preexisting znode and put up a new task:

{code}
2014-04-06 19:53:03,592 DEBUG [main-EventThread] 
master.SplitLogManager$CreateAsyncCallback: found pre-existing znode 
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
2014-04-06 19:53:03,593 DEBUG [main-EventThread] master.SplitLogManager: put up 
splitlog task at znode 
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
{code}
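
The "found pre-existing znode" followed by "put up splitlog task" pair is the 
create path hitting NODEEXISTS and carrying on. A minimal synchronous sketch 
(HBase actually does this through an async callback, per the 
CreateAsyncCallback in the log):

{code}
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class PutUpTaskSketch {
  // Illustrative sketch: putting up a split task tolerates a leftover znode
  // from before the restart, which is exactly the "found pre-existing
  // znode" path in the log above.
  public static void putUp(ZooKeeper zk, String taskZnode)
      throws KeeperException, InterruptedException {
    byte[] unassigned = "UNASSIGNED".getBytes(StandardCharsets.UTF_8);
    try {
      zk.create(taskZnode, unassigned,
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      // Reuse the old node as the task rather than failing.
      System.out.println("found pre-existing znode " + taskZnode);
    }
    System.out.println("put up splitlog task at znode " + taskZnode);
  }
}
{code}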


Then we resubmit tasks as if we are not making progress.

After the last restart, the SplitLogManager is stuck logging its status every 
5 seconds (?): that there are 4 tasks being split.

Finally, I see this: 

{code}
2014-04-06 21:08:53,346 INFO  [main-EventThread] master.SplitLogManager: task 
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
 entered state: ERR e1321.halxg.cloudera.com,36020,1396841215304
{code}

And then the log is successfully split, a few hours later.

Here is what it looks like over on the slave worker node:

{code}
2014-04-06 20:26:57,685 INFO  
[SplitLogWorker-e1321.halxg.cloudera.com,36020,1396841215304] 
regionserver.SplitLogWorker: worker 
e1321.halxg.cloudera.com,36020,1396841215304 acquired task 
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
....
2014-04-06 21:08:53,370 INFO  [RS_LOG_REPLAY_OPS-e1321:36020-0] 
handler.HLogSplitterHandler: successfully transitioned task 
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
 to final state ERR e1321.halxg.cloudera.com,36020,1396841215304
2014-04-06 21:08:53,370 INFO  [RS_LOG_REPLAY_OPS-e1321:36020-0] 
handler.HLogSplitterHandler: worker 
e1321.halxg.cloudera.com,36020,1396841215304 done with task 
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
 in 2515676ms
{code}

It took 42 minutes (2515676 ms) to fail the split.

On this retry, the replay actually worked.

I see a bunch of ERR failures over in a slave log. I wonder if it is an 
unassigned region? I don't see any other errors; the unassigned-region errors 
I do see look legit.

Any input [~jeffreyz]?  Thanks boss.





> Log splitting seems to hang
> ---------------------------
>
>                 Key: HBASE-10922
>                 URL: https://issues.apache.org/jira/browse/HBASE-10922
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>            Reporter: Jimmy Xiang
>         Attachments: log-splitting_hang.png, master-log-grep.txt
>
>
> With distributed log replay enabled by default, I ran into an issue where log 
> splitting hadn't completed after 13 hours. It seems to hang somewhere.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
