[
https://issues.apache.org/jira/browse/HBASE-10922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962244#comment-13962244
]
stack commented on HBASE-10922:
-------------------------------
The log is eventually split about 95 minutes later:
{code}
2014-04-06 21:28:12,180 INFO [RS_LOG_REPLAY_OPS-e1120:36020-1]
handler.HLogSplitterHandler: successfully transitioned task
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
to final state DONE e1120.halxg.cloudera.com,36020,1396839174186
2014-04-06 21:28:12,180 INFO [RS_LOG_REPLAY_OPS-e1120:36020-1]
handler.HLogSplitterHandler: worker
e1120.halxg.cloudera.com,36020,1396839174186 done with task
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
in 1891ms
2014-04-06 21:28:12,181 INFO [main-EventThread] master.SplitLogManager: task
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
entered state: DONE e1120.halxg.cloudera.com,36020,1396839174186
2014-04-06 21:28:12,193 INFO [main-EventThread] wal.HLogSplitter: Archived
processed log
hdfs://e1120:35802/hbase/WALs/e1320.halxg.cloudera.com,36020,1396838053131-splitting/e1320.halxg.cloudera.com%2C36020%2C1396838053131.1396838130980
to
hdfs://e1120:35802/hbase/oldWALs/e1320.halxg.cloudera.com%2C36020%2C1396838053131.1396838130980
2014-04-06 21:28:12,194 INFO [main-EventThread] master.SplitLogManager: Done
splitting
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
{code}
TODO: Why is the task not removed from the UI's list of tasks?
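For my own notes, here is roughly the bookkeeping I'd expect on the master side when the DONE transition arrives, keyed by the task znode path. Everything below is an illustrative sketch with invented names, not the actual SplitLogManager code; the point is just that if the in-memory entry never gets dropped, a finished task would keep showing up in whatever UI renders that map:
{code}
// Illustrative only -- identifiers are invented, not taken from SplitLogManager.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class DoneBookkeepingSketch {
  enum Status { IN_PROGRESS, SUCCESS, FAILURE }

  static final class Task {
    volatile Status status = Status.IN_PROGRESS;
    volatile String doneBy;
  }

  // outstanding split tasks, keyed by the task znode path
  final Map<String, Task> tasks = new ConcurrentHashMap<String, Task>();

  void handleTaskDone(String taskZnodePath, String workerName) {
    Task task = tasks.get(taskZnodePath);
    if (task == null) {
      return; // DONE for a task we are no longer tracking
    }
    task.status = Status.SUCCESS;
    task.doneBy = workerName;
    // The task znode would be deleted asynchronously at this point. If this
    // in-memory entry were never removed, the finished task would linger in
    // any UI built from the map.
    tasks.remove(taskZnodePath);
  }
}
{code}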
The master is restarted around 19:52. We had been having trouble with this log
before the restart.
Here is some history on this log:
{code}
2014-04-06 19:50:22,222 DEBUG [main-EventThread] master.SplitLogManager: put up
splitlog task at znode
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
...
2014-04-06 19:50:22,280 INFO [main-EventThread] master.SplitLogManager: task
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
acquired by e1121.halxg.cloudera.com,36020,1396838094934
...
{code}
The task is acquired by a worker, which starts in on it, but then nothing more:
{code}
2014-04-06 19:50:22,305 INFO [RS_LOG_REPLAY_OPS-e1121:36020-0]
util.FSHDFSUtils: recoverLease=true, attempt=0 on
file=hdfs://e1120:35802/hbase/WALs/e1320.halxg.cloudera.com,36020,1396838053131-splitting/e1320.halxg.cloudera.com%2C36020%2C1396838053131.1396838130980
after 1ms
{code}
I don't see a "Processed N edits ..." line for this file, though I see it for other files.
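That FSHDFSUtils line is the lease-recovery step before the WAL can be read. As a rough sketch of what it boils down to (an illustrative retry loop around DistributedFileSystem#recoverLease; the attempt count and sleep here are made up, this is not the FSHDFSUtils implementation):
{code}
// Illustrative lease-recovery loop; not the FSHDFSUtils code.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

class LeaseRecoverySketch {
  static boolean recoverWalLease(FileSystem fs, Path wal) throws Exception {
    if (!(fs instanceof DistributedFileSystem)) {
      return true; // nothing to recover on a non-HDFS filesystem
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;
    long start = System.currentTimeMillis();
    for (int attempt = 0; attempt < 10; attempt++) {
      // true once the NameNode has closed the file and finalized its last block
      boolean recovered = dfs.recoverLease(wal);
      System.out.println("recoverLease=" + recovered + ", attempt=" + attempt
          + " on file=" + wal + " after " + (System.currentTimeMillis() - start) + "ms");
      if (recovered) {
        return true;
      }
      Thread.sleep(1000L); // back off and retry
    }
    return false;
  }
}
{code}
Here the recovery itself succeeded on the first attempt, so the stall is somewhere after this point.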
We then 'resubmit' the task on the master:
{code}
2014-04-06 19:52:23,291 INFO
[e1120.halxg.cloudera.com,36020,1396838229159.splitLogManagerTimeoutMonitor]
master.SplitLogManager: resubmitting task
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
{code}
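The resubmit comes from the splitLogManagerTimeoutMonitor thread. Roughly the shape of such a monitor, sketched for illustration (the names, timeout, and polling interval are all invented; this is not the SplitLogManager code):
{code}
// Illustrative timeout monitor that resubmits tasks with no recent progress.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TimeoutMonitorSketch extends Thread {
  // e.g. ~2 minutes, which is about the 19:50:22 -> 19:52:23 gap in the log
  static final long TASK_TIMEOUT_MS = 120_000L;

  static final class Task {
    volatile long lastHeartbeatMs = System.currentTimeMillis();
    volatile int resubmits = 0;
  }

  final Map<String, Task> tasks = new ConcurrentHashMap<String, Task>();

  @Override
  public void run() {
    while (!isInterrupted()) {
      long now = System.currentTimeMillis();
      for (Map.Entry<String, Task> e : tasks.entrySet()) {
        Task t = e.getValue();
        if (now - t.lastHeartbeatMs > TASK_TIMEOUT_MS) {
          t.resubmits++;
          t.lastHeartbeatMs = now;
          System.out.println("resubmitting task " + e.getKey());
          // real code would also force the task znode back to UNASSIGNED so
          // another worker can pick it up
        }
      }
      try {
        Thread.sleep(1_000L);
      } catch (InterruptedException ie) {
        return;
      }
    }
  }
}
{code}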
The master dies and comes back. We find the outstanding 'orphan' task, requeue a
splitting task, and wait on it to complete:
{code}
2014-04-06 19:53:03,579 DEBUG [M_LOG_REPLAY_OPS-e1120:36020-2]
master.SplitLogManager: Previously orphan task
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
is now being waited upon
{code}
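The "Previously orphan task ... is now being waited upon" line is the new master adopting a task znode that already existed before it started. A minimal sketch of that adoption step, using the plain ZooKeeper client API for illustration rather than the actual SplitLogManager code:
{code}
// Illustrative orphan adoption on master restart; not the SplitLogManager code.
import java.util.List;
import java.util.Set;
import org.apache.zookeeper.ZooKeeper;

class OrphanAdoptionSketch {
  static final String SPLIT_DIR = "/hbase/splitWAL";

  static void adoptOrphans(ZooKeeper zk, Set<String> trackedTasks) throws Exception {
    // whatever task znodes survived the old master are adopted and waited upon
    List<String> children = zk.getChildren(SPLIT_DIR, false);
    for (String child : children) {
      String path = SPLIT_DIR + "/" + child;
      if (trackedTasks.add(path)) {
        System.out.println("Previously orphan task " + path + " is now being waited upon");
      }
    }
  }
}
{code}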
Then, weirdly, we 'find' the pre-existing znode and put up a new task:
{code}
2014-04-06 19:53:03,592 DEBUG [main-EventThread]
master.SplitLogManager$CreateAsyncCallback: found pre-existing znode
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
2014-04-06 19:53:03,593 DEBUG [main-EventThread] master.SplitLogManager: put up
splitlog task at znode
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
{code}
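That "found pre-existing znode" followed by "put up splitlog task" pair comes from the async create callback: the create fails with NODEEXISTS and the existing znode is simply adopted as the task. A sketch of that pattern against the raw ZooKeeper API (illustrative only, not the actual CreateAsyncCallback):
{code}
// Illustrative create-with-NODEEXISTS handling; not the actual CreateAsyncCallback.
import org.apache.zookeeper.AsyncCallback;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class CreateTaskSketch {
  static void putUpTask(ZooKeeper zk, String taskPath, byte[] taskData) {
    zk.create(taskPath, taskData, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT,
        new AsyncCallback.StringCallback() {
          @Override
          public void processResult(int rc, String path, Object ctx, String name) {
            if (KeeperException.Code.get(rc) == KeeperException.Code.NODEEXISTS) {
              // the znode is already there; adopt it as the task and carry on
              System.out.println("found pre-existing znode " + path);
            } else if (KeeperException.Code.get(rc) != KeeperException.Code.OK) {
              System.out.println("failed to create znode " + path + ": " + rc);
              return;
            }
            System.out.println("put up splitlog task at znode " + path);
          }
        }, null);
  }
}
{code}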
Then we resubmit tasks as though we are not making progress.
After the last restart, the SplitLogManager is stuck logging its status every 5 seconds
(?), reporting that there are 4 tasks still being split.
Finally, I see this:
{code}
2014-04-06 21:08:53,346 INFO [main-EventThread] master.SplitLogManager: task
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
entered state: ERR e1321.halxg.cloudera.com,36020,1396841215304
{code}
And then the log is successfully split. This is a few hours later.
Here is what we see over on the slave worker node:
{code}
2014-04-06 20:26:57,685 INFO
[SplitLogWorker-e1321.halxg.cloudera.com,36020,1396841215304]
regionserver.SplitLogWorker: worker
e1321.halxg.cloudera.com,36020,1396841215304 acquired task
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
....
2014-04-06 21:08:53,370 INFO [RS_LOG_REPLAY_OPS-e1321:36020-0]
handler.HLogSplitterHandler: successfully transitioned task
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
to final state ERR e1321.halxg.cloudera.com,36020,1396841215304
2014-04-06 21:08:53,370 INFO [RS_LOG_REPLAY_OPS-e1321:36020-0]
handler.HLogSplitterHandler: worker
e1321.halxg.cloudera.com,36020,1396841215304 done with task
/hbase/splitWAL/WALs%2Fe1320.halxg.cloudera.com%2C36020%2C1396838053131-splitting%2Fe1320.halxg.cloudera.com%252C36020%252C1396838053131.1396838130980
in 2515676ms
{code}
It took 42 minutes (2,515,676 ms) to fail the split.
On this retry, the replay actually worked.
I see a bunch of ERR failures over in a slave log. I wonder if it is an
unassigned region? I don't see other errors; I do see unassigned-region errors,
but they look legit.
Any input [~jeffreyz]? Thanks boss.
> Log splitting seems to hang
> ---------------------------
>
> Key: HBASE-10922
> URL: https://issues.apache.org/jira/browse/HBASE-10922
> Project: HBase
> Issue Type: Bug
> Components: wal
> Reporter: Jimmy Xiang
> Attachments: log-splitting_hang.png, master-log-grep.txt
>
>
> With distributed log replay enabled by default, I ran into an issue where log
> splitting hadn't completed after 13 hours. It seems to hang somewhere.