[
https://issues.apache.org/jira/browse/HBASE-495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576014#action_12576014
]
stack commented on HBASE-495:
-----------------------------
Here is the story that I have so far.
RegionServer gets hung on DFS ('Call queue overflow discarding oldest call
batchUpdate').
Michael B notices it and shuts down the regionserver (44.221).
The server is restarted.
It tries to check in w/ the master but the master says the lease still exists.
HRS has no pause facility so, in a tight loop, it writes 400k lines in 15 seconds
about the master saying the lease exists when it tries to check in w/ master
(HBASE-496).
Eventually the old HRS lease expires.
Master gives new HRS a region.
HRS tries to deploy the region. Skips 2M lines' worth of edits (HBASE-472).
Region eventually opens.
Master gives the HRS more regions to open.
Meanwhile, the region w/ all the skipped edits tries to do a compaction and runs
into a DFS issue: NotReplicatedYetException. The compaction is aborted.
Others of the new regions try to compact and fail again in DFS. Here is what
the failures look like:
{code}
2464394 2008-03-06 01:13:59,299 WARN org.apache.hadoop.fs.DFSClient:
NotReplicatedYetException sleeping
/hbase/aa0-005-2.u.powerset.com/enwiki_080103/compaction.dir/123835725/page/mapfiles/7474986258048984189/data
retries left 2
2464395 2008-03-06 01:14:00,902 INFO org.apache.hadoop.fs.DFSClient:
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.dfs.LeaseExpiredException: No lease on
/hbase/aa0-005-2.u.powerset.com/enwiki_080103/compaction.dir/123835725/page/mapfiles/7474986258048984189/data
2464396 at
org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1157)
2464397 at
org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1095)
2464398 at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:310)
2464399 at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
2464400 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
Source)
2464401 at java.lang.reflect.Method.invoke(Unknown Source)
2464402 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
2464403 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:910)
2464404
2464405 at org.apache.hadoop.ipc.Client.call(Client.java:512)
2464406 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198)
2464407 at org.apache.hadoop.dfs.$Proxy1.addBlock(Unknown Source)
2464408 at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
2464409 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
Source)
2464410 at java.lang.reflect.Method.invoke(Unknown Source)
2464411 at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
2464412 at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
2464413 at org.apache.hadoop.dfs.$Proxy1.addBlock(Unknown Source)
2464414 at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2065)
2464415 at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:1958)
2464416 at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:1479)
2464417 at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1593)
2464418
2464419 2008-03-06 01:14:01,029 WARN org.apache.hadoop.fs.DFSClient:
NotReplicatedYetException sleeping
/hbase/aa0-005-2.u.powerset.com/enwiki_080103/compaction.dir/123835725/page/mapfiles/7474986258048984189/data
retries left 1
2464420 2008-03-06 01:14:04,231 WARN org.apache.hadoop.fs.DFSClient:
DataStreamer Exception: org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.dfs.LeaseExpiredException: No lease on
/hbase/aa0-005-2.u.powerset.com/enwiki_080103/compaction.dir/123835725/page/mapfiles/7474986258048984189/data
2464421 at
org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1157)
2464422 at
org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1095)
2464423 at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:310)
2464424 at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
2464425 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
Source)
2464426 at java.lang.reflect.Method.invoke(Unknown Source)
2464427 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
2464428 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:910)
2464429
2464430 2008-03-06 01:14:04,231 WARN org.apache.hadoop.fs.DFSClient: Error
Recovery for block blk_1794752555243844791 bad datanode[0]
2464431 2008-03-06 01:14:04,232 ERROR org.apache.hadoop.hbase.HRegionServer:
Compaction failed for region
enwiki_080103,g80Fi5WZHlzLqGzErrAd7V==,1204766010394
2464432 java.io.IOException: Could not get block locations. Aborting...
2464433 at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1824)
2464434 at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1479)
2464435 at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1571)
{code}
The master sends more regions to open. Now the open messages are repeatedly
for the same region... here is an illustration:
{code}
...
2464453 2008-03-06 01:16:41,090 INFO org.apache.hadoop.hbase.HRegionServer:
MSG_REGION_OPEN : enwiki_080103,cD-17MphmZfwXnZVdtKy1k==,1199852162634
2464454 2008-03-06 01:28:32,670 INFO org.apache.hadoop.hbase.HRegionServer:
MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
2464455 2008-03-06 01:29:29,718 INFO org.apache.hadoop.hbase.HRegionServer:
MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
2464456 2008-03-06 01:29:35,722 INFO org.apache.hadoop.hbase.HRegionServer:
MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
2464457 2008-03-06 01:29:35,722 INFO org.apache.hadoop.hbase.HRegionServer:
MSG_REGION_OPEN : enwiki_080103,g80Fi5WZHlzLqGzErrAd7V==,1204766010394
2464458 2008-03-06 01:29:41,728 INFO org.apache.hadoop.hbase.HRegionServer:
MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
2464459 2008-03-06 01:29:47,734 INFO org.apache.hadoop.hbase.HRegionServer:
MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
2464460 2008-03-06 01:29:53,740 INFO org.apache.hadoop.hbase.HRegionServer:
MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
2464461 2008-03-06 01:29:59,746 INFO org.apache.hadoop.hbase.HRegionServer:
MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
2464462 2008-03-06 01:30:05,752 INFO org.apache.hadoop.hbase.HRegionServer:
MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
...
{code}
The regionserver should shut itself down if it's failing to open a region because
of DFS issues -- if it can recognize them as that.
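A rough sketch of what such a check could look like, not actual HRegionServer code. The class OpenFailurePolicy, its method names, and the idea of counting consecutive failures against a threshold are all assumptions made for illustration; how to recognize a DFS failure is left to a caller-supplied predicate because that is the open question here.

```java
import java.util.function.Predicate;

// Hypothetical helper, not part of HBase: decides when repeated DFS-looking
// failures during region open/compaction should make the server shut down.
class OpenFailurePolicy {
    // Caller-supplied test for "this looks like a DFS problem" (an assumption).
    private final Predicate<Throwable> looksLikeDfsFailure;
    // How many consecutive DFS-looking failures are tolerated before giving up.
    private final int maxConsecutive;
    private int consecutive = 0;

    OpenFailurePolicy(Predicate<Throwable> looksLikeDfsFailure, int maxConsecutive) {
        this.looksLikeDfsFailure = looksLikeDfsFailure;
        this.maxConsecutive = maxConsecutive;
    }

    // Records a failure; returns true when the server should shut itself down.
    // Failures that do not look like DFS problems are not counted.
    synchronized boolean onFailure(Throwable error) {
        // Walk the cause chain, since the DFS error may be wrapped.
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (looksLikeDfsFailure.test(t)) {
                consecutive++;
                return consecutive >= maxConsecutive;
            }
        }
        return false;
    }

    // A successful open or compaction resets the count.
    synchronized void onSuccess() {
        consecutive = 0;
    }
}
```

The caller would invoke onFailure from wherever the open or compaction exception is caught and start a shutdown when it returns true. The threshold keeps one transient DFS hiccup from taking the whole server down.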
Meantime, the master is stuck in the shutdown loop:
{code}
...
4981501 2008-03-06 01:28:00,016 DEBUG org.apache.hadoop.hbase.HMaster: process
server shutdown scanning root region on XX.XX.XX.92 finished HMaster
4981502 2008-03-06 01:28:00,016 DEBUG org.apache.hadoop.hbase.HMaster:
numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
4981503 2008-03-06 01:28:00,016 DEBUG org.apache.hadoop.hbase.HMaster: process
server shutdown scanning .META.,,1 on XX.XX.XX.96:60020 HMaster
4981504 2008-03-06 01:28:00,021 DEBUG org.apache.hadoop.hbase.HMaster: shutdown
scanner looking at enwiki_071018,,1199837878882
4981505 2008-03-06 01:28:00,021 DEBUG org.apache.hadoop.hbase.HMaster: Server
name XX.XX.XX.226:60020 is not same as XX.XX.XX.221:60020: Passing
...
{code}
The above goes on for 10M lines over about 30 minutes. The problem is this bit
of code in regionServerStartup:
{code}
HServerInfo storedInfo = serversToServerInfo.remove(s);
if (storedInfo != null && !closed.get()) {
  // The startup message was from a known server with the same name.
  // Timeout the old one right away.
  HServerAddress root = rootRegionLocation.get();
  if (root != null && root.equals(storedInfo.getServerAddress())) {
    unassignRootRegion();
  }
  delayedToDoQueue.put(new ProcessServerShutdown(storedInfo));
}
{code}
We shouldn't put a ProcessServerShutdown on the queue if the server already has
a shutdown queued.
OK. Two fixes are needed for this issue (at least): regionservers should shut
down if there are DFS problems, and the master shouldn't queue a shutdown if one
is already queued.
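To illustrate the second fix, here is a minimal sketch of a guard that refuses to queue a second shutdown for the same server. It is not HMaster code: the class ShutdownQueueGuard and its methods are hypothetical, and plain server-name strings stand in for the ProcessServerShutdown objects that go on delayedToDoQueue.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for the master's shutdown bookkeeping.
class ShutdownQueueGuard {
    // Names of servers that already have a shutdown waiting to be processed.
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    // Stand-in for delayedToDoQueue; holds server names only.
    private final BlockingQueue<String> toDoQueue = new LinkedBlockingQueue<>();

    // Queues a shutdown for serverName unless one is already pending.
    // Returns true only if a new entry was queued.
    boolean queueShutdown(String serverName) {
        if (!pending.add(serverName)) {
            return false; // a shutdown is already queued for this server
        }
        toDoQueue.add(serverName);
        return true;
    }

    // Takes the next queued shutdown, or returns null if the queue is empty.
    String pollNext() {
        return toDoQueue.poll();
    }

    // Called once shutdown processing for serverName has finished, so that a
    // later restart of the same server can queue a fresh shutdown.
    void shutdownProcessed(String serverName) {
        pending.remove(serverName);
    }

    int queuedCount() {
        return toDoQueue.size();
    }
}
```

With a guard like this, repeated startup messages from the same restarted server would add at most one shutdown to the queue until that shutdown has been processed.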
> No server address listed in .META.
> ----------------------------------
>
> Key: HBASE-495
> URL: https://issues.apache.org/jira/browse/HBASE-495
> Project: Hadoop HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.16.0
> Reporter: stack
> Fix For: 0.1.0, 0.2.0
>
>
> Michael Bieniosek manufactured the following in a 0.16.0 install:
> {code}
> 08/03/06 17:52:02 DEBUG hbase.HTable: Advancing internal scanner to startKey
> g80Fi5WZHlzLqGzErrAd7V==
> 08/03/06 17:52:02 DEBUG hbase.HConnectionManager$TableServers: reloading
> table servers because: No server address listed in .META. for region
> enwiki_080103,g80Fi5WZHlzLqGzErrAd7V==,1204768636421
> 08/03/06 17:52:12 DEBUG hbase.HConnectionManager$TableServers: reloading
> table servers because: No server address listed in .META. for region
> enwiki_080103,g80Fi5WZHlzLqGzErrAd7V==,1204768636421
> 08/03/06 17:52:22 DEBUG hbase.HConnectionManager$TableServers: reloading
> table servers because: No server address listed in .META. for region
> enwiki_080103,g80Fi5WZHlzLqGzErrAd7V==,1204768636421
> org.apache.hadoop.hbase.NoServerForRegionException: No server address listed
> in .META. for region enwiki_080103,g80Fi5WZHlzLqGzErrAd7V==,1204768636421
> at
> org.apache.hadoop.hbase.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:449)
> at
> org.apache.hadoop.hbase.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:346)
> at
> org.apache.hadoop.hbase.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:309)
> at org.apache.hadoop.hbase.HTable.getRegionLocation(HTable.java:103)
> at
> org.apache.hadoop.hbase.HTable$ClientScanner.nextScanner(HTable.java:854)
> at org.apache.hadoop.hbase.HTable$ClientScanner.next(HTable.java:915)
> at
> org.apache.hadoop.hbase.hql.SelectCommand.scanPrint(SelectCommand.java:233)
> at
> org.apache.hadoop.hbase.hql.SelectCommand.execute(SelectCommand.java:100)
> at
> org.apache.hadoop.hbase.hql.HQLClient.executeQuery(HQLClient.java:50)
> at org.apache.hadoop.hbase.Shell.main(Shell.java:114)
> {code}
> When I look in the .META., I see that the above region range has multiple
> mentions... : one offlined, two that have startcodes and servers associated
> and about 5 others that are just HRIs. Table is broke. At least need the
> merge of overlapping regions tool to fix. Digging more....