[ https://issues.apache.org/jira/browse/HADOOP-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HADOOP-1523:
--------------------------

    Attachment: locks-v2.patch

Here is the commit message to go along with this patch:

HADOOP-1523 'Hung region servers waiting on write locks'

On shutdown, region servers and masters were just cancelling leases
without letting the 'lease expired' code run -- the code that cleans up
outstanding locks in the region server. Outstanding read locks were
getting in the way of the region server obtaining the write locks it
needs for the shutdown process.

Also cleaned up messaging around shutdown so it's clean -- no timeout
messages as region servers try to talk to a master that has already
shut down -- even when region servers take their time going down.

M src/contrib/hbase/conf/hbase-default.xml
    Make region server timeout 30 seconds instead of 3 minutes.
    Clients retry anyway. Make it likely that region servers report in
    with their shutdown message before their lease expires on the master.
M src/contrib/hbase/src/java/org/apache/hadoop/hbase/Leases.java
    (closeAfterLeasesExpire): Added.
* src/contrib/hbase/src/java/org/apache/hadoop/hbase/HRegionServer.java
    Added comments.
    (stop): Converted from public to default access (master shuts
    down regionservers).
    (run): Use leases.closeAfterLeasesExpire instead of leases.close.
    Changed log of main thread exit from DEBUG to INFO.
* src/contrib/hbase/src/java/org/apache/hadoop/hbase/HMaster.java
    (letRegionsServersShutdown): Add better explanation of shutdown
    process to method doc. Changed timeout waits from
    hbase.regionserver.msginterval to threadWakeFrequency.
    (regionServerReport): If closing, we used to respond immediately to
    the region server with a MSG_REGIONSERVER_STOP. This meant that we
    skipped handling of the region server's MSG_REPORT_EXITING sent on
    shutdown, so region servers had no chance to cancel their lease in
    the master. Reordered: moved sending of MSG_REGIONSERVER_STOP to
    after handling of MSG_REPORT_EXITING.
    Also, in handling of MSG_REGIONSERVER_STOP, removed cancelling of leases.
    Let leases expire normally (or get cancelled when the region server
    comes in with MSG_REPORT_EXITING).
* src/contrib/hbase/src/java/org/apache/hadoop/hbase/HMsg.java
    (MSG_REGIONSERVER_STOP_IN_ARRAY): Added.

> Hung region servers waiting on write locks
> ------------------------------------------
>
>                 Key: HADOOP-1523
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1523
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/hbase
>            Reporter: stack
>            Assignee: stack
>         Attachments: locks-v2.patch
>
>
> A couple of times this afternoon I've been able to manufacture a hung region
> server, variously stuck trying to obtain either a write lock on memcache or a
> row lock on HRegion. The lease expiration must not be working properly
> (shutting down all open scanners). Maybe locks should be expiring.
> {code}
> "IPC Server handler 2 on 60010" daemon prio=5 tid=0x005167f0 nid=0x189d000 in Object.wait() [0xb1397000..0xb1397d10]
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x0b316ba8> (a java.util.HashMap)
>         at java.lang.Object.wait(Object.java:474)
>         at org.apache.hadoop.hbase.HRegion.obtainRowLock(HRegion.java:1211)
>         - locked <0x0b316ba8> (a java.util.HashMap)
>         at org.apache.hadoop.hbase.HRegion.startUpdate(HRegion.java:1020)
>         at org.apache.hadoop.hbase.HRegionServer.startUpdate(HRegionServer.java:1007)
>         at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:585)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:340)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:566)
> "IPC Server handler 1 on 60010" daemon prio=5 tid=0x005163f0 nid=0x189cc00 in Object.wait() [0xb1316000..0xb1316d10]
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x0b317148> (a java.lang.Integer)
>         at java.lang.Object.wait(Object.java:474)
>         at org.apache.hadoop.hbase.HLocking.obtainWriteLock(HLocking.java:82)
>         - locked <0x0b317148> (a java.lang.Integer)
>         at org.apache.hadoop.hbase.HMemcache.add(HMemcache.java:153)
>         at org.apache.hadoop.hbase.HRegion.commit(HRegion.java:1144)
>         - locked <0x0b398080> (a org.apache.hadoop.io.Text)
>         at org.apache.hadoop.hbase.HRegionServer.commit(HRegionServer.java:1071)
>         at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:585)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:340)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:566)
> {code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
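
For readers following the fix: the commit message says that cancelling leases outright skipped the 'lease expired' cleanup, leaving read locks in place and blocking the write locks needed for shutdown. The following is a minimal sketch of that failure mode, not the patch itself. ShutdownLockDemo, LeaseTableSketch, and writeLockFreeAfterShutdown are hypothetical names; java.util.concurrent's ReentrantReadWriteLock stands in for HBase's HLocking; and the sketch fires the expiry handlers immediately, whereas a real lease table would presumably wait for each lease to time out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class ShutdownLockDemo {

    /** Hypothetical stand-in for a lease table; this is not the HBase Leases class. */
    static final class LeaseTableSketch {
        private final Map<String, Runnable> expiryHandlers = new LinkedHashMap<>();

        /** Register a lease together with the cleanup to run when it expires. */
        void createLease(String name, Runnable onExpired) {
            expiryHandlers.put(name, onExpired);
        }

        /** Cancel every lease outright: the expiry handlers never run. */
        void close() {
            expiryHandlers.clear();
        }

        /**
         * Let every outstanding lease expire, running its handler, then close.
         * Assumption: handlers fire immediately here instead of after a timeout.
         */
        void closeAfterLeasesExpire() {
            for (Runnable handler : expiryHandlers.values()) {
                handler.run();
            }
            expiryHandlers.clear();
        }
    }

    /** Returns true if the write lock can be taken once the lease table is shut down. */
    static boolean writeLockFreeAfterShutdown(boolean letLeasesExpire) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        LeaseTableSketch leases = new LeaseTableSketch();

        // An open scanner holds a read lock; only its lease handler releases it.
        lock.readLock().lock();
        leases.createLease("scanner-1", () -> lock.readLock().unlock());

        if (letLeasesExpire) {
            leases.closeAfterLeasesExpire();
        } else {
            leases.close();
        }

        // Shutdown needs the write lock; tryLock fails while any read lock is held.
        boolean acquired = lock.writeLock().tryLock();
        if (acquired) {
            lock.writeLock().unlock();
        }
        return acquired;
    }

    public static void main(String[] args) {
        System.out.println("close(): " + writeLockFreeAfterShutdown(false));
        System.out.println("closeAfterLeasesExpire(): " + writeLockFreeAfterShutdown(true));
    }
}
```

With close() the scanner's read lock is never released, so the write lock attempt fails; with closeAfterLeasesExpire() the handler releases it first and the write lock is available.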