Hi, I was kind of hoping that this was a known thing and I was just overlooking something. Apparently it requires more investigation.
I reproduced the situation and gathered some more info. I still have to look through everything myself a bit more (today is a national holiday in The Netherlands, so not much work is getting done), but I posted some info to pastebin.

I did the following sequence (with HBase 0.20.4):
- start up HBase (waited for all the regions to come online and let it settle)
- start up our application
- wait for the importer job to hang (it only happened on the second run, which started 15 reducers; the first run was really small and only one key was generated, so just one reducer)
- kill the hanging importer job (hadoop job -kill)
- try to shut down HBase (as I type this it is still producing dots on my console)

The HBase master logs are here (includes the shutdown attempt): http://pastebin.com/PYpPVcyK

The jstacks are here:
- HMaster: http://pastebin.com/Da6jCAuA (this includes two thread dumps, one during operation with the hanging clients and one during the hanging shutdown)
- RegionServer 1: http://pastebin.com/5dQXfxCn
- RegionServer 2: http://pastebin.com/XWwBGXYC
- RegionServer 3: http://pastebin.com/mDgWbYGV
- RegionServer 4: http://pastebin.com/XDR14bth

As you can see in the master logs, the shutdown cannot get a thread called Thread-10 to stop running. The trace for that thread looks like this:

"Thread-10" prio=10 tid=0x000000004d218800 nid=0x1e73 in Object.wait() [0x00000000427a7000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00002aaab364c9d0> (a java.lang.Object)
        at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:89)
        - locked <0x00002aaab364c9d0> (a java.lang.Object)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:76)

I still have no clue what happened, but I will investigate a bit more tomorrow.

Thanks for the responses.


Friso


On May 12, 2010, at 9:02 PM, Todd Lipcon wrote:

> Hi Friso,
>
> Also, if you can capture a jstack of the regionservers at the time
> this happens, that would be great.
>
> -Todd
>
> On Wed, May 12, 2010 at 9:26 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>> Friso,
>>
>> Unfortunately it's hard to determine the cause with the provided
>> information; the client call you pasted is pretty much normal, i.e. the
>> client is waiting to receive a result from a region server.
>>
>> The fact that you can't shut down the master when this happens is very
>> concerning. Do you still have those logs around? Same for the region
>> servers? Can you post them in pastebin or on a web server?
>>
>> Also, feel free to come chat with us on IRC; it's always easier to
>> debug live. #hbase on freenode
>>
>> J-D
>>
>> On Wed, May 12, 2010 at 8:31 AM, Friso van Vollenhoven
>> <fvanvollenho...@xebia.com> wrote:
>>> Hi all,
>>>
>>> I am using Hadoop (0.20.2) and HBase to periodically import data (every 15
>>> minutes). There are a number of import processes, but generally they all
>>> create a sequence file on HDFS, which is then run through a MapReduce job.
>>> The MapReduce job uses the identity mapper (the input file is a Hadoop
>>> sequence file) and a specialized reducer that does the following:
>>> - Combine the values for a key into one value
>>> - Do a Get from HBase to retrieve existing values for the same key
>>> - Combine the existing value from HBase and the new one into one value again
>>> - Put the final value into HBase under the same key (thus 'overwriting' the
>>> existing row; I keep only one version)
>>>
>>> After I upgraded HBase to the 0.20.4 release, the reducers sometimes start
>>> hanging on a Get.
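
(The reduce step described in the quoted message above -- combine the new values for a key, Get the existing row, merge, Put the result back -- boils down to something like the minimal sketch below. This is not the actual importer code: the table name, column family/qualifier, writable types and the merge() helper are made up for illustration, using the HBase 0.20 client API.)

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative only: table/column names and the merge logic are invented.
public class MergingInsertReducer
    extends Reducer<ImmutableBytesWritable, BytesWritable, NullWritable, NullWritable> {

  private static final byte[] CF = Bytes.toBytes("data");     // hypothetical column family
  private static final byte[] QUAL = Bytes.toBytes("value");  // hypothetical qualifier

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // reads hbase-site.xml from the task classpath; "records" is a hypothetical table name
    table = new HTable(new HBaseConfiguration(), "records");
  }

  @Override
  protected void reduce(ImmutableBytesWritable key, Iterable<BytesWritable> values,
      Context context) throws IOException, InterruptedException {
    // 1. combine all new values for this key into one value
    byte[] combined = null;
    for (BytesWritable v : values) {
      combined = merge(combined, Arrays.copyOf(v.getBytes(), v.getLength()));
    }
    // 2. Get the existing row for the same key (this is the call that hangs)
    Result existing = table.get(new Get(key.get()));
    // 3. combine the existing value (if any) with the new one
    byte[] merged = merge(existing.getValue(CF, QUAL), combined);
    // 4. Put the result back under the same key, overwriting the single stored version
    Put put = new Put(key.get());
    put.add(CF, QUAL, merged);
    table.put(put);
  }

  // placeholder for the application-specific merge of two serialized values
  private byte[] merge(byte[] existing, byte[] update) {
    return existing == null ? update : update;
  }
}
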
>>> When the jobs start, some reducers run to completion
>>> fine, but after a while the last reducers start to hang. Eventually
>>> the reducers are killed off by Hadoop (after 600 secs).
>>>
>>> I did a thread dump for one of the hanging reducers. It looks like this:
>>>
>>> "main" prio=10 tid=0x0000000048083800 nid=0x4c93 in Object.wait() [0x00000000420ca000]
>>>    java.lang.Thread.State: WAITING (on object monitor)
>>>         at java.lang.Object.wait(Native Method)
>>>         - waiting on <0x00002aaaaeb50d70> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
>>>         at java.lang.Object.wait(Object.java:485)
>>>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:721)
>>>         - locked <0x00002aaaaeb50d70> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
>>>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>         at $Proxy2.get(Unknown Source)
>>>         at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:450)
>>>         at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:448)
>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1050)
>>>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:447)
>>>         at net.ripe.inrdb.hbase.accessor.real.HBaseTableAccessor.get(HBaseTableAccessor.java:36)
>>>         at net.ripe.inrdb.hbase.store.HBaseStoreUpdater.getExistingRecords(HBaseStoreUpdater.java:101)
>>>         at net.ripe.inrdb.hbase.store.HBaseStoreUpdater.mergeTimelinesWithExistingRecords(HBaseStoreUpdater.java:60)
>>>         at net.ripe.inrdb.hbase.store.HBaseStoreUpdater.doInsert(HBaseStoreUpdater.java:40)
>>>         at net.ripe.inrdb.core.store.SinglePartitionStore$Updater.insert(SinglePartitionStore.java:92)
>>>         at net.ripe.inrdb.core.store.CompositeStore$CompositeStoreUpdater.insert(CompositeStore.java:142)
>>>         at net.ripe.inrdb.importer.StoreInsertReducer.reduce(StoreInsertReducer.java:70)
>>>         at net.ripe.inrdb.importer.StoreInsertReducer.reduce(StoreInsertReducer.java:17)
>>>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>>         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>>>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>>>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>
>>> So the client hangs in a wait() call, waiting on an HBaseClient$Call object.
>>> I looked at the code. The wait is in a while() loop and has no timeout, so
>>> it will never get out of there if notify() is never called on
>>> the object. I am not sure exactly what condition it is waiting for, however.
>>>
>>> Meanwhile, after this has happened, I cannot shut down the master server
>>> normally. I have to kill -9 it to make it shut down. Normally, before
>>> this problem occurs, the master server shuts down just fine. (Sorry, I didn't
>>> do a thread dump of the master, and I have now downgraded to 0.20.3 again.)
>>>
>>> I cannot reproduce this error on my local setup (developer machine). It
>>> only occurs on our (currently modest) cluster of one machine running
>>> master+NN+ZooKeeper and four datanodes, which are all task trackers and
>>> region servers as well. The inputs to the periodic MapReduce jobs are very
>>> small (ranging from a few KB to several MB) and thus do not contain very many
>>> records.
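
(As an aside on the wait() behaviour described above: HBaseClient.call() parks the calling thread on the Call object until the connection's reader thread marks the call done and notifies it, and that loop has no timeout. The self-contained snippet below only illustrates the pattern -- the class and method names are invented and it is not the 0.20.4 source. If the notification never arrives and no error is set on the call, the caller blocks forever, which is consistent with the reducer stack trace above.)

// Illustration of the wait/notify pattern (invented names, not the HBase source).
public class CallWaitDemo {

  static class Call {
    private boolean done;
    private Object value;

    // invoked by the connection's reader thread when a response (or error) arrives
    synchronized void callComplete(Object result) {
      this.value = result;
      this.done = true;
      notify();                          // wakes up the waiting caller
    }

    // invoked by the thread issuing the RPC; the reducer's "main" thread parks here
    synchronized Object waitForCompletion() {
      while (!done) {                    // no timeout on this loop
        try {
          wait();                        // released only by a notify() on this object
        } catch (InterruptedException ie) {
          // interrupts are swallowed; keep waiting
        }
      }
      return value;
    }
  }

  public static void main(String[] args) {
    final Call call = new Call();

    // stand-in for the reader thread delivering the RPC response after one second
    new Thread(new Runnable() {
      public void run() {
        try {
          Thread.sleep(1000);
        } catch (InterruptedException ignored) {
        }
        call.callComplete("row contents");
      }
    }).start();

    // if callComplete() were never invoked (e.g. the connection just goes quiet),
    // this call would block indefinitely
    System.out.println(call.waitForCompletion());
  }
}
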
>>> I know this is not very efficient to do in MapReduce and that, because of job
>>> startup overhead, it would be faster to insert in-process from the importer,
>>> but we are setting up this architecture of importers and
>>> insertion for anticipated larger loads (up to 80 million records per day).
>>>
>>> Does anyone have a clue about what is happening? Or where to look for further
>>> investigation?
>>>
>>> Thanks a lot!
>>>
>>>
>>> Cheers,
>>> Friso
>>>
>>>
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera