Hi Friso,

Also, if you can capture a jstack of the regionservers at that time, that would be great.
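(As a side note: besides running jstack from outside, a JVM can dump its own thread stacks programmatically, which yields roughly the same information. A minimal, self-contained sketch — not HBase code, purely illustrative:)

```java
import java.util.Map;

public class Main {
    // Minimal sketch: dump every live thread's stack, similar in spirit
    // to what jstack reports when attached from outside the process.
    public static void main(String[] args) {
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            System.out.println("\"" + t.getName() + "\" state=" + t.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```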
-Todd

On Wed, May 12, 2010 at 9:26 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> Friso,
>
> Unfortunately it's hard to determine the cause from the provided
> information; the client call you pasted is pretty much normal, i.e. the
> client is waiting to receive a result from a region server.
>
> The fact that you can't shut down the master when this happens is very
> concerning. Do you still have those logs around? Same for the region
> servers? Can you post them in a pastebin or on a web server?
>
> Also, feel free to come chat with us on IRC; it's always easier to
> debug when live. #hbase on freenode
>
> J-D
>
> On Wed, May 12, 2010 at 8:31 AM, Friso van Vollenhoven
> <fvanvollenho...@xebia.com> wrote:
>> Hi all,
>>
>> I am using Hadoop (0.20.2) and HBase to periodically import data (every 15
>> minutes). There are a number of import processes, but generally they all
>> create a sequence file on HDFS, which is then run through a MapReduce job.
>> The job uses the identity mapper (the input file is a Hadoop sequence
>> file) and a specialized reducer that does the following:
>> - Combine the values for a key into one value
>> - Do a Get from HBase to retrieve existing values for the same key
>> - Combine the existing value from HBase and the new one into one value again
>> - Put the final value into HBase under the same key (thus 'overwriting' the
>>   existing row; I keep only one version)
>>
>> After I upgraded HBase to the 0.20.4 release, the reducers sometimes start
>> hanging on a Get. When the jobs start, some reducers run to completion fine,
>> but after a while the last reducers start to hang. Eventually the
>> reducers are killed off by Hadoop (after 600 secs).
>>
>> I did a thread dump for one of the hanging reducers.
It looks like this:
>>
>> "main" prio=10 tid=0x0000000048083800 nid=0x4c93 in Object.wait() [0x00000000420ca000]
>>    java.lang.Thread.State: WAITING (on object monitor)
>>         at java.lang.Object.wait(Native Method)
>>         - waiting on <0x00002aaaaeb50d70> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
>>         at java.lang.Object.wait(Object.java:485)
>>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:721)
>>         - locked <0x00002aaaaeb50d70> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
>>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>         at $Proxy2.get(Unknown Source)
>>         at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:450)
>>         at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:448)
>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1050)
>>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:447)
>>         at net.ripe.inrdb.hbase.accessor.real.HBaseTableAccessor.get(HBaseTableAccessor.java:36)
>>         at net.ripe.inrdb.hbase.store.HBaseStoreUpdater.getExistingRecords(HBaseStoreUpdater.java:101)
>>         at net.ripe.inrdb.hbase.store.HBaseStoreUpdater.mergeTimelinesWithExistingRecords(HBaseStoreUpdater.java:60)
>>         at net.ripe.inrdb.hbase.store.HBaseStoreUpdater.doInsert(HBaseStoreUpdater.java:40)
>>         at net.ripe.inrdb.core.store.SinglePartitionStore$Updater.insert(SinglePartitionStore.java:92)
>>         at net.ripe.inrdb.core.store.CompositeStore$CompositeStoreUpdater.insert(CompositeStore.java:142)
>>         at net.ripe.inrdb.importer.StoreInsertReducer.reduce(StoreInsertReducer.java:70)
>>         at net.ripe.inrdb.importer.StoreInsertReducer.reduce(StoreInsertReducer.java:17)
>>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
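(The HBaseStoreUpdater frames in the trace correspond to the read-merge-write cycle the reducer performs: Get the existing row, merge it with the new value, Put the result back under the same key. A rough sketch of that cycle, using an in-memory map in place of the HTable — all names here are illustrative, not the actual net.ripe.inrdb code:)

```java
import java.util.HashMap;
import java.util.Map;

public class Main {
    // In-memory stand-in for the HBase table (purely illustrative).
    static final Map<String, String> table = new HashMap<>();

    // Combine an existing value with an incoming one; in the real job this
    // is the timeline-merging logic in HBaseStoreUpdater.
    static String merge(String existing, String incoming) {
        return existing == null ? incoming : existing + "," + incoming;
    }

    // The reducer's per-key cycle: Get, merge, Put back under the same key,
    // overwriting the single stored version.
    static void upsert(String key, String newValue) {
        String existing = table.get(key);          // stands in for HTable.get
        table.put(key, merge(existing, newValue)); // stands in for HTable.put
    }

    public static void main(String[] args) {
        upsert("row1", "a");
        upsert("row1", "b");
        System.out.println(table.get("row1"));
    }
}
```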
>>
>> So the client hangs in a wait() call, waiting on an HBaseClient$Call object.
>> I looked at the code. The wait is in a while() loop and has no timeout, so
>> it follows that it never gets out of there if notify() is never called on the
>> object. I am not sure exactly what condition it is waiting for, however.
>>
>> Meanwhile, after this has happened, I cannot shut down the master server
>> normally. I have to kill -9 it to make it shut down. Normally, before this
>> problem occurs, the master server shuts down just fine. (Sorry, I didn't
>> do a thread dump of the master, and I have now downgraded to 0.20.3 again.)
>>
>> I cannot reproduce this error on my local setup (developer machine). It only
>> occurs on our (currently modest) cluster of one machine running
>> master+NN+ZooKeeper and four datanodes, which are all task trackers and
>> region servers as well. The inputs to the periodic MapReduce jobs are very
>> small (ranging from some KB to several MB) and thus do not contain many
>> records. I know this is not very efficient to do in MapReduce and would be
>> faster if inserted in-process by the importer, because of startup
>> overhead, but we are setting up this architecture of importers and insertion
>> for anticipated larger loads (up to 80 million records per day).
>>
>> Does anyone have a clue about what is happening? Or where to look for further
>> investigation?
>>
>> Thanks a lot!
>>
>>
>> Cheers,
>> Friso
>>
>>

--
Todd Lipcon
Software Engineer, Cloudera
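(A guarded wait without a timeout, like the one described in HBaseClient.call() above, blocks forever if the corresponding notify() is lost. A hedged sketch of the same pattern with a deadline added — illustrative code only, not HBase's actual implementation:)

```java
public class Main {
    static final Object lock = new Object();
    static volatile boolean done = false;

    // Guarded wait with a deadline: re-check the condition in a loop, but
    // bound each wait() so a missed notify() cannot hang the caller forever.
    static boolean awaitDone(long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (lock) {
            while (!done) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false; // timed out; condition never became true
                }
                lock.wait(remaining);
            }
            return true;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // No thread ever calls notify() here; with a timeout we still return.
        boolean finished = awaitDone(100);
        System.out.println(finished ? "completed" : "timed out");
    }
}
```

With a plain `lock.wait()` (no argument), the same scenario would block indefinitely, which matches the hang observed in the reducer's stack trace.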