Hi Friso,

Also, if you could capture a jstack of the regionservers at the time,
that would be great.

-Todd

On Wed, May 12, 2010 at 9:26 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> Friso,
>
> Unfortunately it's hard to determine the cause from the provided
> information; the client call you pasted is pretty much normal, i.e. the
> client is waiting to receive a result from a region server.
>
> The fact that you can't shut down the master when this happens is very
> concerning. Do you still have those logs around? Same for the region
> servers? Could you post them on pastebin or a web server?
>
> Also, feel free to come chat with us on IRC; it's always easier to
> debug these things live. We're in #hbase on Freenode.
>
> J-D
>
> On Wed, May 12, 2010 at 8:31 AM, Friso van Vollenhoven
> <fvanvollenho...@xebia.com> wrote:
>> Hi all,
>>
>> I am using Hadoop (0.20.2) and HBase to periodically import data (every 15 
>> minutes). There are a number of import processes, but generally they all 
>> create a sequence file on HDFS, which is then run through a MapReduce job. 
>> The job uses the identity mapper (the input file is a Hadoop sequence
>> file) and a specialized reducer that does the following:
>> - Combine the values for a key into one value
>> - Do a Get from HBase to retrieve existing values for the same key
>> - Combine the existing value from HBase and the new one into one value again
>> - Put the final value into HBase under the same key, thus overwriting the
>> existing row (I keep only one version). A rough sketch of this flow follows below.
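>>
>> In Java terms, the reduce step looks roughly like the sketch below. This is a
>> simplified illustration only: the table, family and qualifier names and the
>> combine/merge helpers are placeholders, not our real code.
>>
>> import java.io.IOException;
>> import java.util.Arrays;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.client.Get;
>> import org.apache.hadoop.hbase.client.HTable;
>> import org.apache.hadoop.hbase.client.Put;
>> import org.apache.hadoop.hbase.client.Result;
>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>> import org.apache.hadoop.hbase.util.Bytes;
>> import org.apache.hadoop.io.BytesWritable;
>> import org.apache.hadoop.io.NullWritable;
>> import org.apache.hadoop.mapreduce.Reducer;
>>
>> // Illustrative sketch of the reducer; the names here are made up for the example.
>> public class MergingReducer
>>     extends Reducer<ImmutableBytesWritable, BytesWritable, NullWritable, NullWritable> {
>>
>>   private static final byte[] FAMILY = Bytes.toBytes("d");
>>   private static final byte[] QUALIFIER = Bytes.toBytes("v");
>>   private HTable table;
>>
>>   protected void setup(Context context) throws IOException {
>>     table = new HTable(new HBaseConfiguration(), "records");
>>   }
>>
>>   protected void reduce(ImmutableBytesWritable key, Iterable<BytesWritable> values,
>>       Context context) throws IOException, InterruptedException {
>>     byte[] row = key.get();
>>
>>     // 1. Combine all new values for this key into one value.
>>     byte[] combined = combine(values);
>>
>>     // 2. Get the existing value for the same row key (this is the Get that hangs).
>>     Result existing = table.get(new Get(row));
>>     byte[] current = existing.getValue(FAMILY, QUALIFIER);
>>
>>     // 3. Merge the existing value (may be null) with the combined new value.
>>     byte[] merged = merge(current, combined);
>>
>>     // 4. Put the result back under the same row key; since we keep only one
>>     //    version, this effectively overwrites the existing row.
>>     Put put = new Put(row);
>>     put.add(FAMILY, QUALIFIER, merged);
>>     table.put(put);
>>   }
>>
>>   private byte[] combine(Iterable<BytesWritable> values) {
>>     byte[] last = new byte[0];
>>     for (BytesWritable v : values) {
>>       last = Arrays.copyOf(v.getBytes(), v.getLength());
>>     }
>>     return last;  // the real combining logic is domain-specific
>>   }
>>
>>   private byte[] merge(byte[] current, byte[] incoming) {
>>     return incoming;  // the real merge of current + incoming is domain-specific
>>   }
>> }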
>>
>> After I upgraded HBase to the 0.20.4 release, the reducers sometimes start
>> hanging on a Get. When the jobs start, some reducers run to completion just
>> fine, but after a while the remaining reducers start to hang. Eventually the
>> reducers are killed off by Hadoop (after 600 secs).
>>
>> I did a thread dump for one of the hanging reducers. It looks like this:
>> "main" prio=10 tid=0x0000000048083800 nid=0x4c93 in Object.wait() 
>> [0x00000000420ca000]
>>   java.lang.Thread.State: WAITING (on object monitor)
>>        at java.lang.Object.wait(Native Method)
>>        - waiting on <0x00002aaaaeb50d70> (a 
>> org.apache.hadoop.hbase.ipc.HBaseClient$Call)
>>        at java.lang.Object.wait(Object.java:485)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:721)
>>        - locked <0x00002aaaaeb50d70> (a 
>> org.apache.hadoop.hbase.ipc.HBaseClient$Call)
>>        at 
>> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>        at $Proxy2.get(Unknown Source)
>>        at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:450)
>>        at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:448)
>>        at 
>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1050)
>>        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:447)
>>        at 
>> net.ripe.inrdb.hbase.accessor.real.HBaseTableAccessor.get(HBaseTableAccessor.java:36)
>>        at 
>> net.ripe.inrdb.hbase.store.HBaseStoreUpdater.getExistingRecords(HBaseStoreUpdater.java:101)
>>        at 
>> net.ripe.inrdb.hbase.store.HBaseStoreUpdater.mergeTimelinesWithExistingRecords(HBaseStoreUpdater.java:60)
>>        at 
>> net.ripe.inrdb.hbase.store.HBaseStoreUpdater.doInsert(HBaseStoreUpdater.java:40)
>>        at 
>> net.ripe.inrdb.core.store.SinglePartitionStore$Updater.insert(SinglePartitionStore.java:92)
>>        at 
>> net.ripe.inrdb.core.store.CompositeStore$CompositeStoreUpdater.insert(CompositeStore.java:142)
>>        at 
>> net.ripe.inrdb.importer.StoreInsertReducer.reduce(StoreInsertReducer.java:70)
>>        at 
>> net.ripe.inrdb.importer.StoreInsertReducer.reduce(StoreInsertReducer.java:17)
>>        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>        at 
>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> So the client hangs in a wait() call, waiting on an HBaseClient$Call object.
>> I looked at the code: the wait is inside a while() loop and has no timeout, so
>> it follows that it never gets out of there if notify() is never called on that
>> object. I am not sure exactly what condition it is waiting for, though.
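>>
>> To illustrate the pattern I mean, here is a toy example (not the actual HBase
>> source): a wait() inside a while loop with no timeout blocks forever unless
>> another thread sets the condition and calls notify().
>>
>> public class WaitForever {
>>
>>   private boolean done = false;  // nothing in this toy program ever sets this to true
>>
>>   public synchronized void awaitResult() throws InterruptedException {
>>     while (!done) {  // same shape as the loop in HBaseClient.call()
>>       wait();        // no timeout: the thread stays here until notified or interrupted
>>     }
>>   }
>>
>>   public static void main(String[] args) throws InterruptedException {
>>     new WaitForever().awaitResult();  // hangs here, just like the stuck reducer
>>   }
>> }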
>>
>> Meanwhile, once this has happened, I cannot shut down the master server
>> normally; I have to kill -9 it to make it stop. Before this problem occurs,
>> the master shuts down just fine. (Sorry, I didn't take a thread dump of the
>> master, and I have since downgraded to 0.20.3 again.)
>>
>> I cannot reproduce this error on my local setup (developer machine). It only
>> occurs on our (currently modest) cluster: one machine running
>> master+NN+ZooKeeper and four datanodes which are all task trackers and
>> region servers as well. The inputs to the periodic MapReduce jobs are very
>> small (ranging from a few KB to several MB) and thus don't contain many
>> records. I know this is not very efficient to do in MapReduce, and inserting
>> in-process from the importer would be faster because of the job startup
>> overhead, but we are setting up this architecture of importers plus MapReduce
>> insertion in anticipation of larger loads (up to 80 million records per day).
>>
>> Does anyone have a clue about what is happening, or where to look for further
>> investigation?
>>
>> Thanks a lot!
>>
>>
>> Cheers,
>> Friso
>>
>>
>



-- 
Todd Lipcon
Software Engineer, Cloudera
