Hi,

I was kind of hoping that this was a known thing and I was just overlooking 
something. Apparently it requires more investigation.

I reproduced the situation and gathered some more info. I still have to look 
through everything myself a bit more (today is a national holiday in the 
Netherlands, so not much work is getting done), but I posted some info to pastebin.

I did the following sequence (with HBase 0.20.4):
- start up HBase (waited for all the regions to come online and let it settle)
- start up our application
- wait for the importer job to hang (it only happened on the second run, which 
started 15 reducers; the first run was really small and only one key was 
generated, so just one reducer)
- kill the hanging importer job (hadoop job -kill)
- try to shut down HBase (as I type this, it is still producing dots on my 
console)

The HBase master logs are here (includes shutdown attempt): 
http://pastebin.com/PYpPVcyK
The jstacks are here:
- HMaster: http://pastebin.com/Da6jCAuA (this includes two thread dumps, one 
during operation with the hanging clients and one during hanging shutdown)
- RegionServer 1: http://pastebin.com/5dQXfxCn
- RegionServer 2: http://pastebin.com/XWwBGXYC
- RegionServer 3: http://pastebin.com/mDgWbYGV
- RegionServer 4: http://pastebin.com/XDR14bth

As you can see in the master logs, the shutdown cannot get a thread called 
Thread-10 to stop running. The trace for that thread looks like this:
"Thread-10" prio=10 tid=0x000000004d218800 nid=0x1e73 in Object.wait() 
[0x00000000427a7000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00002aaab364c9d0> (a java.lang.Object)
        at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:89)
        - locked <0x00002aaab364c9d0> (a java.lang.Object)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:76)
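
For context, the Chore/Sleeper combination works roughly like this (my 
paraphrase of the 0.20 source, not verbatim, and the field names are made up): 
the chore thread loops on a shared stop flag and sleeps between iterations 
with a timed wait, so the TIMED_WAITING state above just means it woke up, 
found the flag still unset, and went back to sleep:

    import java.util.concurrent.atomic.AtomicBoolean;

    // paraphrased sketch of the Chore/Sleeper pattern, not the literal source
    abstract class ChoreSketch extends Thread {
        private final AtomicBoolean stop;       // shared shutdown flag
        private final int period;               // sleep period in ms
        private final Object sleepLock = new Object();

        ChoreSketch(int period, AtomicBoolean stop) {
            this.period = period;
            this.stop = stop;
        }

        protected abstract void chore();        // the periodic work

        public void run() {
            while (!stop.get()) {
                chore();
                synchronized (sleepLock) {
                    try {
                        // this is where Thread-10 sits: a timed wait on the
                        // sleep lock (Sleeper.sleep in the real code)
                        sleepLock.wait(period);
                    } catch (InterruptedException e) {
                        // an interrupt would only end the sleep early
                    }
                }
            }
        }
    }

If that reading is right, the thread is not actually stuck: it wakes up every 
period, but whatever is supposed to set the stop flag during shutdown 
apparently never does, so the master keeps waiting for the thread to exit.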

I still have no clue what happened, but I will investigate a bit more tomorrow.


Thanks for the responses.


Friso



On May 12, 2010, at 9:02 PM, Todd Lipcon wrote:

> Hi Friso,
> 
> Also, if you can capture a jstack of the regionservers at the time,
> that would be great.
> 
> -Todd
> 
> On Wed, May 12, 2010 at 9:26 AM, Jean-Daniel Cryans <jdcry...@apache.org> 
> wrote:
>> Friso,
>> 
>> Unfortunately it's hard to determine the cause from the provided
>> information; the client call you pasted is pretty much normal, i.e. the
>> client is waiting to receive a result from a region server.
>> 
>> The fact that you can't shut down the master when this happens is very
>> concerning. Do you still have those logs around? Same for the region
>> servers? Can you post this in pastebin or on a web server?
>> 
>> Also, feel free to come chat with us on IRC; it's always easier to
>> debug live. #hbase on freenode
>> 
>> J-D
>> 
>> On Wed, May 12, 2010 at 8:31 AM, Friso van Vollenhoven
>> <fvanvollenho...@xebia.com> wrote:
>>> Hi all,
>>> 
>>> I am using Hadoop (0.20.2) and HBase to periodically import data (every 15 
>>> minutes). There are a number of import processes, but generally they all 
>>> create a sequence file on HDFS, which is then run through a MapReduce job. 
>>> The MapReduce job uses the identity mapper (the input file is a Hadoop sequence 
>>> file) and a specialized reducer that does the following:
>>> - Combine the values for a key into one value
>>> - Do a Get from HBase to retrieve existing values for the same key
>>> - Combine the existing value from HBase and the new one into one value again
>>> - Put the final value into HBase under the same key (thus 'overwriting' the 
>>> existing row; I keep only one version; see the sketch after this list)
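>>> 
>>> To make that concrete, here is a stripped-down sketch of the reducer 
>>> (MyRecord, combine() and merge() are placeholders for our own code, and the 
>>> table and column names are made up; only the HTable/Get/Put calls are the 
>>> actual 0.20 client API):
>>> 
>>>     import java.io.IOException;
>>>     import org.apache.hadoop.hbase.HBaseConfiguration;
>>>     import org.apache.hadoop.hbase.client.*;
>>>     import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>>>     import org.apache.hadoop.hbase.util.Bytes;
>>>     import org.apache.hadoop.io.NullWritable;
>>>     import org.apache.hadoop.mapreduce.Reducer;
>>> 
>>>     public class ReadMergeWriteReducer extends
>>>             Reducer<ImmutableBytesWritable, MyRecord, NullWritable, NullWritable> {
>>> 
>>>         private static final byte[] FAMILY = Bytes.toBytes("data");
>>>         private static final byte[] QUALIFIER = Bytes.toBytes("value");
>>>         private HTable table;
>>> 
>>>         protected void setup(Context context) throws IOException {
>>>             table = new HTable(new HBaseConfiguration(), "records");
>>>         }
>>> 
>>>         protected void reduce(ImmutableBytesWritable key, Iterable<MyRecord> values,
>>>                 Context context) throws IOException, InterruptedException {
>>>             MyRecord combined = combine(values);              // 1. combine new values
>>>             Result existing = table.get(new Get(key.get()));  // 2. the Get that hangs
>>>             MyRecord merged = merge(existing, combined);      // 3. merge old and new
>>>             Put put = new Put(key.get());                     // 4. overwrite the row
>>>             put.add(FAMILY, QUALIFIER, merged.toBytes());
>>>             table.put(put);
>>>         }
>>> 
>>>         // placeholders for our application-specific logic
>>>         private MyRecord combine(Iterable<MyRecord> values) { return null; }
>>>         private MyRecord merge(Result existing, MyRecord combined) { return null; }
>>>     }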
>>> 
>>> After I upgraded HBase to the 0.20.4 release, the reducers sometimes start 
>>> hanging on a Get. When the jobs start, some reducers run to completion 
>>> fine, but after a while the last reducers will start to hang. Eventually 
>>> the reducers are killed off by Hadoop (after 600 secs).
>>> 
>>> I did a thread dump for one of the hanging reducers. It looks like this:
>>> "main" prio=10 tid=0x0000000048083800 nid=0x4c93 in Object.wait() 
>>> [0x00000000420ca000]
>>>   java.lang.Thread.State: WAITING (on object monitor)
>>>        at java.lang.Object.wait(Native Method)
>>>        - waiting on <0x00002aaaaeb50d70> (a 
>>> org.apache.hadoop.hbase.ipc.HBaseClient$Call)
>>>        at java.lang.Object.wait(Object.java:485)
>>>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:721)
>>>        - locked <0x00002aaaaeb50d70> (a 
>>> org.apache.hadoop.hbase.ipc.HBaseClient$Call)
>>>        at 
>>> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>        at $Proxy2.get(Unknown Source)
>>>        at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:450)
>>>        at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:448)
>>>        at 
>>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1050)
>>>        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:447)
>>>        at 
>>> net.ripe.inrdb.hbase.accessor.real.HBaseTableAccessor.get(HBaseTableAccessor.java:36)
>>>        at 
>>> net.ripe.inrdb.hbase.store.HBaseStoreUpdater.getExistingRecords(HBaseStoreUpdater.java:101)
>>>        at 
>>> net.ripe.inrdb.hbase.store.HBaseStoreUpdater.mergeTimelinesWithExistingRecords(HBaseStoreUpdater.java:60)
>>>        at 
>>> net.ripe.inrdb.hbase.store.HBaseStoreUpdater.doInsert(HBaseStoreUpdater.java:40)
>>>        at 
>>> net.ripe.inrdb.core.store.SinglePartitionStore$Updater.insert(SinglePartitionStore.java:92)
>>>        at 
>>> net.ripe.inrdb.core.store.CompositeStore$CompositeStoreUpdater.insert(CompositeStore.java:142)
>>>        at 
>>> net.ripe.inrdb.importer.StoreInsertReducer.reduce(StoreInsertReducer.java:70)
>>>        at 
>>> net.ripe.inrdb.importer.StoreInsertReducer.reduce(StoreInsertReducer.java:17)
>>>        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>>        at 
>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>>>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>> 
>>> So the client hangs in a wait() call, waiting on an HBaseClient$Call object. 
>>> I looked at the code. The wait is in a while() loop and has no timeout, so 
>>> it follows that it never gets out of there if no notify() is ever called on 
>>> the object. I am not sure exactly what condition it is waiting for, however.
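>>> 
>>> The pattern in question looks roughly like this (paraphrased from my reading 
>>> of the 0.20.4 HBaseClient source, not verbatim; 'done' is the flag the 
>>> receiving thread is supposed to set before calling notify()):
>>> 
>>>     // inside HBaseClient.call(), as I read it:
>>>     synchronized (call) {
>>>         while (!call.done) {
>>>             try {
>>>                 // no timeout: only a notify() on 'call' wakes us up, so a
>>>                 // lost or never-sent notify blocks the caller forever
>>>                 call.wait();
>>>             } catch (InterruptedException e) {
>>>                 // interrupts are swallowed here; we just keep waiting
>>>             }
>>>         }
>>>     }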
>>> 
>>> Meanwhile, after this has happened, I cannot shut down the master server 
>>> normally; I have to kill -9 it to make it go away. Before this problem 
>>> occurs, the master shuts down just fine. (Sorry, I didn't do a thread dump 
>>> of the master, and I have now downgraded to 0.20.3 again.)
>>> 
>>> I cannot reproduce this error on my local setup (developer machine). It 
>>> only occurs on our (currently modest) cluster of one machine running 
>>> master+NN+ZooKeeper and four datanodes, which are all task trackers and 
>>> region servers as well. The inputs to the periodic MapReduce jobs are very 
>>> small (ranging from a few KB to several MB) and thus do not contain many 
>>> records. I know that, because of job startup overhead, this is not very 
>>> efficient to do in MapReduce and that inserting in-process from the 
>>> importer would be faster, but we are setting up this architecture of 
>>> importers and MapReduce insertion in anticipation of larger loads (up to 
>>> 80 million records per day).
>>> 
>>> Does anyone have a clue about what happens? Or where to look for further 
>>> investigation?
>>> 
>>> Thanks a lot!
>>> 
>>> 
>>> Cheers,
>>> Friso
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera
