Hi Friso,

When did you take the jstack dumps of the region servers? Was it when the
reduce tasks were still hanging?

Do all of the reduce tasks hang or is it just one that gets stuck?

If, once the reduce tasks are hung, you open the hbase shell and run "count
'mytable', 100000" does it successfully count the rows?

(I'm trying to determine whether it's the client that is locked up, or one
of the RSes)
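
If it helps, a tiny standalone client along these lines would answer the
same question from Java (a rough sketch: "mytable" is a placeholder, and it
assumes your hbase-site.xml is on the classpath):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class CountRows {
  public static void main(String[] args) throws Exception {
    // Picks up hbase-site.xml from the classpath; "mytable" is a placeholder.
    HTable table = new HTable(new HBaseConfiguration(), "mytable");
    ResultScanner scanner = table.getScanner(new Scan());
    long count = 0;
    for (Result row : scanner) {
      if (++count % 100000 == 0) {
        System.out.println(count + " rows so far...");
      }
    }
    scanner.close();
    System.out.println("Total rows: " + count);
  }
}

If that scan blocks as well, that would point at the server side rather
than at the client in your reducers.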

Enjoy your holiday!

-Todd

On Thu, May 13, 2010 at 12:38 AM, Friso van Vollenhoven <
fvanvollenho...@xebia.com> wrote:

> Hi,
>
> I was kind of hoping that this was a known thing and I was just overlooking
> something. Apparently it requires more investigation.
>
> I reproduced the situation and gathered some more info. I still have to
> look through everything myself a bit more (today is a national holiday in
> The Netherlands, so not much work gets done), but I posted some info to
> pastebin.
>
> I did the following sequence (with HBase 0.20.4):
> - startup HBase (waited for all the regions to come online and let it
> settle)
> - startup our application
> - wait for the importer job to hang (it only happened on the second run,
> which started 15 reducers; the first run was really small and only one key
> was generated, so just one reducer)
> - kill the hanging importer job (hadoop job -kill)
> - try to shut down HBase (as I type this, it is still producing dots on my
> console)
>
> The HBase master logs are here (includes shutdown attempt):
> http://pastebin.com/PYpPVcyK
> The jstacks are here:
> - HMaster: http://pastebin.com/Da6jCAuA (this includes two thread dumps,
> one taken while the clients were hanging and one during the hanging
> shutdown)
> - RegionServer 1: http://pastebin.com/5dQXfxCn
> - RegionServer 2: http://pastebin.com/XWwBGXYC
> - RegionServer 3: http://pastebin.com/mDgWbYGV
> - RegionServer 4: http://pastebin.com/XDR14bth
>
> As you can see in the master logs, the shutdown cannot get a thread called
> Thread-10 to stop running. The trace for that thread looks like this:
> "Thread-10" prio=10 tid=0x000000004d218800 nid=0x1e73 in Object.wait()
> [0x00000000427a7000]
>   java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00002aaab364c9d0> (a java.lang.Object)
>        at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:89)
>        - locked <0x00002aaab364c9d0> (a java.lang.Object)
>        at org.apache.hadoop.hbase.Chore.run(Chore.java:76)
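>
> If I read the code right, the Chore/Sleeper pattern is roughly this (my
> own simplified sketch with placeholder names, not the actual 0.20.4
> source):
>
> import java.util.concurrent.atomic.AtomicBoolean;
>
> // Simplified sketch of the Chore/Sleeper pattern (placeholder names).
> class ChoreSketch extends Thread {
>   private final AtomicBoolean stop = new AtomicBoolean(false);
>   private final long period = 60 * 1000L; // wake-up period in ms
>
>   public void run() {
>     while (!stop.get()) {
>       chore(); // the periodic work
>       synchronized (this) {
>         try {
>           wait(period); // Sleeper.sleep(): TIMED_WAITING, like Thread-10
>         } catch (InterruptedException e) {
>           // ignore and re-check the stop flag
>         }
>       }
>     }
>   }
>
>   void chore() { /* periodic work goes here */ }
>
>   void requestStop() {
>     stop.set(true);
>     synchronized (this) {
>       notifyAll(); // wake the sleeper so the loop can exit
>     }
>   }
> }
>
> So if nothing ever flips the stop flag, the thread just keeps cycling
> through TIMED_WAITING, which would match what the shutdown attempt in the
> master logs shows.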
>
> I still have no clue what happened, but I will investigate a bit more
> tomorrow.
>
>
> Thanks for the responses.
>
>
> Friso
>
>
>
> On May 12, 2010, at 9:02 PM, Todd Lipcon wrote:
>
> > Hi Friso,
> >
> > Also, if you can capture a jstack of the regionservers at the time
> > that would be great.
> >
> > -Todd
> >
> > On Wed, May 12, 2010 at 9:26 AM, Jean-Daniel Cryans <jdcry...@apache.org>
> wrote:
> >> Friso,
> >>
> >> Unfortunately it's hard to determine the cause from the provided
> >> information; the client call you pasted is pretty much normal, i.e. the
> >> client is waiting to receive a result from a region server.
> >>
> >> The fact that you can't shut down the master when this happens is very
> >> concerning. Do you still have those logs around? Same for the region
> >> servers? Can you post this in pastebin or on a web server?
> >>
> >> Also, feel free to come chat with us on IRC, it's always easier to
> >> debug when live. #hbase on freenode
> >>
> >> J-D
> >>
> >> On Wed, May 12, 2010 at 8:31 AM, Friso van Vollenhoven
> >> <fvanvollenho...@xebia.com> wrote:
> >>> Hi all,
> >>>
> >>> I am using Hadoop (0.20.2) and HBase to periodically import data (every
> 15 minutes). There are a number of import processes, but generally they all
> create a sequence file on HDFS, which is then run through a MapReduce job.
> The job uses the identity mapper (the input is a Hadoop sequence file) and
> a specialized reducer that does the following (a rough sketch follows the
> list):
> >>> - Combine the values for a key into one value
> >>> - Do a Get from HBase to retrieve existing values for the same key
> >>> - Combine the existing value from HBase and the new one into one value
> again
> >>> - Put the final value into HBase under the same key (thus 'overwrite'
> the existing row; I keep only one version)
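> >>>
> >>> As a rough sketch (placeholder table/column names and plain byte[]
> >>> values, not our real net.ripe.inrdb classes), the reducer does
> >>> something like this:
> >>>
> >>> import java.io.IOException;
> >>>
> >>> import org.apache.hadoop.hbase.HBaseConfiguration;
> >>> import org.apache.hadoop.hbase.client.Get;
> >>> import org.apache.hadoop.hbase.client.HTable;
> >>> import org.apache.hadoop.hbase.client.Put;
> >>> import org.apache.hadoop.hbase.client.Result;
> >>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> >>> import org.apache.hadoop.hbase.util.Bytes;
> >>> import org.apache.hadoop.io.BytesWritable;
> >>> import org.apache.hadoop.io.NullWritable;
> >>> import org.apache.hadoop.mapreduce.Reducer;
> >>>
> >>> public class StoreInsertReducerSketch extends
> >>>     Reducer<ImmutableBytesWritable, BytesWritable, NullWritable, NullWritable> {
> >>>
> >>>   private HTable table;
> >>>
> >>>   protected void setup(Context context) throws IOException {
> >>>     // "mytable" is a placeholder table name.
> >>>     table = new HTable(new HBaseConfiguration(), "mytable");
> >>>   }
> >>>
> >>>   protected void reduce(ImmutableBytesWritable key,
> >>>       Iterable<BytesWritable> values, Context context)
> >>>       throws IOException, InterruptedException {
> >>>     byte[] combined = combine(values); // 1. combine the new values
> >>>
> >>>     // 2. Get the existing row (this is the call that hangs)
> >>>     Result existing = table.get(new Get(key.get()));
> >>>
> >>>     // 3. merge the existing value and the new one
> >>>     byte[] merged = merge(existing.value(), combined);
> >>>
> >>>     // 4. overwrite the row (we keep only one version)
> >>>     Put put = new Put(key.get());
> >>>     put.add(Bytes.toBytes("data"), Bytes.toBytes("value"), merged);
> >>>     table.put(put);
> >>>   }
> >>>
> >>>   protected void cleanup(Context context) throws IOException {
> >>>     table.close();
> >>>   }
> >>>
> >>>   // Placeholder merge logic; the real combining is domain specific.
> >>>   private byte[] combine(Iterable<BytesWritable> values) {
> >>>     byte[] result = new byte[0];
> >>>     for (BytesWritable value : values) {
> >>>       result = Bytes.add(result, Bytes.head(value.getBytes(), value.getLength()));
> >>>     }
> >>>     return result;
> >>>   }
> >>>
> >>>   private byte[] merge(byte[] old, byte[] fresh) {
> >>>     return old == null ? fresh : Bytes.add(old, fresh);
> >>>   }
> >>> }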
> >>>
> >>> After I upgraded HBase to the 0.20.4 release, the reducers sometimes
> start hanging on a Get. When the jobs start, some reducers run to completion
> fine, but after a while the last reducers start to hang. Eventually the
> reducers are killed off by Hadoop (after 600 secs).
> >>>
> >>> I did a thread dump for one of the hanging reducers. It looks like
> this:
> >>> "main" prio=10 tid=0x0000000048083800 nid=0x4c93 in Object.wait()
> [0x00000000420ca000]
> >>>   java.lang.Thread.State: WAITING (on object monitor)
> >>>        at java.lang.Object.wait(Native Method)
> >>>        - waiting on <0x00002aaaaeb50d70> (a
> org.apache.hadoop.hbase.ipc.HBaseClient$Call)
> >>>        at java.lang.Object.wait(Object.java:485)
> >>>        at
> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:721)
> >>>        - locked <0x00002aaaaeb50d70> (a
> org.apache.hadoop.hbase.ipc.HBaseClient$Call)
> >>>        at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
> >>>        at $Proxy2.get(Unknown Source)
> >>>        at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:450)
> >>>        at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:448)
> >>>        at
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1050)
> >>>        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:447)
> >>>        at
> net.ripe.inrdb.hbase.accessor.real.HBaseTableAccessor.get(HBaseTableAccessor.java:36)
> >>>        at
> net.ripe.inrdb.hbase.store.HBaseStoreUpdater.getExistingRecords(HBaseStoreUpdater.java:101)
> >>>        at
> net.ripe.inrdb.hbase.store.HBaseStoreUpdater.mergeTimelinesWithExistingRecords(HBaseStoreUpdater.java:60)
> >>>        at
> net.ripe.inrdb.hbase.store.HBaseStoreUpdater.doInsert(HBaseStoreUpdater.java:40)
> >>>        at
> net.ripe.inrdb.core.store.SinglePartitionStore$Updater.insert(SinglePartitionStore.java:92)
> >>>        at
> net.ripe.inrdb.core.store.CompositeStore$CompositeStoreUpdater.insert(CompositeStore.java:142)
> >>>        at
> net.ripe.inrdb.importer.StoreInsertReducer.reduce(StoreInsertReducer.java:70)
> >>>        at
> net.ripe.inrdb.importer.StoreInsertReducer.reduce(StoreInsertReducer.java:17)
> >>>        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
> >>>        at
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
> >>>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
> >>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >>>
> >>> So the client hangs in a wait() call, waiting on a HBaseClient$Call
> object. I looked at the code. The wait is in a while() loop and has no
> timeout, so it follows that it never gets out of there if notify() is never
> called on the object. I am not sure exactly what condition it is waiting
> for, however.
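> >>>
> >>> Paraphrasing what I read (a simplified sketch of the pattern, not the
> >>> exact HBaseClient source):
> >>>
> >>> // Sketch of the wait/notify handshake around HBaseClient$Call.
> >>> class CallSketch {
> >>>   private boolean done;  // set by the connection's receiver thread
> >>>   private Object value;  // the RPC result
> >>>
> >>>   // What the calling thread does in HBaseClient.call():
> >>>   synchronized Object waitForResult() throws InterruptedException {
> >>>     while (!done) { // no timeout: if notify() never comes, waits forever
> >>>       wait();
> >>>     }
> >>>     return value;
> >>>   }
> >>>
> >>>   // What the receiver thread does when a response arrives:
> >>>   synchronized void complete(Object result) {
> >>>     value = result;
> >>>     done = true;
> >>>     notify(); // wakes the caller; if this never runs, the client hangs
> >>>   }
> >>> }
> >>>
> >>> If that reading is right, the call never completes because the response
> >>> for this particular call never arrives (or the wake-up is lost).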
> >>>
> >>> Meanwhile, after this has happened, I cannot shut down the master
> server normally; I have to kill -9 it. Before this problem occurs, the
> master server shuts down just fine. (Sorry, I didn't do a thread dump of
> the master, and I have now downgraded to 0.20.3 again.)
> >>>
> >>> I cannot reproduce this error on my local setup (developer machine). It
> only occurs on our (currently modest) cluster of one machine running
> master+NN+ZooKeeper and four datanodes, which are all task trackers and
> region servers as well. The inputs to the periodic MapReduce jobs are very
> small (ranging from a few KB to several MB) and thus do not contain many
> records. I know that, because of job startup overhead, doing this in
> MapReduce is not very efficient and inserting in-process from the importer
> would be faster, but we are setting up this architecture of importers and
> insertion in anticipation of larger loads (up to 80 million records per
> day).
> >>>
> >>> Does anyone have a clue about what is happening, or where to look for
> further investigation?
> >>>
> >>> Thanks a lot!
> >>>
> >>>
> >>> Cheers,
> >>> Friso
> >>>
> >>>
> >>
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
