I'm just guessing, but maybe the region servers are stuck because they can't
communicate with ZooKeeper or the HMaster, or maybe the datanodes can't talk
to the namenode because of some VMware networking quirk and an internal
thread is hanging; who knows. It would take 30 minutes to set up a bare-metal
cluster versus days debugging this hybrid setup.
I'd also suggest installing directly from the hbase/hadoop websites to
eliminate additional variables.




On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:

> Only the master nodes (HMaster, ZK and NN) are VMs. The
> datanodes/regionservers with the stuck processes are all bare metal.
>
> -chris
>
> On May 20, 2014, at 11:52 AM, alex kamil <alex.ka...@gmail.com> wrote:
>
> >> "with 3 VMWare "master" nodes"
> >
> > > Can you try running hbase/phoenix on physical nodes instead of virtual
> > > machines?
> >
> >
> > On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <c...@biotiquesystems.com>
> > wrote:
> >
> >> Sorry to follow up on my own message, but I was wondering if anyone had
> >> any ideas? Normal non-Phoenix scans don't cause this symptom, but a
> >> select * run right afterwards on the exact same table will.
> >>
> >> If we export the table and then re-import it into a new table, the new
> >> table doesn't exhibit these symptoms, same as if we use an upsert..select
> >> to do a copy. It seems something happens to the last region to cause
> >> this, but it is not directly data dependent. Moving the region to another
> >> regionserver doesn't have any effect - it just moves where the problem
> >> happens. Major compactions get hung up by the running threads, which
> >> probably hold a lock.
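> >>
> >> For reference, the copies and the region move were done roughly like this
> >> (table names, paths, and the target server are placeholders):
> >>
> >>   # bulk copy via export/import
> >>   hbase org.apache.hadoop.hbase.mapreduce.Export ORIG_TABLE /tmp/orig_export
> >>   hbase org.apache.hadoop.hbase.mapreduce.Import NEW_TABLE /tmp/orig_export
> >>
> >>   -- copy via Phoenix (from sqlline.py)
> >>   UPSERT INTO NEW_TABLE SELECT * FROM ORIG_TABLE;
> >>
> >>   # move the last region from the hbase shell
> >>   move 'ENCODED_REGION_NAME', 'otherhost,60020,1400000000000'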
> >>
> >> I've run the hfile tool on the final region and nothing seems awry.
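> >>
> >> (The hfile check was done with the bundled tool, roughly:
> >>   hbase org.apache.hadoop.hbase.io.hfile.HFile -v -m -f <path-to-hfile-in-hdfs>
> >> where the path points at a store file of that last region.)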
> >>
> >> Figuring this out will allow this project to continue; as of now it is
> >> hung up on this issue.
> >>
> >> thank you,
> >> -chris
> >>
> >>
> >> On May 15, 2014, at 8:47 PM, Chris Tarnas <c...@biotiquesystems.com>
> wrote:
> >>
> >>> I did some "poor man's" profiling with repeated jstacks and found where
> >>> the RpcServer.handler thread appears to be stuck:
> >>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
> >>> That is the deepest common frame in all of the traces, at either
> >>> HStore.java:1712 or HStore.java:1722. Here are two example traces for a
> >>> thread (which has been running for the last couple of hours); a sketch of
> >>> the sampling loop follows the traces.
> >>>
> >>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800
> >> nid=0x727b runnable [0x00007f4b49e9e000]
> >>>  java.lang.Thread.State: RUNNABLE
> >>>      at
> >>
> org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
> >>>      at
> >>
> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
> >>>      at
> >>
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
> >>>      at
> >>
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
> >>>      at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >>>      at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >>>      at
> >>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> >>>      at
> >>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> >>>      at
> >>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> >>>      at java.lang.Thread.run(Thread.java:744)
> >>>
> >>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800
> >> nid=0x727b runnable [0x00007f4b49e9e000]
> >>>  java.lang.Thread.State: RUNNABLE
> >>>      at
> >>
> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
> >>>      at
> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
> >>>      at
> >>
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
> >>>      at
> >>
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
> >>>      at
> >>
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
> >>>      at
> >>
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
> >>>      at
> >>
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
> >>>      at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >>>      at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >>>      at
> >>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> >>>      at
> >>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> >>>      at
> >>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> >>>      at java.lang.Thread.run(Thread.java:744)
> >>>
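> >>> Roughly the sampling loop used (the PID lookup, interval, and output
> >>> paths are illustrative):
> >>>
> >>>   RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
> >>>   for i in $(seq 1 20); do
> >>>     jstack "$RS_PID" > /tmp/rs-jstack.$i
> >>>     sleep 5
> >>>   done
> >>>   # see which frames the hot handler keeps landing in
> >>>   grep -h -A 6 'RpcServer.handler' /tmp/rs-jstack.* | grep ' at org.apache' \
> >>>     | sort | uniq -c | sort -rn | head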
> >>>
> >>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com>
> >> wrote:
> >>>
> >>>>
> >>>> Hey Chris,
> >>>>
> >>>> I used the performance.py tool to create a table with 50K rows, ran the
> >>>> following query from sqlline.py, and everything seems fine; I don't see
> >>>> the CPU running hot.
> >>>>
> >>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
> >>>> PERFORMANCE_50000;
> >>>> +------------+
> >>>> |  COUNT(1)  |
> >>>> +------------+
> >>>> | 50000      |
> >>>> +------------+
> >>>> 1 row selected (0.166 seconds)
> >>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
> >>>> PERFORMANCE_50000;
> >>>> +------------+
> >>>> |  COUNT(1)  |
> >>>> +------------+
> >>>> | 50000      |
> >>>> +------------+
> >>>> 1 row selected (0.167 seconds)
> >>>>
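> >>>> (For reference, the table above came from something like
> >>>> "performance.py <zookeeper-quorum> 50000", which creates and populates
> >>>> PERFORMANCE_50000; the quorum argument is a placeholder.)
> >>>>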
> >>>> Is there any way you could run a profiler to see where the CPU goes?
> >>>>
> >>>>
> >>>>
> >>>> On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
> >>>>
> >>>>> Ahh, yes. Here is a pastebin for it:
> >>>>>
> >>>>> http://pastebin.com/w6mtabag
> >>>>>
> >>>>> thanks again,
> >>>>> -chris
> >>>>>
> >>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com>
> wrote:
> >>>>>
> >>>>>> Hi Chris,
> >>>>>>
> >>>>>> Attachments are filtered out by the mail server. Can you pastebin it
> >>>>>> some
> >>>>>> place?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Nick
> >>>>>>
> >>>>>>
> >>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
> >>>>>> <c...@biotiquesystems.com> wrote:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> We set the HBase regionserver handler count to 10 (it appears to
> >>>>>>> have been set to 60 by Ambari during the install process). Now we
> >>>>>>> have narrowed down what causes the CPU to increase and have some
> >>>>>>> detailed logs:
> >>>>>>>
> >>>>>>> If we connect using sqlline.py and execute a select that fetches one
> >>>>>>> row by the primary key, no increase in CPU is observed and the number
> >>>>>>> of RPC threads in a RUNNABLE state remains the same.
> >>>>>>>
> >>>>>>> If we execute a select that scans the table, such as "select
> >>>>>>> count(*) from TABLE", or one whose "where" clause filters only on
> >>>>>>> non-primary-key attributes, then the number of RUNNABLE
> >>>>>>> RpcServer.handler threads increases and the CPU utilization of the
> >>>>>>> regionserver increases by ~105%.
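> >>>>>>>
> >>>>>>> Concretely, the two query shapes (table and column names are just
> >>>>>>> placeholders):
> >>>>>>>
> >>>>>>>   -- fine: point lookup on the primary key
> >>>>>>>   SELECT * FROM MY_TABLE WHERE MY_PK = 'some-key';
> >>>>>>>
> >>>>>>>   -- leaves a handler RUNNABLE: full scan / non-PK filter
> >>>>>>>   SELECT COUNT(*) FROM MY_TABLE;
> >>>>>>>   SELECT * FROM MY_TABLE WHERE SOME_ATTR = 'x';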
> >>>>>>>
> >>>>>>> Disconnecting the client has no effect: the RpcServer.handler thread
> >>>>>>> is left RUNNABLE and the CPU stays at the higher usage.
> >>>>>>>
> >>>>>>> Checking the web console for the regionserver just shows 10
> >>>>>>> RpcServer.reader tasks, all in a WAITING state; no other monitored
> >>>>>>> tasks are running. The regionserver has a max heap of 10G and a used
> >>>>>>> heap of 445.2M.
> >>>>>>>
> >>>>>>> I've attached the regionserver log with IPC debug logging turned on
> >>>>>>> right
> >>>>>>> when one of the Phoenix statements is executed (this statement
> >> actually
> >>>>>>> used up the last available handler).
> >>>>>>>
> >>>>>>> thanks,
> >>>>>>> -chris
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> From the stack, it seems you increased the default RPC handler
> >>>>>>>> number to about 60. All handlers are serving Get requests (you can
> >>>>>>>> search for
> >>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
> >>>>>>>>
> >>>>>>>> You can check why there are so many Get requests by adding some log
> >>>>>>>> info or enabling HBase RPC trace. I guess that decreasing the number
> >>>>>>>> of RPC handlers per region server will mitigate your current issue.
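> >>>>>>>>
> >>>>>>>> For example, roughly (the handler-count property is standard; the
> >>>>>>>> exact logger name for RPC tracing can vary by HBase version):
> >>>>>>>>
> >>>>>>>>   <!-- hbase-site.xml on the regionservers -->
> >>>>>>>>   <property>
> >>>>>>>>     <name>hbase.regionserver.handler.count</name>
> >>>>>>>>     <value>10</value>
> >>>>>>>>   </property>
> >>>>>>>>
> >>>>>>>>   # log4j.properties, to trace the RPC layer
> >>>>>>>>   log4j.logger.org.apache.hadoop.hbase.ipc=TRACE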
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com>
> wrote:
> >>>>>>>>
> >>>>>>>>> We have hit a problem with Phoenix: regionserver CPU usage spikes
> >>>>>>>>> up to use all available CPU and the regionservers become
> >>>>>>>>> unresponsive.
> >>>>>>>>>
> >>>>>>>>> After HDP 2.1 was released we set up a 4 compute node cluster (with
> >>>>>>>>> 3 VMWare "master" nodes) to test out Phoenix on it. It is a plain
> >>>>>>>>> Ambari 1.5/HDP 2.1 install; we added the HDP Phoenix RPM release and
> >>>>>>>>> hand-linked the jar files into the hadoop lib. Everything was going
> >>>>>>>>> well and we were able to load ~30k records into several tables. What
> >>>>>>>>> happened was that after about 3-4 days of being up the regionservers
> >>>>>>>>> became unresponsive and started to use most of the available CPU (12
> >>>>>>>>> core boxes). Nothing terribly informative was in the logs (initially
> >>>>>>>>> we saw some flush messages that seemed excessive, but that was not
> >>>>>>>>> all of the time, and we changed back to the standard HBase WAL
> >>>>>>>>> codec). We are able to kill the unresponsive regionservers and then
> >>>>>>>>> restart them; the cluster will be fine for a day or so but will
> >>>>>>>>> start to lock up again.
> >>>>>>>>>
> >>>>>>>>> We've dropped the entire HBase and ZooKeeper information and
> >>>>>>>>> started from scratch, but that has not helped.
> >>>>>>>>>
> >>>>>>>>> James Taylor suggested I send this off here. I've attached a
> jstack
> >>>>>>>>> report of a locked up regionserver in hopes that someone can shed
> >>>>>>>>> some
> >>>>>>>>> light.
> >>>>>>>>>
> >>>>>>>>> thanks,
> >>>>>>>>> -chris
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >>
>
>
