> "with 3 VMWare "master" nodes"

Can you try running HBase/Phoenix on physical nodes instead of using virtual machines?
On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:

> Sorry to follow up on my own message, but I was wondering if anyone had
> any ideas? Normal non-Phoenix scans don't cause this symptom, but run
> right after a select * on the exact same table, they will.
>
> If we export the table and then re-import it into a new table, the new
> table doesn't exhibit these symptoms; the same is true if we use an
> upsert ... select to do a copy. It seems something happens to the last
> region to cause this, but it is not directly data dependent. Moving the
> region to another regionserver doesn't have any effect; it just moves
> where the problem happens. Major compactions get hung up by the running
> threads, as they probably hold a lock.
>
> I've run the hfile tool on the final region and nothing seems awry.
>
> Figuring this out will allow this project to continue; as of now it is
> hung up on this issue.
>
> thank you,
> -chris
>
> On May 15, 2014, at 8:47 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>
>> I did some "poor man's" profiling with multiple jstacks and came up
>> with where the RpcServer.handler threads appear to be stuck:
>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>> That is the deepest method in all of the traces, either
>> HStore.java:1712 or HStore.java:1722.
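The multiple-jstack profiling just described can be sketched as a small shell helper. The `top_frames` function, the sampling loop, and `RS_PID` are illustrative, not from the thread; the idea is to tally the deepest frame of every RUNNABLE thread across repeated dumps so the hot frame bubbles to the top:

```shell
# Tally the deepest frame of each RUNNABLE thread across one or more
# jstack dumps read from stdin; the most frequent frames sort first.
top_frames() {
  awk '/java.lang.Thread.State: RUNNABLE/ { grab = 1; next }
       grab && /^[[:space:]]*at / { print $2; grab = 0 }' |
  sort | uniq -c | sort -rn
}

# Hypothetical usage against a live regionserver (RS_PID is a placeholder):
#   for i in 1 2 3 4 5; do jstack "$RS_PID"; sleep 2; done | top_frames
```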
>> Here are two example traces for a thread (which has been running for
>> the last couple of hours):
>>
>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>    java.lang.Thread.State: RUNNABLE
>>         at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>         at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>         at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>         at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>         at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>         at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>         at java.lang.Thread.run(Thread.java:744)
>>
>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>    java.lang.Thread.State: RUNNABLE
>>         at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>         at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>         at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>         at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>         at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>         at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>         at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>         at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>         at java.lang.Thread.run(Thread.java:744)
>>
>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>
>>> Hey Chris,
>>>
>>> I used the performance.py tool, which created one table with 50K rows,
>>> ran the following query from sqlline.py, and everything seems fine,
>>> without seeing the CPU running hot.
>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>> +------------+
>>> |  COUNT(1)  |
>>> +------------+
>>> |   50000    |
>>> +------------+
>>> 1 row selected (0.166 seconds)
>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>> +------------+
>>> |  COUNT(1)  |
>>> +------------+
>>> |   50000    |
>>> +------------+
>>> 1 row selected (0.167 seconds)
>>>
>>> Is there any way you could run a profiler to see where the CPU goes?
>>>
>>> On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>
>>>> Ahh, yes. Here is a pastebin for it:
>>>>
>>>> http://pastebin.com/w6mtabag
>>>>
>>>> thanks again,
>>>> -chris
>>>>
>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>
>>>>> Hi Chris,
>>>>>
>>>>> Attachments are filtered out by the mail server. Can you pastebin it
>>>>> some place?
>>>>>
>>>>> Thanks,
>>>>> Nick
>>>>>
>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We set the HBase regionserver handler count to 10 (it appears to
>>>>>> have been set to 60 by Ambari during the install process). Now we
>>>>>> have narrowed down what causes the CPU usage to increase, and have
>>>>>> some detailed logs:
>>>>>>
>>>>>> If we connect using sqlline.py and execute a select that fetches one
>>>>>> row by primary key, no increase in CPU is observed and the number of
>>>>>> RPC threads in a RUNNABLE state remains the same.
>>>>>>
>>>>>> If we execute a select that scans the table, such as "select count(*)
>>>>>> from TABLE", or one whose "where" clause only limits on
>>>>>> non-primary-key attributes, then the number of RUNNABLE
>>>>>> RpcServer.handler threads increases and the CPU utilization of the
>>>>>> regionserver increases by ~105%.
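The symptom just described (RpcServer.handler threads stuck in RUNNABLE after each full-table scan) can be watched with a one-off jstack counter. This helper is a sketch, not from the thread; `RS_PID` is a placeholder:

```shell
# Count how many RpcServer.handler threads are in state RUNNABLE in a
# single jstack dump read from stdin. A count that grows after each
# full-table scan matches the handler leak described above.
count_runnable_handlers() {
  awk '/^"RpcServer.handler/     { handler = 1; next }
       handler && /Thread.State/ { if (/RUNNABLE/) n++; handler = 0 }
       END                       { print n + 0 }'
}

# Hypothetical usage:
#   jstack "$RS_PID" | count_runnable_handlers
```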
>>>>>>
>>>>>> Disconnecting the client does not have an effect; the
>>>>>> RpcServer.handler thread is left RUNNABLE and the CPU stays at the
>>>>>> higher usage.
>>>>>>
>>>>>> Checking the web console for the regionserver just shows 10
>>>>>> RpcServer.reader tasks, all in a WAITING state; no other monitored
>>>>>> tasks are happening. The regionserver has a max heap of 10G and a
>>>>>> used heap of 445.2M.
>>>>>>
>>>>>> I've attached the regionserver log with IPC debug logging turned on
>>>>>> right when one of the Phoenix statements was executed (this
>>>>>> statement actually used up the last available handler).
>>>>>>
>>>>>> thanks,
>>>>>> -chris
>>>>>>
>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>
>>>>>>> From the stack, it seems you increased the default RPC handler
>>>>>>> number to about 60. All handlers are serving Get requests (you can
>>>>>>> search for
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>
>>>>>>> You can check why there are so many Get requests by adding some log
>>>>>>> info or enabling HBase RPC trace. I guess that if you decrease the
>>>>>>> number of RPC handlers per regionserver, it will mitigate your
>>>>>>> current issue.
>>>>>>>
>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>
>>>>>>>> We have hit a problem with Phoenix: regionserver CPU usage spikes
>>>>>>>> up to use all available CPU and the regionservers become
>>>>>>>> unresponsive.
>>>>>>>>
>>>>>>>> After HDP 2.1 was released we set up a 4-compute-node cluster
>>>>>>>> (with 3 VMWare "master" nodes) to test out Phoenix. It is a plain
>>>>>>>> Ambari 1.5/HDP 2.1 install; we added the HDP Phoenix RPM release
>>>>>>>> and hand-linked the jar files into the hadoop lib.
>>>>>>>> Everything was going well and we were able to load ~30k records
>>>>>>>> into several tables. What happened was that after about 3-4 days
>>>>>>>> of being up, the regionservers became unresponsive and started to
>>>>>>>> use most of the available CPU (12-core boxes). Nothing terribly
>>>>>>>> informative was in the logs (initially we saw some flush messages
>>>>>>>> that seemed excessive, but that was not all of the time, and we
>>>>>>>> changed back to the standard HBase WAL codec). We are able to kill
>>>>>>>> the unresponsive regionservers and then restart them; the cluster
>>>>>>>> will be fine for a day or so but will start to lock up again.
>>>>>>>>
>>>>>>>> We've dropped the entire HBase and ZooKeeper information and
>>>>>>>> started from scratch, but that has not helped.
>>>>>>>>
>>>>>>>> James Taylor suggested I send this off here. I've attached a
>>>>>>>> jstack report of a locked-up regionserver in hopes that someone
>>>>>>>> can shed some light.
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>> -chris
>>>>>>>
>>>>>>> --
>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>> NOTICE: This message is intended for the use of the individual or
>>>>>>> entity to which it is addressed and may contain information that is
>>>>>>> confidential, privileged and exempt from disclosure under
>>>>>>> applicable law. If the reader of this message is not the intended
>>>>>>> recipient, you are hereby notified that any printing, copying,
>>>>>>> dissemination, distribution, disclosure or forwarding of this
>>>>>>> communication is strictly prohibited. If you have received this
>>>>>>> communication in error, please contact the sender immediately and
>>>>>>> delete it from your system. Thank You.
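For reference, the handler-count change discussed in this thread (Ambari's 60 down to the 10 Chris settled on) lives in hbase-site.xml. A sketch, assuming the stock HBase property name:

```xml
<!-- hbase-site.xml: lower the RPC handler count per regionserver.
     10 is the value used in this thread; Ambari had set 60 here. -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>10</value>
</property>
```

The regionservers need a restart to pick this up. Jeffrey's RPC-trace suggestion is a log4j.properties change on the `org.apache.hadoop.hbase.ipc` logger (assumed logger name, based on the package in the stack traces).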