I did some "poor man's" profiling with multiple jstack dumps and narrowed down where the RpcServer.handler threads appear to be stuck: org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile. That is the deepest HStore frame in all of the traces, at either HStore.java:1712 or HStore.java:1722. A rough sketch of the sampling approach is below, followed by two example traces for a handler thread that has been running for the last couple of hours.
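This is only a sketch of the idea, not the exact tooling I ran. It just shells out to jstack repeatedly and counts which HBase frames the RpcServer.handler threads are sitting in; the region server pid is passed as the first argument, and the sample count and frame filter are placeholders to adjust:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Poor man's profiler: take repeated jstack samples of the region server and
// count how often each HBase frame appears under RpcServer.handler threads.
// Frames that dominate across samples are where the handlers spend their time.
public class JstackSampler {
    public static void main(String[] args) throws Exception {
        String pid = args[0];          // region server pid (placeholder)
        int samples = 20;              // number of jstack samples to take

        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i < samples; i++) {
            Process p = new ProcessBuilder("jstack", pid)
                .redirectErrorStream(true).start();
            BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
            boolean inHandler = false;
            String line;
            while ((line = r.readLine()) != null) {
                line = line.trim();
                if (line.startsWith("\"")) {
                    // Thread header line: only count frames from RpcServer.handler threads.
                    inHandler = line.contains("RpcServer.handler");
                } else if (inHandler && line.startsWith("at org.apache.hadoop.hbase")) {
                    Integer c = counts.get(line);
                    counts.put(line, c == null ? 1 : c + 1);
                }
            }
            p.waitFor();
            Thread.sleep(1000);        // wait a second between samples
        }

        // Print the most frequently sampled frames first.
        List<Map.Entry<String, Integer>> sorted =
            new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(sorted, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue();
            }
        });
        for (Map.Entry<String, Integer> e : sorted) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}

And here are the two example traces: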
"RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540) at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063) at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776) at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722) at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655) at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826) at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110) at java.lang.Thread.run(Thread.java:744) "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944) at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622) at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392) at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209) at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548) at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712) at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655) at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826) at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110) at java.lang.Thread.run(Thread.java:744) On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote: > > Hey Chris, > > I used performance.py tool which created a table with 50K rows in one > table, run the following query from sqlline.py and everything seems fine > without seeing CPU running hot. 
>
> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
> +------------+
> |  COUNT(1)  |
> +------------+
> |   50000    |
> +------------+
> 1 row selected (0.166 seconds)
> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
> +------------+
> |  COUNT(1)  |
> +------------+
> |   50000    |
> +------------+
> 1 row selected (0.167 seconds)
>
> Is there any way you could run a profiler to see where the CPU goes?
>
>
> On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>
>> Ahh, yes. Here is a pastebin for it:
>>
>> http://pastebin.com/w6mtabag
>>
>> thanks again,
>> -chris
>>
>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>
>>> Hi Chris,
>>>
>>> Attachments are filtered out by the mail server. Can you pastebin it
>>> some place?
>>>
>>> Thanks,
>>> Nick
>>>
>>>
>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We set the HBase RegionServer Handler count to 10 (it appears to have
>>>> been set to 60 by Ambari during the install process). Now we have
>>>> narrowed down what causes the CPU to increase and have some detailed
>>>> logs:
>>>>
>>>> If we connect using sqlline.py and execute a select that selects one
>>>> row using the primary key, no increase in CPU is observed and the
>>>> number of RPC threads in a RUNNABLE state remains the same.
>>>>
>>>> If we execute a select that scans the table, such as "select count(*)
>>>> from TABLE", or where the "where" clause only limits on
>>>> non-primary-key attributes, then the number of RUNNABLE
>>>> RpcServer.handler threads increases and the CPU utilization of the
>>>> regionserver increases by ~105%.
>>>>
>>>> Disconnecting the client does not have an effect; the
>>>> RpcServer.handler thread is left RUNNABLE and the CPU stays at the
>>>> higher usage.
>>>>
>>>> Checking the Web Console for the regionserver just shows 10
>>>> RpcServer.reader tasks, all in a WAITING state; no other monitored
>>>> tasks are happening. The regionserver has a Max Heap of 10G and a
>>>> Used Heap of 445.2M.
>>>>
>>>> I've attached the regionserver log with IPC debug logging turned on
>>>> right when one of the Phoenix statements is executed (this statement
>>>> actually used up the last available handler).
>>>>
>>>> thanks,
>>>> -chris
>>>>
>>>>
>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>
>>>>> From the stack, it seems you increased the default rpc handler
>>>>> number to about 60. All handlers are serving Get requests (you can
>>>>> search for
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>
>>>>> You can check why there are so many get requests by adding some log
>>>>> info or enabling hbase rpc trace. I guess if you decrease the number
>>>>> of rpc handlers per region server, it will mitigate your current
>>>>> issue.
>>>>>
>>>>>
>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>
>>>>>> We have hit a problem with Phoenix and regionserver CPU usage
>>>>>> spiking up to use all available CPU and becoming unresponsive.
>>>>>>
>>>>>> After HDP 2.1 was released we set up a 4-compute-node cluster
>>>>>> (with 3 VMWare "master" nodes) to test out Phoenix on it.
>>>>>> It is a plain Ambari 1.5/HDP 2.1 install and we added the HDP
>>>>>> Phoenix RPM release and hand-linked the jar files into the hadoop
>>>>>> lib. Everything was going well and we were able to load ~30k
>>>>>> records into several tables. What happened was that after about
>>>>>> 3-4 days of being up, the regionservers became unresponsive and
>>>>>> started to use most of the available CPU (12-core boxes). Nothing
>>>>>> terribly informative was in the logs (initially we saw some flush
>>>>>> messages that seemed excessive, but that was not all of the time,
>>>>>> and we changed back to the standard HBase WAL codec). We are able
>>>>>> to kill the unresponsive regionservers and then restart them; the
>>>>>> cluster will be fine for a day or so but will start to lock up
>>>>>> again.
>>>>>>
>>>>>> We've dropped the entire HBase and ZooKeeper information and
>>>>>> started from scratch, but that has not helped.
>>>>>>
>>>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>>>> report of a locked-up regionserver in hopes that someone can shed
>>>>>> some light.
>>>>>>
>>>>>> thanks,
>>>>>> -chris
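PS: for anyone trying to reproduce this, the two statement shapes described above (a single-row select on the primary key versus a full-table count) can be driven from a small JDBC client like the sketch below. This is only an illustrative sketch, not our actual code: the ZooKeeper quorum, the MYTABLE table, and the PK column are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of the two query shapes discussed in the thread, via the Phoenix
// JDBC driver. "localhost" (the ZooKeeper quorum), MYTABLE and PK are
// placeholders to replace with your own connection string and schema.
public class QueryShapes {
    public static void main(String[] args) throws Exception {
        // Explicit driver load; with a JDBC 4 driver on the classpath this
        // is usually optional.
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");

        // 1) Point lookup on the primary key: this is the case where we see
        //    no CPU increase and no extra RUNNABLE handler threads.
        PreparedStatement ps =
            conn.prepareStatement("SELECT * FROM MYTABLE WHERE PK = ?");
        ps.setString(1, "some-key");
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString("PK"));
        }
        rs.close();
        ps.close();

        // 2) Full-table aggregate: this is the kind of statement after which
        //    handler threads stay RUNNABLE and CPU stays high for us.
        Statement st = conn.createStatement();
        ResultSet count = st.executeQuery("SELECT COUNT(*) FROM MYTABLE");
        if (count.next()) {
            System.out.println("rows = " + count.getLong(1));
        }
        count.close();
        st.close();

        conn.close();
    }
}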