Only the master nodes (HMaster, ZK and NN) are VMs. The datanodes/regionservers with the stuck processes are all bare metal.
-chris

On May 20, 2014, at 11:52 AM, alex kamil <alex.ka...@gmail.com> wrote:

> > "with 3 VMWare "master" nodes"
>
> can you try running hbase/phoenix on physical nodes instead of using
> virtual machines
>
>
> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>
>> Sorry to follow up on my own message, but I was wondering if anyone had
>> any ideas? Normal non-Phoenix scans don't cause this symptom, but it
>> appears right after a select * on the exact same table.
>>
>> If we export the table and then re-import it into a new table, the new
>> table doesn't exhibit these symptoms, same as if we use an upsert..select
>> to do a copy. It seems something happens to the last region to cause this,
>> but it is not directly data dependent. Moving the region to another
>> regionserver doesn't have any effect - it just moves where the problem
>> happens. Major compactions get hung up by the running threads, as they
>> probably hold a lock.
>>
>> I've run the hfile tool on the final region and nothing seems awry.
>>
>> Figuring this out will allow this project to continue; as of now it is
>> hung up on this issue.
>>
>> thank you,
>> -chris
>>
>>
>> On May 15, 2014, at 8:47 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>
>>> I did some "poor man's" profiling with multiple jstack runs and came up
>>> with where the RpcServer.handler threads appear to be stuck:
>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>>> That is the deepest method common to all of the traces, at either
>>> HStore.java:1712 or HStore.java:1722. Here are two example traces for a
>>> thread (which has been running for the last couple of hours):
>>>
>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>    java.lang.Thread.State: RUNNABLE
>>>    at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>>    at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>>    at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>>    at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>    at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>    at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>    at java.lang.Thread.run(Thread.java:744)
>>>
>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>    java.lang.Thread.State: RUNNABLE
>>>    at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>>    at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>>    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>>    at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>>    at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>    at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>    at java.lang.Thread.run(Thread.java:744)
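(A minimal sketch of that kind of jstack sampling, in case anyone wants to reproduce it; the PID lookup and output paths below are illustrative, not the exact commands run on this cluster:)

  # take 10 jstack samples of the regionserver JVM, 5 seconds apart
  RS_PID=$(pgrep -f proc_regionserver | head -1)
  for i in $(seq 1 10); do
    jstack "$RS_PID" > /tmp/rs-jstack.$i
    sleep 5
  done
  # tally which HBase frames the RpcServer.handler threads sit in across samples
  grep -h -A 20 'RpcServer.handler' /tmp/rs-jstack.* \
    | grep 'at org.apache.hadoop.hbase' | sort | uniq -c | sort -rn | head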
>>>
>>>
>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>
>>>> Hey Chris,
>>>>
>>>> I used the performance.py tool, which created a table with 50K rows, ran
>>>> the following query from sqlline.py, and everything seems fine without
>>>> seeing the CPU run hot.
>>>>
>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>> +------------+
>>>> |  COUNT(1)  |
>>>> +------------+
>>>> | 50000      |
>>>> +------------+
>>>> 1 row selected (0.166 seconds)
>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>> +------------+
>>>> |  COUNT(1)  |
>>>> +------------+
>>>> | 50000      |
>>>> +------------+
>>>> 1 row selected (0.167 seconds)
>>>>
>>>> Is there any way you could run a profiler to see where the CPU goes?
>>>>
>>>>
>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>
>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>
>>>>> http://pastebin.com/w6mtabag
>>>>>
>>>>> thanks again,
>>>>> -chris
>>>>>
>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>>
>>>>>> Hi Chris,
>>>>>>
>>>>>> Attachments are filtered out by the mail server. Can you pastebin it
>>>>>> some place?
>>>>>>
>>>>>> Thanks,
>>>>>> Nick
>>>>>>
>>>>>>
>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We set the HBase RegionServer Handler count to 10 (it appears to have
>>>>>>> been set to 60 by Ambari during the install process). Now we have
>>>>>>> narrowed down what causes the CPU to increase and have some detailed
>>>>>>> logs:
>>>>>>>
>>>>>>> If we connect using sqlline.py and execute a select that fetches one
>>>>>>> row by the primary key, no increase in CPU is observed and the number
>>>>>>> of RPC threads in a RUNNABLE state remains the same.
>>>>>>>
>>>>>>> If we execute a select that scans the table, such as "select count(*)
>>>>>>> from TABLE", or whose "where" clause only limits on non-primary-key
>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler threads
>>>>>>> increases and the CPU utilization of the regionserver increases by
>>>>>>> ~105%.
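(To make the two cases concrete, the statement shapes are roughly the following; the table and column names here are illustrative, not the real schema:)

  -- point lookup on the primary key: no change in CPU or in RUNNABLE handlers
  SELECT * FROM SAMPLE_TABLE WHERE ID = 'row-000123';

  -- full scan, or a filter on a non-primary-key column: one more handler
  -- goes RUNNABLE per statement and the regionserver CPU climbs
  SELECT COUNT(*) FROM SAMPLE_TABLE;
  SELECT * FROM SAMPLE_TABLE WHERE STATUS = 'active';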
>>>>>>>
>>>>>>> Disconnecting the client does not have an effect; the
>>>>>>> RpcServer.handler thread is left RUNNABLE and the CPU stays at the
>>>>>>> higher usage.
>>>>>>>
>>>>>>> Checking the Web Console for the regionserver just shows 10
>>>>>>> RpcServer.reader tasks, all in a WAITING state; no other monitored
>>>>>>> tasks are happening. The regionserver has a Max Heap of 10G and a
>>>>>>> Used Heap of 445.2M.
>>>>>>>
>>>>>>> I've attached the regionserver log with IPC debug logging turned on,
>>>>>>> captured right when one of the Phoenix statements is executed (this
>>>>>>> statement actually used up the last available handler).
>>>>>>>
>>>>>>> thanks,
>>>>>>> -chris
>>>>>>>
>>>>>>>
>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>>
>>>>>>>> From the stack, it seems you increased the default rpc handler
>>>>>>>> number to about 60. All handlers are serving Get requests (you can
>>>>>>>> search for
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>>
>>>>>>>> You can check why there are so many get requests by adding some log
>>>>>>>> info or enabling hbase rpc trace. I guess if you decrease the number
>>>>>>>> of rpc handlers per region server, it will mitigate your current
>>>>>>>> issue.
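(For reference, the two suggestions above correspond roughly to the following knobs; the logger name and values are illustrative of what was tried, not exact excerpts from this cluster's config:)

  # log4j.properties on the regionserver: trace the RPC layer seen in the stacks
  log4j.logger.org.apache.hadoop.hbase.ipc=TRACE

  # hbase-site.xml: lower the handler count from the Ambari-set 60
  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>10</value>
  </property>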
>>>>>>>>
>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>>
>>>>>>>>> We have hit a problem with Phoenix: regionserver CPU usage spikes
>>>>>>>>> up to use all available CPU and the regionservers become
>>>>>>>>> unresponsive.
>>>>>>>>>
>>>>>>>>> After HDP 2.1 was released we set up a 4-compute-node cluster (with
>>>>>>>>> 3 VMWare "master" nodes) to test out Phoenix on it. It is a plain
>>>>>>>>> Ambari 1.5/HDP 2.1 install; we added the HDP Phoenix RPM release
>>>>>>>>> and hand-linked the jar files into the hadoop lib. Everything was
>>>>>>>>> going well and we were able to load ~30k records into several
>>>>>>>>> tables. What happened was that after about 3-4 days of being up,
>>>>>>>>> the regionservers became unresponsive and started to use most of
>>>>>>>>> the available CPU (12-core boxes). Nothing terribly informative was
>>>>>>>>> in the logs (initially we saw some flush messages that seemed
>>>>>>>>> excessive, but that was not all of the time, and we changed back to
>>>>>>>>> the standard HBase WAL codec). We are able to kill the unresponsive
>>>>>>>>> regionservers and then restart them; the cluster will be fine for a
>>>>>>>>> day or so but will start to lock up again.
>>>>>>>>>
>>>>>>>>> We've dropped the entire HBase and zookeeper information and
>>>>>>>>> started from scratch, but that has not helped.
>>>>>>>>>
>>>>>>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>>>>>>> report of a locked-up regionserver in hopes that someone can shed
>>>>>>>>> some light.
>>>>>>>>>
>>>>>>>>> thanks,
>>>>>>>>> -chris