I will look into whether a new, all bare metal cluster is feasible, but we have 5 other similarly configured clusters with no trouble. The only issue we see is that when Phoenix does a full scan of certain tables, a handler gets stuck in org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile after the scan has returned all of its data.

thank you,
-chris
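For reference, the kind of full-table scan that triggers the symptom can be issued from a minimal Phoenix JDBC client. The sketch below is illustrative only and makes assumptions not taken from this thread: the ZooKeeper quorum, the table name MY_TABLE, and the exact query are placeholders for whatever your cluster and schema actually use.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PhoenixFullScanRepro {
        public static void main(String[] args) throws Exception {
            // The Phoenix client jar must be on the classpath.
            Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
            // Hypothetical ZooKeeper quorum; substitute your own.
            String url = "jdbc:phoenix:zk1,zk2,zk3:2181";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 // A full scan with no primary-key filter -- the query shape
                 // reported in this thread to leave handlers busy afterwards.
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM MY_TABLE")) {
                while (rs.next()) {
                    System.out.println("row count = " + rs.getLong(1));
                }
            }
            // After this returns, jstack the regionserver and look for
            // RpcServer.handler threads still RUNNABLE in
            // HStore.rowAtOrBeforeFromStoreFile.
        }
    }

Point queries on the primary key did not show the problem later in this thread, so the interesting case is the unfiltered aggregate above.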
On May 20, 2014, at 12:34 PM, alex kamil <alex.ka...@gmail.com> wrote:

> I'm just guessing, but maybe the region servers are stuck because they can't
> communicate with ZooKeeper or the HMaster, or maybe the datanodes can't talk
> to the namenode because of some VMware networking quirk and some internal
> thread is hanging - who knows. It would take 30 minutes to set up a bare
> metal cluster vs. days debugging this hybrid setup. I'd also suggest
> installing directly from the hbase/hadoop websites to eliminate additional
> variables.
>
> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>
>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>> datanodes/regionservers with the stuck processes are all bare metal.
>>
>> -chris
>>
>> On May 20, 2014, at 11:52 AM, alex kamil <alex.ka...@gmail.com> wrote:
>>
>>>> "with 3 VMWare "master" nodes"
>>>
>>> Can you try running hbase/phoenix on physical nodes instead of using
>>> virtual machines?
>>>
>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>
>>>> Sorry to follow up on my own message, but I was wondering if anyone had
>>>> any ideas? Normal non-Phoenix scans don't cause this symptom, but a
>>>> select * on the exact same table right afterwards will.
>>>>
>>>> If we export the table and then re-import it into a new table, the new
>>>> table doesn't exhibit these symptoms, same as if we use an upsert..select
>>>> to do a copy. It seems something happens to the last region to cause
>>>> this, but it is not directly data dependent. Moving the region to another
>>>> regionserver doesn't have any effect - it just moves where the problem
>>>> happens. Major compactions get hung up by the running threads, as they
>>>> probably hold a lock.
>>>>
>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>>
>>>> Figuring this out will allow this project to continue; as of now it is
>>>> hung up on this issue.
>>>>
>>>> thank you,
>>>> -chris
>>>>
>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>
>>>>> I did some "poor man's" profiling with multiple jstack runs and found
>>>>> where the RpcServer.handler threads appear to be stuck:
>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>>>>> That is the deepest method in all of the traces, either HStore.java:1712
>>>>> or HStore.java:1722.
>>>>> Here are two example traces for a thread (which has been running for the
>>>>> last couple of hours):
>>>>>
>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>     at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>>>>     at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>>
>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>     at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>>>>     at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>     at java.lang.Thread.run(Thread.java:744)
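The "poor man's profiling" quoted above comes down to taking several jstack snapshots and noting which frame the RpcServer.handler threads keep landing in. A rough sketch of that sampling loop follows; it assumes jstack is on the PATH and that the regionserver pid is passed as the first argument, and the sample count and interval are arbitrary placeholders.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class HandlerStackSampler {
        public static void main(String[] args) throws Exception {
            long pid = Long.parseLong(args[0]);   // regionserver pid, e.g. from `jps`
            for (int sample = 0; sample < 10; sample++) {
                Process p = new ProcessBuilder("jstack", Long.toString(pid)).start();
                boolean inHandler = false;
                try (BufferedReader r = new BufferedReader(
                        new InputStreamReader(p.getInputStream()))) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        // A thread entry starts with a quoted thread name;
                        // keep only RpcServer.handler threads.
                        if (line.startsWith("\"")) {
                            inHandler = line.contains("RpcServer.handler");
                        }
                        if (inHandler && (line.startsWith("\"")
                                || line.trim().startsWith("at "))) {
                            System.out.println("sample " + sample + ": " + line.trim());
                        }
                    }
                }
                p.waitFor();
                Thread.sleep(2000);   // space the samples out a bit
            }
        }
    }

If every sample keeps showing the same deepest frame (here rowAtOrBeforeFromStoreFile), that frame is where the handler is spending its time.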
>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>
>>>>>> Hey Chris,
>>>>>>
>>>>>> I used the performance.py tool, which created a table with 50K rows, ran
>>>>>> the following query from sqlline.py, and everything seems fine without
>>>>>> the CPU running hot.
>>>>>>
>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>> +------------+
>>>>>> |  COUNT(1)  |
>>>>>> +------------+
>>>>>> | 50000      |
>>>>>> +------------+
>>>>>> 1 row selected (0.166 seconds)
>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>> +------------+
>>>>>> |  COUNT(1)  |
>>>>>> +------------+
>>>>>> | 50000      |
>>>>>> +------------+
>>>>>> 1 row selected (0.167 seconds)
>>>>>>
>>>>>> Is there any way you could run a profiler to see where the CPU goes?
>>>>>>
>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>
>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>
>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>
>>>>>>> thanks again,
>>>>>>> -chris
>>>>>>>
>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Chris,
>>>>>>>>
>>>>>>>> Attachments are filtered out by the mail server. Can you pastebin it
>>>>>>>> some place?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We set the HBase regionserver handler count to 10 (it appears to have
>>>>>>>>> been set to 60 by Ambari during the install process). We have now
>>>>>>>>> narrowed down what causes the CPU to increase and have some detailed
>>>>>>>>> logs:
>>>>>>>>>
>>>>>>>>> If we connect using sqlline.py and execute a select that fetches one
>>>>>>>>> row using the primary key, no increase in CPU is observed and the
>>>>>>>>> number of RPC threads in a RUNNABLE state remains the same.
>>>>>>>>>
>>>>>>>>> If we execute a select that scans the table, such as "select count(*)
>>>>>>>>> from TABLE", or where the "where" clause only limits on
>>>>>>>>> non-primary-key attributes, then the number of RUNNABLE
>>>>>>>>> RpcServer.handler threads increases and the CPU utilization of the
>>>>>>>>> regionserver increases by ~105%.
>>>>>>>>>
>>>>>>>>> Disconnecting the client does not have an effect; the
>>>>>>>>> RpcServer.handler thread is left RUNNABLE and the CPU stays at the
>>>>>>>>> higher usage.
>>>>>>>>>
>>>>>>>>> Checking the web console for the regionserver just shows 10
>>>>>>>>> RpcServer.reader tasks, all in a WAITING state; no other monitored
>>>>>>>>> tasks are running. The regionserver has a max heap of 10G and a used
>>>>>>>>> heap of 445.2M.
>>>>>>>>>
>>>>>>>>> I've attached the regionserver log with IPC debug logging turned on
>>>>>>>>> right when one of the Phoenix statements is executed (this statement
>>>>>>>>> actually used up the last available handler).
>>>>>>>>>
>>>>>>>>> thanks,
>>>>>>>>> -chris
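One way to quantify the "RUNNABLE RpcServer.handler threads increase" observation above is to poll the regionserver's thread states over JMX instead of eyeballing full jstack dumps. The sketch below is only an illustration: it assumes the regionserver JVM was started with remote JMX enabled, and the host and port in the service URL are placeholders, not values taken from this thread.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class RunnableHandlerCount {
        public static void main(String[] args) throws Exception {
            // Placeholder host/port; requires com.sun.management.jmxremote
            // options on the regionserver JVM.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://regionserver-host:10102/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
                ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
                int runnableHandlers = 0;
                for (ThreadInfo info
                        : threads.getThreadInfo(threads.getAllThreadIds(), 8)) {
                    if (info != null
                            && info.getThreadName().startsWith("RpcServer.handler")
                            && info.getThreadState() == Thread.State.RUNNABLE) {
                        runnableHandlers++;
                        // Print the top frame so stuck handlers are obvious.
                        System.out.println(info.getThreadName() + " -> "
                            + (info.getStackTrace().length > 0
                                ? info.getStackTrace()[0] : "(no frames)"));
                    }
                }
                System.out.println("RUNNABLE RpcServer.handler threads: "
                    + runnableHandlers);
            } finally {
                jmxc.close();
            }
        }
    }

Running this before and after a full-scan query would make the handler counts directly comparable.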
>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>>>>
>>>>>>>>>> From the stack, it seems you increased the default RPC handler number
>>>>>>>>>> to about 60. All handlers are serving Get requests (you can search
>>>>>>>>>> for org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>>>>
>>>>>>>>>> You can check why there are so many Get requests by adding some log
>>>>>>>>>> info or enabling HBase RPC trace. I guess that if you decrease the
>>>>>>>>>> number of RPC handlers per region server, it will mitigate your
>>>>>>>>>> current issue.
>>>>>>>>>>
>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> We have hit a problem with Phoenix: regionserver CPU usage spikes up
>>>>>>>>>>> to use all available CPU and the regionservers become unresponsive.
>>>>>>>>>>>
>>>>>>>>>>> After HDP 2.1 was released we set up a 4 compute node cluster (with
>>>>>>>>>>> 3 VMWare "master" nodes) to test out Phoenix on it. It is a plain
>>>>>>>>>>> Ambari 1.5/HDP 2.1 install; we added the HDP Phoenix RPM release and
>>>>>>>>>>> hand linked in the jar files to the hadoop lib. Everything was going
>>>>>>>>>>> well and we were able to load ~30k records into several tables. What
>>>>>>>>>>> happened was that after about 3-4 days of being up the regionservers
>>>>>>>>>>> became unresponsive and started to use most of the available CPU (12
>>>>>>>>>>> core boxes). Nothing terribly informative was in the logs (initially
>>>>>>>>>>> we saw some flush messages that seemed excessive, but that was not
>>>>>>>>>>> all of the time and we changed back to the standard HBase WAL
>>>>>>>>>>> codec). We are able to kill the unresponsive regionservers and then
>>>>>>>>>>> restart them; the cluster will be fine for a day or so but will
>>>>>>>>>>> start to lock up again.
>>>>>>>>>>>
>>>>>>>>>>> We've dropped the entire HBase and ZooKeeper information and started
>>>>>>>>>>> from scratch, but that has not helped.
>>>>>>>>>>>
>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>>>>>>>>> report of a locked-up regionserver in hopes that someone can shed
>>>>>>>>>>> some light.
>>>>>>>>>>>
>>>>>>>>>>> thanks,
>>>>>>>>>>> -chris