Altering the table to not use DATA_BLOCK_ENCODING fixed the problem. Just to verify that it was the fix, I altered it back to FAST_DIFF and the problem re-appeared. Thank you!
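For reference, the change that fixed it amounts to roughly the following HBase shell commands; this is only a sketch, the table name is a placeholder, and the column family name '0' is taken from the default Phoenix family shown in the scan logs further down, so it may differ for other tables:

    # run from `hbase shell`: take the table offline, switch the family's encoding, then recompact
    disable 'MY_TABLE'
    alter 'MY_TABLE', { NAME => '0', DATA_BLOCK_ENCODING => 'NONE' }
    enable 'MY_TABLE'
    major_compact 'MY_TABLE'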
Adding the changes to the client hbase-site.xml did not help the issue, but it may be that I was not sourcing it correctly in sqlline. I made sure it was in the classpath being passed to java via -cp when launching it.

thanks again,
-chris

On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:

Hey Chris,

Sorry about the trace option; the right option is below. Where you were able to turn on DEBUG before, you just need to change it to TRACE:

log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE

Do you see the two read requests burning one of your CPUs to 100%? From the call stacks you posted today and before, it seems there might be an infinite loop inside HTable.getRowOrBefore(), which is deprecated and only supposed to work for the meta region (a point brought up by Enis). Fortunately there is only one place in Phoenix that calls this function, so could you try the following?

Set the following configuration in the client-side hbase-site.xml to disable stats updates:

phoenix.query.statsUpdateFrequency = 2147483647
phoenix.query.maxStatsAge = 0

After the configuration changes, could you verify whether the CPU is still spiking for the same query?

You can also do the following: alter your table's column family data block encoding to NONE (DATA_BLOCK_ENCODING => 'NONE') and do a major compaction after that through the HBase client.

We suspect that the combination of your data, FAST_DIFF data block encoding, and the deprecated HTable.getRowOrBefore() causes the issue.
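For reference, a minimal sketch of that client-side override as it would look in hbase-site.xml; the property names and values are exactly the ones suggested above, and the assumption is that this file sits on the sqlline client's classpath for it to take effect:

    <configuration>
      <!-- effectively disable Phoenix client stats updates, as suggested above -->
      <property>
        <name>phoenix.query.statsUpdateFrequency</name>
        <value>2147483647</value>
      </property>
      <property>
        <name>phoenix.query.maxStatsAge</name>
        <value>0</value>
      </property>
    </configuration>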
On 5/20/14 3:03 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:

Hi Jeffrey,

Thank you for taking a look. I added log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the Custom log4j.properties in Ambari and did a rolling restart of the regionservers (and verified it was pushed out to the nodes in /etc/hbase/conf/log4j.properties), then issued a "select count(*)" on the table. Looking at the logs I don't see any TRACE messages. The only two log messages that happened when issuing the select are:

2014-05-20 16:43:53,609 INFO [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
2014-05-20 16:43:54,309 INFO [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608 rows for ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}

The metrics for the regionserver show only 2 total read requests since it started, and 0 per second now. Those two requests are for the table I did the count on. The regionserver's UI says it has no active RPC calls. It currently has one stuck handler from running that command:

"RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000 nid=0x2582 runnable [0x00007fbd5ba6c000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2888)
        at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2880)
        at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
        at java.util.TreeMap.getEntry(TreeMap.java:340)
        at java.util.TreeMap.get(TreeMap.java:273)
        at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDeleted(GetClosestRowBeforeTracker.java:128)
        at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandidate(GetClosestRowBeforeTracker.java:107)
        at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(GetClosestRowBeforeTracker.java:202)
        at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1772)
        at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
        at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
        at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
        at java.lang.Thread.run(Thread.java:744)

It also has one listener running:

"RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000 nid=0x2579 runnable [0x00007fbd5c376000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
        - locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
        - locked <0x000000059ae289a0> (a java.util.Collections$UnmodifiableSet)
        - locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)

If there is anything else you need, please let me know.

thanks again,
-chris

On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:

I rechecked your RPC log and found that you didn't turn on the RPC trace level. You can set the following so that we can see the detailed requests & responses:

log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE

If you didn't see timeout exceptions in your region server logs, it means that something keeps issuing lots of Get requests. Have you used a secondary index for this table?
If that's the case, please disable/drop it to see if that helps.

On 5/20/14 1:21 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:

I will see how feasible a new, all bare metal cluster is, but we have 5 other similarly configured clusters with no troubles. We only have an issue when Phoenix does a full scan of some tables: a handler gets stuck in org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile after it has returned all of its data.

thank you,
-chris

On May 20, 2014, at 12:34 PM, alex kamil <alex.ka...@gmail.com> wrote:

I'm just guessing, but maybe the region servers are stuck because they can't communicate with ZooKeeper or the HMaster, or maybe the datanodes can't talk to the namenode because of some VMware networking quirk and some internal thread is hanging; who knows. It would take 30 minutes to set up a bare metal cluster vs. days debugging this hybrid setup. I'd also suggest installing directly from the hbase/hadoop websites to eliminate additional variables.

On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:

Only the master nodes (HMaster, ZK and NN) are VMs. The datanodes/regionservers with the stuck processes are all bare metal.

-chris

On May 20, 2014, at 11:52 AM, alex kamil <alex.ka...@gmail.com> wrote:

> "with 3 VMWare "master" nodes"

Can you try running hbase/phoenix on physical nodes instead of using virtual machines?

On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:

Sorry to follow up on my own message, but I was wondering if anyone had any ideas? Normal non-Phoenix scans don't cause this symptom, but a select * on the exact same table right afterwards will.

If we export the table and then re-import it into a new table, the new table doesn't exhibit these symptoms, same as if we use an upsert..select to do a copy. It seems something happens to the last region to cause this, but it is not directly data dependent. Moving the region to another regionserver doesn't have any effect; it just moves where the problem happens. Major compactions get hung up by the running threads, as they probably hold a lock.

I've run the hfile tool on the final region and nothing seems awry.

Figuring this out will allow this project to continue; as of now it is hung up on this issue.

thank you,
-chris

On May 15, 2014, at 8:47 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:

I did some "poor man's" profiling with multiple jstack runs and came up with where the RpcServer.handler appears to be stuck: org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile. That is the deepest method common to all of the traces, at either HStore.java:1712 or HStore.java:1722.
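As a rough sketch, that kind of repeated jstack sampling can be scripted along these lines; the regionserver PID, sample count, and interval below are placeholder values, not the ones actually used here:

    # take a handful of thread dumps and keep the RPC handler frames for comparison
    RS_PID=12345                          # pid of the stuck regionserver (placeholder)
    for i in $(seq 1 10); do
      jstack "$RS_PID" | grep -A 25 'RpcServer.handler' >> handler-stacks.txt
      sleep 5                             # sampling interval (placeholder)
    done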
Here are two example traces for a thread (which has been running for the last couple hours):

"RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
        at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
        at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
        at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
        at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
        at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
        at java.lang.Thread.run(Thread.java:744)
"RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
        at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
        at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
        at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
        at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
        at java.lang.Thread.run(Thread.java:744)

On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:

Hey Chris,

I used the performance.py tool, which created a table with 50K rows, and ran the following query from sqlline.py; everything seems fine and I don't see the CPU running hot.
0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
+------------+
|  COUNT(1)  |
+------------+
| 50000      |
+------------+
1 row selected (0.166 seconds)
0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
+------------+
|  COUNT(1)  |
+------------+
| 50000      |
+------------+
1 row selected (0.167 seconds)

Is there any way you could run a profiler to see where the CPU goes?

On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:

Ahh, yes. Here is a pastebin for it:

http://pastebin.com/w6mtabag

thanks again,
-chris

On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:

Hi Chris,

Attachments are filtered out by the mail server. Can you pastebin it someplace?

Thanks,
Nick

On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:

Hello,

We set the HBase RegionServer handler count to 10 (it appears to have been set to 60 by Ambari during the install process). Now we have narrowed down what causes the CPU to increase and have some detailed logs:

If we connect using sqlline.py and execute a select that fetches one row using the primary key, no increase in CPU is observed and the number of RPC threads in a RUNNABLE state remains the same.

If we execute a select that scans the table, such as "select count(*) from TABLE", or whose "where" clause only limits on non-primary-key attributes, then the number of RUNNABLE RpcServer.handler threads increases and the CPU utilization of the regionserver increases by ~105%.

Disconnecting the client does not have an effect: the RpcServer.handler thread is left RUNNABLE and the CPU stays at the higher usage.

Checking the web console for the regionserver just shows 10 RpcServer.reader tasks, all in a WAITING state; no other monitored tasks are happening. The regionserver has a max heap of 10G and a used heap of 445.2M.

I've attached the regionserver log with IPC debug logging turned on right when one of the Phoenix statements is executed (this statement actually used up the last available handler).
thanks,
-chris

On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:

From the stack, it seems you increased the default RPC handler number to about 60. All handlers are serving Get requests (you can search for org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).

You can check why there are so many Get requests by adding some log info or enabling HBase RPC tracing. I guess that if you decrease the number of RPC handlers per region server, it will mitigate your current issue.

On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:

We have hit a problem with Phoenix and regionserver CPU usage spiking up to use all available CPU and becoming unresponsive.

After HDP 2.1 was released we set up a 4-compute-node cluster (with 3 VMware "master" nodes) to test out Phoenix on it. It is a plain Ambari 1.5/HDP 2.1 install; we added the HDP Phoenix RPM release and hand-linked the jar files into the hadoop lib. Everything was going well and we were able to load ~30k records into several tables. What happened was that after about 3-4 days of being up the regionservers became unresponsive and started to use most of the available CPU (12-core boxes). Nothing terribly informative was in the logs (initially we saw some flush messages that seemed excessive, but that was not all of the time, and we changed back to the standard HBase WAL codec). We are able to kill the unresponsive regionservers and then restart them; the cluster will be fine for a day or so but will start to lock up again.

We've dropped the entire HBase and ZooKeeper information and started from scratch, but that has not helped.

James Taylor suggested I send this off here. I've attached a jstack report of a locked-up regionserver in hopes that someone can shed some light.
thanks,
-chris