I'll start building a patched version and try it out (now that it is working I'll have to wrangle the cluster away from the users). I cannot send the data; it is private data.

thanks,
-chris
On May 21, 2014, at 3:04 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>
> Thanks for reporting the status back. Would you mind trying the patch on
> phoenix-992, or attaching data that reproduces the issue, so that we can
> verify the current fix? Thanks.
>
> On 5/21/14 2:06 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>
>> Altering the table to not use a DATA_BLOCK_ENCODING fixed the problem.
>> Just to verify that it was the fix, I altered it back to FAST_DIFF and
>> the problem re-appeared. Thank you!
>>
>> Adding the changes to the client hbase-site.xml did not help the issue,
>> but it may be that I was not sourcing it correctly in sqlline. I made
>> sure it was in the classpath passed to java via -cp when launching it.
>>
>> thanks again,
>> -chris
>>
>> On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>
>>> Hey Chris,
>>>
>>> Sorry about the trace option; the right option is below. Where you
>>> previously turned on DEBUG, just change it to TRACE:
>>>
>>> log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE
>>>
>>> Do you see the two read requests burn one of your CPUs to 100%? From
>>> the call stacks you posted today and before, it seems there might be an
>>> infinite loop inside HTable.getRowOrBefore(), which is deprecated and
>>> supposed to work only for the meta region (brought up by Enis).
>>> Fortunately there is only one place in Phoenix that calls this
>>> function, so could you try the following things?
>>>
>>> Set the following configuration in the client-side hbase-site.xml to
>>> disable stats updates:
>>>
>>> phoenix.query.statsUpdateFrequency = 2147483647
>>> phoenix.query.maxStatsAge = 0
>>>
>>> After the configuration changes, could you verify whether the CPU still
>>> spikes for the same query?
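(If sourcing the client hbase-site.xml is in doubt, those two properties can also be handed straight to the Phoenix JDBC connection. A minimal sketch, assuming a Phoenix 4.x client; the class name and ZooKeeper host "zkhost" are hypothetical, and properties passed this way are merged over whatever hbase-site.xml is on the classpath:)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class DisableStatsUpdates {  // hypothetical class and host names
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
            Properties props = new Properties();
            // The two client-side settings suggested above to disable
            // stats updates.
            props.setProperty("phoenix.query.statsUpdateFrequency", "2147483647");
            props.setProperty("phoenix.query.maxStatsAge", "0");
            try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:zkhost", props)) {
                // Re-run the problem query here, e.g. SELECT COUNT(*) FROM T.
            }
        }
    }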
>>>
>>> You can also do the following: alter your table's column family data
>>> block encoding to NONE (DATA_BLOCK_ENCODING => 'NONE') and run a major
>>> compaction after that through the HBase client.
>>>
>>> We suspect that the combination of your data, the FAST_DIFF data block
>>> encoding and the deprecated HTable.getRowOrBefore() causes the issue.
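(A minimal sketch of that alter-plus-major-compaction through the 0.98-era Java admin API; the table name "MY_TABLE" is hypothetical, and the column family "0" is the one seen in the scan logs below. The HBase shell's alter and major_compact commands achieve the same:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DropBlockEncoding {  // hypothetical table name
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            TableName table = TableName.valueOf("MY_TABLE");
            // Switch the column family's encoding from FAST_DIFF to NONE.
            HColumnDescriptor cf =
                admin.getTableDescriptor(table).getFamily(Bytes.toBytes("0"));
            cf.setDataBlockEncoding(DataBlockEncoding.NONE);
            admin.disableTable(table);   // offline schema change, to be safe
            admin.modifyColumn(table, cf);
            admin.enableTable(table);
            // Rewrite the existing HFiles without the old encoding.
            admin.majorCompact("MY_TABLE");
            admin.close();
        }
    }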
>>>
>>> On 5/20/14 3:03 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>
>>>> Hi Jeffrey,
>>>>
>>>> Thank you for taking a look. I added
>>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the
>>>> Custom log4j.properties in Ambari and did a rolling restart of the
>>>> regionservers (and verified it was pushed out to the nodes in
>>>> /etc/hbase/conf/log4j.properties), then issued a "select count(*)" on
>>>> the table. Looking at the logs I don't see any TRACE messages. The
>>>> only two log messages that appeared when issuing the select are:
>>>>
>>>> 2014-05-20 16:43:53,609 INFO [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
>>>> 2014-05-20 16:43:54,309 INFO [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608 rows for ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
>>>>
>>>> The metrics for the regionserver show only 2 total read requests since
>>>> it started and 0 per second now. Those two requests are for the table
>>>> I did the count on. The regionserver's UI says it has no active RPC
>>>> calls. It currently has one stuck handler from running that command:
>>>>
>>>> "RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000 nid=0x2582 runnable [0x00007fbd5ba6c000]
>>>>    java.lang.Thread.State: RUNNABLE
>>>>    at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2888)
>>>>    at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2880)
>>>>    at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
>>>>    at java.util.TreeMap.getEntry(TreeMap.java:340)
>>>>    at java.util.TreeMap.get(TreeMap.java:273)
>>>>    at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDeleted(GetClosestRowBeforeTracker.java:128)
>>>>    at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandidate(GetClosestRowBeforeTracker.java:107)
>>>>    at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(GetClosestRowBeforeTracker.java:202)
>>>>    at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1772)
>>>>    at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>>    at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>    at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>    at java.lang.Thread.run(Thread.java:744)
>>>>
>>>> It also has one listener running:
>>>>
>>>> "RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000 nid=0x2579 runnable [0x00007fbd5c376000]
>>>>    java.lang.Thread.State: RUNNABLE
>>>>    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>>    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>>>    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>>>>    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>>>>    - locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
>>>>    - locked <0x000000059ae289a0> (a java.util.Collections$UnmodifiableSet)
>>>>    - locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
>>>>    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>>>>    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>>>>    at org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
>>>>
>>>> If there is anything else you need, please let me know.
>>>>
>>>> thanks again,
>>>> -chris
>>>>
>>>> On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>
>>>>> I rechecked your RPC log and found that you didn't turn on the RPC
>>>>> trace level. You can set the following so that we can see the
>>>>> detailed request & response:
>>>>>
>>>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
>>>>>
>>>>> If you didn't see a timeout exception in your region server logs, it
>>>>> means something keeps issuing lots of Get requests. Have you used a
>>>>> secondary index on this table? If that's the case, please
>>>>> disable/drop it to see if that helps.
>>>>>
>>>>> On 5/20/14 1:21 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>
>>>>>> I will see how possible a new, all bare metal cluster is, but we
>>>>>> have 5 other similarly configured clusters with no troubles. We only
>>>>>> have an issue when Phoenix does a full scan of some tables: a
>>>>>> handler gets stuck in
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>>> after it has returned all of its data.
>>>>>>
>>>>>> thank you,
>>>>>> -chris
>>>>>>
>>>>>> On May 20, 2014, at 12:34 PM, alex kamil <alex.ka...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm just guessing, but maybe the region servers are stuck because
>>>>>>> they can't communicate with ZooKeeper or the HMaster, or maybe the
>>>>>>> datanodes can't talk to the namenode because of some VMware
>>>>>>> networking quirk and some internal thread is hanging; who knows. It
>>>>>>> would take 30 minutes to set up a bare metal cluster vs. days
>>>>>>> debugging this hybrid setup. I'd also suggest installing directly
>>>>>>> from the hbase/hadoop websites to eliminate additional variables.
>>>>>>>
>>>>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>
>>>>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>>>>>>> datanodes/regionservers with the stuck processes are all bare
>>>>>>>> metal.
>>>>>>>>
>>>>>>>> -chris
>>>>>>>>
>>>>>>>> On May 20, 2014, at 11:52 AM, alex kamil <alex.ka...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>> "with 3 VMWare "master" nodes"
>>>>>>>>>
>>>>>>>>> can you try running hbase/phoenix on physical nodes instead of
>>>>>>>>> using virtual machines?
>>>>>>>>>
>>>>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry to follow up on my own message, but I was wondering if
>>>>>>>>>> anyone had any ideas? Normal non-Phoenix scans don't cause this
>>>>>>>>>> symptom, but a select * right afterwards on the exact same table
>>>>>>>>>> will.
>>>>>>>>>>
>>>>>>>>>> If we export the table and then re-import it into a new table,
>>>>>>>>>> the new table doesn't exhibit these symptoms, same as if we use
>>>>>>>>>> an upsert..select to do a copy. It seems something happens to
>>>>>>>>>> the last region to cause this, but it is not directly data
>>>>>>>>>> dependent. Moving the region to another regionserver doesn't
>>>>>>>>>> have any effect - it just moves where the problem happens. Major
>>>>>>>>>> compactions get hung up by the running threads, as they probably
>>>>>>>>>> hold a lock.
>>>>>>>>>>
>>>>>>>>>> I've run the hfile tool on the final region and nothing seems
>>>>>>>>>> awry.
>>>>>>>>>>
>>>>>>>>>> Figuring this out will allow this project to continue; as of now
>>>>>>>>>> it is hung up on this issue.
>>>>>>>>>>
>>>>>>>>>> thank you,
>>>>>>>>>> -chris
>>>>>>>>>>
>>>>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I did some "poor man's" profiling with multiple jstacks and
>>>>>>>>>>> came up with where the RpcServer.handler appears to be stuck:
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>>>>>>>>>>> That is the deepest method in all of the traces, either
>>>>>>>>>>> HStore.java:1712 or HStore.java:1722.
>>>>>>>>>>> Here are two example traces for a thread (which has been
>>>>>>>>>>> running for the last couple hours):
>>>>>>>>>>>
>>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>>>>>>    at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>>>>>>>>>>    at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>>>>>>    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>>>>>>    at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>>>
>>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>>>>>>    at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>>>>>>>>>>    at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>>>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>>>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>>>>>>    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>>>>>>    at java.lang.Thread.run(Thread.java:744)
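(Both traces bottom out under HRegion.getClosestRowBefore(), so one hedged way to test the theory outside Phoenix is to call the deprecated HTable.getRowOrBefore() directly against the affected table. A minimal sketch against the 0.98-era client API; the table name "MY_TABLE" and row key are hypothetical, while the column family "0" comes from the scan logs above. If the FAST_DIFF/getRowOrBefore combination is the culprit, this single call should pin a handler:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetRowOrBeforeRepro {  // hypothetical table name and row key
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, TableName.valueOf("MY_TABLE"));
            // Deprecated 0.94-era API (removed in HBase 1.0), intended only
            // for meta lookups, but it is the call Phoenix issues here.
            Result r = table.getRowOrBefore(Bytes.toBytes("some_row_key"),
                                            Bytes.toBytes("0"));
            System.out.println(r);
            table.close();
        }
    }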
>>>>>>>>>>>
>>>>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Chris,
>>>>>>>>>>>>
>>>>>>>>>>>> I used the performance.py tool, which created a table with 50K
>>>>>>>>>>>> rows, ran the following query from sqlline.py, and everything
>>>>>>>>>>>> seems fine without the CPU running hot.
>>>>>>>>>>>>
>>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> | 50000      |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> | 50000      |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>>>>>>
>>>>>>>>>>>> Is there any way you could run a profiler to see where the CPU
>>>>>>>>>>>> goes?
>>>>>>>>>>>>
>>>>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks again,
>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>
>>>>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Chris,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Attachments are filtered out by the mail server. Can you
>>>>>>>>>>>>>> pastebin it some place?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We set the HBase RegionServer handler count to 10 (it
>>>>>>>>>>>>>>> appears to have been set to 60 by Ambari during the install
>>>>>>>>>>>>>>> process). Now we have narrowed down what causes the CPU to
>>>>>>>>>>>>>>> increase and have some detailed logs:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If we connect using sqlline.py and execute a select that
>>>>>>>>>>>>>>> selects one row using the primary key, no increase in CPU is
>>>>>>>>>>>>>>> observed and the number of RPC threads in a RUNNABLE state
>>>>>>>>>>>>>>> remains the same.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If we execute a select that scans the table, such as "select
>>>>>>>>>>>>>>> count(*) from TABLE", or one whose "where" clause only
>>>>>>>>>>>>>>> limits on non-primary-key attributes, then the number of
>>>>>>>>>>>>>>> RUNNABLE RpcServer.handler threads increases and the CPU
>>>>>>>>>>>>>>> utilization of the regionserver increases by ~105%.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Disconnecting the client has no effect: the
>>>>>>>>>>>>>>> RpcServer.handler thread is left RUNNABLE and the CPU stays
>>>>>>>>>>>>>>> at the higher usage.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Checking the web console for the regionserver just shows 10
>>>>>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state; no other
>>>>>>>>>>>>>>> monitored tasks are happening. The regionserver has a max
>>>>>>>>>>>>>>> heap of 10G and a used heap of 445.2M.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging
>>>>>>>>>>>>>>> turned on right when one of the Phoenix statements is
>>>>>>>>>>>>>>> executed (this statement actually used up the last available
>>>>>>>>>>>>>>> handler).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From the stack, it seems you increased the default RPC
>>>>>>>>>>>>>>>> handler count to about 60. All handlers are serving Get
>>>>>>>>>>>>>>>> requests (you can search for
>>>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You can check why there are so many Get requests by adding
>>>>>>>>>>>>>>>> some log info or enabling hbase RPC trace. I guess that if
>>>>>>>>>>>>>>>> you decrease the number of RPC handlers per region server,
>>>>>>>>>>>>>>>> it will mitigate your current issue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We have hit a problem with Phoenix: regionserver CPU usage
>>>>>>>>>>>>>>>>> spikes up to use all available CPU and the regionservers
>>>>>>>>>>>>>>>>> become unresponsive.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> After HDP 2.1 was released we set up a 4 compute node
>>>>>>>>>>>>>>>>> cluster (with 3 VMWare "master" nodes) to test out
>>>>>>>>>>>>>>>>> Phoenix. It is a plain Ambari 1.5/HDP 2.1 install; we
>>>>>>>>>>>>>>>>> added the HDP Phoenix RPM release and hand linked the jar
>>>>>>>>>>>>>>>>> files into the hadoop lib. Everything was going well and
>>>>>>>>>>>>>>>>> we were able to load ~30k records into several tables.
>>>>>>>>>>>>>>>>> What happened was that after about 3-4 days of being up,
>>>>>>>>>>>>>>>>> the regionservers became unresponsive and started to use
>>>>>>>>>>>>>>>>> most of the available CPU (12 core boxes). Nothing
>>>>>>>>>>>>>>>>> terribly informative was in the logs (initially we saw
>>>>>>>>>>>>>>>>> some flush messages that seemed excessive, but that was
>>>>>>>>>>>>>>>>> not all of the time, and we changed back to the standard
>>>>>>>>>>>>>>>>> HBase WAL codec). We are able to kill the unresponsive
>>>>>>>>>>>>>>>>> regionservers and then restart them; the cluster will be
>>>>>>>>>>>>>>>>> fine for a day or so but will start to lock up again.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We've dropped the entire HBase and ZooKeeper information
>>>>>>>>>>>>>>>>> and started from scratch, but that has not helped.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached
>>>>>>>>>>>>>>>>> a jstack report of a locked-up regionserver in hopes that
>>>>>>>>>>>>>>>>> someone can shed some light.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>> -chris