Hi Jeffrey.

Thank you for taking a look. I added 
log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the Custom 
log4j.properties in Ambari, did a rolling restart of the regionservers (and 
verified the change was pushed out to the nodes in 
/etc/hbase/conf/log4j.properties), then issued a "select count(*)" on the 
table. Looking at the logs I don't see any TRACE messages. The only two log 
messages that appeared when I issued the select are:

2014-05-20 16:43:53,609 INFO  [RpcServer.handler=9,port=60020] 
coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped coprocessor 
scan 
{"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
2014-05-20 16:43:54,309 INFO  [RpcServer.handler=9,port=60020] 
coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608 rows for 
ungrouped coprocessor scan 
{"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}

The metrics for the regionserver show only 2 total read requests since it 
started and 0 per second now. Those two requests are for the table I did the 
count on. The regionserver's UI says it has no active RPC calls. It currently 
has one handler still stuck from running that query:

"RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000 
nid=0x2582 runnable [0x00007fbd5ba6c000]
   java.lang.Thread.State: RUNNABLE
        at 
org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2888)
        at 
org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2880)
        at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
        at java.util.TreeMap.getEntry(TreeMap.java:340)
        at java.util.TreeMap.get(TreeMap.java:273)
        at 
org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDeleted(GetClosestRowBeforeTracker.java:128)
        at 
org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandidate(GetClosestRowBeforeTracker.java:107)
        at 
org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(GetClosestRowBeforeTracker.java:202)
        at 
org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1772)
        at 
org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
        at 
org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
        at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
        at 
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
        at 
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
        at 
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
        at java.lang.Thread.run(Thread.java:744)

It also has one listener running:

"RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000 
nid=0x2579 runnable [0x00007fbd5c376000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
        - locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
        - locked <0x000000059ae289a0> (a java.util.Collections$UnmodifiableSet)
        - locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)

If there is anything else you need, please let me know.

thanks again,
-chris


On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:

> 
> I rechecked your RPC log and found that you didn't turn on the RPC trace
> level. You can set the following so that we can see the detailed request &
> response. 
> 
> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
> 
> If you don't see timeout exceptions in your region server logs, it means
> that something keeps issuing lots of Get requests. Have you used a secondary
> index for this table? If that's the case, please disable/drop it to see if
> that helps.
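> 
> For example (the index and table names below are placeholders, not from your
> schema), from sqlline.py something along the lines of:
> 
>   ALTER INDEX MY_IDX ON MY_TABLE DISABLE;
>   -- or, to remove it entirely:
>   DROP INDEX MY_IDX ON MY_TABLE;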
> 
> 
> 
> 
> On 5/20/14 1:21 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
> 
>> I will see how feasible a new, all bare metal cluster is, but we have 5
>> other similarly configured clusters with no troubles. We only have an issue
>> when Phoenix does a full scan of some tables: a handler gets stuck in
>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>> after it has returned all of its data.
>> 
>> thank you,
>> 
>> -chris
>> 
>> On May 20, 2014, at 12:34 PM, alex kamil <alex.ka...@gmail.com> wrote:
>> 
>>> I'm just guessing, but maybe the region servers are stuck because they
>>> can't communicate with zookeeper or hmaster, or maybe the datanodes can't
>>> talk to the namenode because of some VMware networking quirk and some
>>> internal thread is hanging, who knows.
>>> It would take 30 min to set up a bare metal cluster vs. days debugging this
>>> hybrid setup.
>>> I'd also suggest installing directly from the hbase/hadoop websites to
>>> eliminate additional variables.
>>> 
>>> 
>>> 
>>> 
>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas
>>> <c...@biotiquesystems.com>wrote:
>>> 
>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>>> datanodes/regionservers with the stuck processes are all bare metal.
>>>> 
>>>> -chris
>>>> 
>>>> On May 20, 2014, at 11:52 AM, alex kamil <alex.ka...@gmail.com> wrote:
>>>> 
>>>>>> "with 3 VMWare "master" nodes"
>>>>> 
>>>>> can you try running hbase/phoenix on physical nodes instead of using
>>>>> virtual machines
>>>>> 
>>>>> 
>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <c...@biotiquesystems.com
>>>>> wrote:
>>>>> 
>>>>>> Sorry to follow up on my own message, but I was wondering if anyone had
>>>>>> any ideas? Normal non-Phoenix scans don't cause this symptom, but a
>>>>>> select * on the exact same table right afterwards will.
>>>>>> 
>>>>>> If we export the table and then re-import it into a new table, the new
>>>>>> table doesn't exhibit these symptoms, same as if we use an upsert..select
>>>>>> to do a copy (sketch below). It seems something happens to the last
>>>>>> region to cause this, but it is not directly data dependent. Moving the
>>>>>> region to another regionserver doesn't have any effect - just moves where
>>>>>> the problem happens. Major compactions get hung up by the running threads
>>>>>> as they probably have a lock.
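>>>>>> 
>>>>>> (The copy I mention is just a plain Phoenix UPSERT SELECT; the table
>>>>>> names here are placeholders:
>>>>>> 
>>>>>>   UPSERT INTO NEW_TABLE SELECT * FROM OLD_TABLE;
>>>>>> )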
>>>>>> 
>>>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>>>> 
>>>>>> Figuring this out will allow this project to continue; as of now it is
>>>>>> blocked on this issue.
>>>>>> 
>>>>>> thank you,
>>>>>> -chris
>>>>>> 
>>>>>> 
>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <c...@biotiquesystems.com>
>>>> wrote:
>>>>>> 
>>>>>>> I did some "poor man's" profiling with multiple jstacks and came up with
>>>>>>> where the RpcServer.handler appears to be stuck:
>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>>>>>>> That is the deepest method in all of the traces, either HStore.java:1712
>>>>>>> or HStore.java:1722. Here are two example traces for a thread (which has
>>>>>>> been running for the last couple of hours):
>>>>>>> 
>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>> tid=0x0000000000cdb800
>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(Fa
>>>> stDiffDeltaEncoder.java:540)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEnc
>>>> odedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFi
>>>> leReaderV2.java:1063)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStor
>>>> e.java:1776)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(H
>>>> Store.java:1722)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.j
>>>> ava:1655)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion
>>>> .java:1826)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>> a:2841)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.
>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>    at 
>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>    at 
>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSch
>>>> eduler.java:160)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSched
>>>> uler.java:38)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.
>>>> java:110)
>>>>>>>    at java.lang.Thread.run(Thread.java:744)
>>>>>>> 
>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>> tid=0x0000000000cdb800
>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944
>>>> )
>>>>>>>    at
>>>> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBl
>>>> ockContainingKey(HFileBlockIndex.java:392)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDa
>>>> taBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekTo
>>>> DataBlock(HFileBlockIndex.java:179)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBef
>>>> ore(HFileReaderV2.java:548)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(H
>>>> Store.java:1712)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.j
>>>> ava:1655)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion
>>>> .java:1826)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>> a:2841)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.
>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>    at 
>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>    at 
>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSch
>>>> eduler.java:160)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSched
>>>> uler.java:38)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.
>>>> java:110)
>>>>>>>    at java.lang.Thread.run(Thread.java:744)
>>>>>>> 
>>>>>>> 
>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> Hey Chris,
>>>>>>>> 
>>>>>>>> I used the performance.py tool, which created a table with 50K rows,
>>>>>>>> ran the following query from sqlline.py, and everything seems fine
>>>>>>>> without the CPU running hot.
>>>>>>>> 
>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>>>>>> PERFORMANCE_50000;
>>>>>>>> +------------+
>>>>>>>> |  COUNT(1)  |
>>>>>>>> +------------+
>>>>>>>> | 50000      |
>>>>>>>> +------------+
>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>>>>>> PERFORMANCE_50000;
>>>>>>>> +------------+
>>>>>>>> |  COUNT(1)  |
>>>>>>>> +------------+
>>>>>>>> | 50000      |
>>>>>>>> +------------+
>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>> 
>>>>>>>> Is there any way you could run a profiler to see where the CPU goes?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>> 
>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>> 
>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>> 
>>>>>>>>> thanks again,
>>>>>>>>> -chris
>>>>>>>>> 
>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com>
>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Chris,
>>>>>>>>>> 
>>>>>>>>>> Attachments are filtered out by the mail server. Can you
>>>>>>>>>> pastebin it
>>>>>>>>>> some
>>>>>>>>>> place?
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Nick
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>>>>>>>>> <c...@biotiquesystems.com>wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hello,
>>>>>>>>>>> 
>>>>>>>>>>> We set the HBase RegionServer Handler count to 10 (it appears to
>>>>>>>>>>> have been set to 60 by Ambari during the install process; the exact
>>>>>>>>>>> property is noted at the end of this message). Now we have narrowed
>>>>>>>>>>> down what causes the CPU to increase and have some detailed logs:
>>>>>>>>>>> 
>>>>>>>>>>> If we connect using sqlline.py and execute a select that selects
>>>>>>>>>>> one row using the primary key, no increase in CPU is observed and
>>>>>>>>>>> the number of RPC threads in a RUNNABLE state remains the same.
>>>>>>>>>>> 
>>>>>>>>>>> If we execute a select that scans the table, such as "select
>>>>>>>>>>> count(*) from TABLE", or one whose "where" clause only limits on
>>>>>>>>>>> non-primary key attributes, then the number of RUNNABLE
>>>>>>>>>>> RpcServer.handler threads increases and the CPU utilization of the
>>>>>>>>>>> regionserver increases by ~105%.
>>>>>>>>>>> 
>>>>>>>>>>> Disconnecting the client does not have an effect and the
>>>>>>>>>>> RpcServer.handler
>>>>>>>>>>> thread is left RUNNABLE and the CPU stays at the higher usage.
>>>>>>>>>>> 
>>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state, no other
>>>>>>>>>>> monitored
>>>>>>>>>>> tasks
>>>>>>>>>>> are happening. The regionserver has a Max Heap of 10G and a Used
>>>> heap
>>>>>>>>>>> of
>>>>>>>>>>> 445.2M.
>>>>>>>>>>> 
>>>>>>>>>>> I've attached the regionserver log with IPC debug logging
>>>>>>>>>>> turned on
>>>>>>>>>>> right
>>>>>>>>>>> when one of the Phoenix statements is executed (this statement
>>>>>> actually
>>>>>>>>>>> used up the last available handler).
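>>>>>>>>>>> 
>>>>>>>>>>> (For reference, the handler setting we changed through Ambari should
>>>>>>>>>>> correspond to the standard HBase property, if I'm reading the config
>>>>>>>>>>> mapping right:
>>>>>>>>>>> 
>>>>>>>>>>>   hbase.regionserver.handler.count=10
>>>>>>>>>>> )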
>>>>>>>>>>> 
>>>>>>>>>>> thanks,
>>>>>>>>>>> -chris
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong
>>>>>>>>>>> <jzh...@hortonworks.com
>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> From the stack, it seems you increased the default rpc handler
>>>>>>>>>>>> number to about 60. All handlers are serving Get requests (you can
>>>>>>>>>>>> search for
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>>>>>> 
>>>>>>>>>>>> You can check why there are so many get requests by adding some
>>>>>>>>>>>> log info or enabling hbase rpc trace. I guess if you decrease the
>>>>>>>>>>>> number of rpc handlers per region server, it will mitigate your
>>>>>>>>>>>> current issue.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com>
>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> We have hit a problem with Phoenix where regionserver CPU usage
>>>>>>>>>>>>> spikes up to use all available CPU and the regionservers become
>>>>>>>>>>>>> unresponsive.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> After HDP 2.1 was released we set up a 4 compute node cluster
>>>>>>>>>>>>> (with 3 VMware "master" nodes) to test out Phoenix. It is a plain
>>>>>>>>>>>>> Ambari 1.5/HDP 2.1 install; we added the HDP Phoenix RPM release
>>>>>>>>>>>>> and hand-linked the jar files into the hadoop lib. Everything was
>>>>>>>>>>>>> going well and we were able to load ~30k records into several
>>>>>>>>>>>>> tables. After about 3-4 days of being up, the regionservers became
>>>>>>>>>>>>> unresponsive and started to use most of the available CPU (12 core
>>>>>>>>>>>>> boxes). Nothing terribly informative was in the logs (initially we
>>>>>>>>>>>>> saw some flush messages that seemed excessive, but that was not
>>>>>>>>>>>>> all of the time and we changed back to the standard HBase WAL
>>>>>>>>>>>>> codec). We are able to kill the unresponsive regionservers and
>>>>>>>>>>>>> restart them; the cluster will be fine for a day or so but will
>>>>>>>>>>>>> start to lock up again.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We've dropped the entire HBase and zookeeper information and
>>>>>>>>>>>>> started from scratch, but that has not helped.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a
>>>> jstack
>>>>>>>>>>>>> report of a locked up regionserver in hopes that someone can
>>>>>>>>>>>>> shed
>>>>>>>>>>>>> some
>>>>>>>>>>>>> light.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>> -chris
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
> 
> 
> 