I am trying to reproduce this issue on our sandbox cluster. No matter how I insert the same data (random, sorted, reverse sorted, bulk, with upserts, etc.) I don't see this problem. I'll be copying the actual HFiles of the problematic table over next to see if that can do it.

-chris
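One way to carry the exact HFiles to another cluster, rather than freshly inserted data, is a snapshot export. This is a sketch only; the snapshot name and the sandbox NameNode address are placeholders:

    # Take a snapshot of the problematic table (HBase shell)
    hbase> snapshot 'MY_TABLE', 'mytable-repro'

    # Copy the snapshot's HFiles to the sandbox cluster (command line)
    $ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot mytable-repro \
        -copy-to hdfs://sandbox-nn:8020/hbase -mappers 4

    # Re-materialize the table from the same HFiles on the sandbox (HBase shell)
    hbase> clone_snapshot 'mytable-repro', 'MY_TABLE'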
On May 21, 2014, at 3:58 PM, Enis Soztutar <e...@hortonworks.com> wrote:

> Hey Chris,
>
> Thanks for bearing with us. It seems that the FAST_DIFF encoding + your data schema + use of getRowOrBefore() caused the issue. That is why Jeffrey was asking for the data. It is fine. getRowOrBefore() will go away anyway (https://issues.apache.org/jira/browse/HBASE-11230). Jeffrey's patch changes Phoenix to use a "reverse scan", which should not have an issue with FAST_DIFF. It would be awesome if you could test it with your data so that we can see whether it is a problem with reverse lookups in FAST_DIFF encoding, or a specific problem with getRowOrBefore().
>
> Thanks for reporting.
> Enis
>
> On Wed, May 21, 2014 at 3:44 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>
>> I'll start on building a patched version and trying it out (now that it is working, I'll have to wrangle the cluster away from the users). I cannot send the data; it is private.
>>
>> thanks,
>> -chris
>>
>> On May 21, 2014, at 3:04 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>
>>> Thanks for reporting the status back. Would you mind trying the patch on PHOENIX-992, or attaching data that reproduces the issue, so that we can verify the current fix? Thanks.
>>>
>>> On 5/21/14 2:06 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>
>>>> Altering the table to not use DATA_BLOCK_ENCODING fixed the problem. Just to verify that it was the fix, I altered it back to FAST_DIFF and the problem reappeared. Thank you!
>>>>
>>>> Adding the changes to the client hbase-site.xml did not help the issue, but it may be that I was not sourcing it correctly in sqlline; I made sure it was in the classpath passed to java via -cp when launching it.
>>>>
>>>> thanks again,
>>>> -chris
>>>>
>>>> On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>
>>>>> Hey Chris,
>>>>>
>>>>> Sorry about the trace option. The right option is below; where you previously turned on DEBUG, you just need to change it to TRACE.
>>>>>
>>>>> log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE
>>>>>
>>>>> Do the two read requests burn one of your CPUs to 100%? From the call stacks you posted today and before, it seems there might be an infinite loop inside the function HTable.getRowOrBefore(), which is deprecated and only supposed to work on the meta region (as Enis pointed out). Fortunately, only one place in Phoenix calls this function, so could you try the following things?
>>>>>
>>>>> Set the following configuration in the client-side hbase-site.xml to disable stats updates:
>>>>>
>>>>> phoenix.query.statsUpdateFrequency = 2147483647
>>>>> phoenix.query.maxStatsAge = 0
>>>>>
>>>>> After the configuration changes, could you verify whether the CPU still spikes for the same query?
>>>>>
>>>>> You can also do the following: alter your table's column family data block encoding to NONE, as DATA_BLOCK_ENCODING => 'NONE', and do a major compaction after that through the HBase client.
>>>>>
>>>>> We suspect that the combination of your data, the FAST_DIFF data block encoding, and the deprecated HTable.getRowOrBefore() causes the issue.
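To make Jeffrey's two suggestions concrete: the stats overrides go in an hbase-site.xml on the client's classpath, with exactly the values given above:

    <property>
      <name>phoenix.query.statsUpdateFrequency</name>
      <value>2147483647</value> <!-- Integer.MAX_VALUE: effectively never refresh stats -->
    </property>
    <property>
      <name>phoenix.query.maxStatsAge</name>
      <value>0</value>
    </property>

and the encoding change can be applied from the HBase shell. The table name here is a placeholder; the family name '0' matches the one visible in the scan logs below (Phoenix's default column family):

    hbase> alter 'MY_TABLE', {NAME => '0', DATA_BLOCK_ENCODING => 'NONE'}
    hbase> major_compact 'MY_TABLE'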
>>>>> On 5/20/14 3:03 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>
>>>>>> Hi Jeffrey,
>>>>>>
>>>>>> Thank you for taking a look. I added log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the custom log4j.properties in Ambari and did a rolling restart of the regionservers (and verified it was pushed out to the nodes in /etc/hbase/conf/log4j.properties), then issued a "select count(*)" on the table. Looking at the logs I don't see any TRACE messages. The only two log messages produced when issuing the select are:
>>>>>>
>>>>>> 2014-05-20 16:43:53,609 INFO [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
>>>>>> 2014-05-20 16:43:54,309 INFO [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608 rows for ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
>>>>>>
>>>>>> The metrics for the regionserver show only 2 total read requests since it started, and 0 per second now. Those two requests are for the table I did the count on. The regionserver's UI says it has no active RPC calls. It currently has one stuck handler from running that command:
>>>>>>
>>>>>> "RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000 nid=0x2582 runnable [0x00007fbd5ba6c000]
>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>         at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2888)
>>>>>>         at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2880)
>>>>>>         at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
>>>>>>         at java.util.TreeMap.getEntry(TreeMap.java:340)
>>>>>>         at java.util.TreeMap.get(TreeMap.java:273)
>>>>>>         at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDeleted(GetClosestRowBeforeTracker.java:128)
>>>>>>         at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandidate(GetClosestRowBeforeTracker.java:107)
>>>>>>         at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(GetClosestRowBeforeTracker.java:202)
>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1772)
>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>         at java.lang.Thread.run(Thread.java:744)
>>>>>>
>>>>>> It also has one listener running:
>>>>>>
>>>>>> "RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000 nid=0x2579 runnable [0x00007fbd5c376000]
>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>>>>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>>>>>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>>>>>>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>>>>>>         - locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
>>>>>>         - locked <0x000000059ae289a0> (a java.util.Collections$UnmodifiableSet)
>>>>>>         - locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
>>>>>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>>>>>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>>>>>>         at org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
>>>>>>
>>>>>> If there is anything else you need, please let me know.
>>>>>>
>>>>>> thanks again,
>>>>>> -chris
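A note for readers following the logging back-and-forth: the logger name tried here appears to be the pre-0.96 one, and Jeffrey's correction upthread amounts to this log4j.properties change:

    # Has no effect on HBase 0.96+ (old class name):
    # log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
    # Works on HBase 0.96+ RPC tracing:
    log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE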
>>>>>> On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>
>>>>>>> I rechecked your RPC log and found that you didn't turn on the RPC trace level. You can set the following so that we can see the detailed requests & responses:
>>>>>>>
>>>>>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
>>>>>>>
>>>>>>> If you didn't see a timeout exception in your region server logs, it means that something keeps issuing lots of Get requests. Have you used a secondary index on this table? If that's the case, please disable/drop it to see if that helps.
>>>>>>>
>>>>>>> On 5/20/14 1:21 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>
>>>>>>>> I will see how possible a new, all-bare-metal cluster is, but we have 5 other similarly configured clusters with no troubles. We only have an issue when Phoenix does a full scan of some tables: a handler gets stuck in org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile after it has returned all of its data.
>>>>>>>>
>>>>>>>> thank you,
>>>>>>>> -chris
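For context, the server-side path Chris names here is reached from the deprecated client call that Enis and Jeffrey discuss above. A minimal sketch against the 0.96-era HBase client API; the table name, row key, and family are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetRowOrBeforeSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "MY_TABLE"); // placeholder table name
            try {
                // Deprecated, and only intended for meta lookups; this is the
                // client entry point that lands in HRegion.getClosestRowBefore()
                // and HStore.rowAtOrBeforeFromStoreFile() on the server.
                Result r = table.getRowOrBefore(Bytes.toBytes("row-key"), Bytes.toBytes("0"));
                System.out.println(r);
            } finally {
                table.close();
            }
        }
    }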
>>>>>>>> On May 20, 2014, at 12:34 PM, alex kamil <alex.ka...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'm just guessing, but maybe the region servers are stuck because they can't communicate with ZooKeeper or the HMaster, or maybe the datanodes can't talk to the namenode because of some VMware networking quirk and some internal thread is hanging; who knows. It would take 30 minutes to set up a bare-metal cluster vs. days debugging this hybrid setup. I'd also suggest installing directly from the HBase/Hadoop websites to eliminate additional variables.
>>>>>>>>>
>>>>>>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>>
>>>>>>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The datanodes/regionservers with the stuck processes are all bare metal.
>>>>>>>>>>
>>>>>>>>>> -chris
>>>>>>>>>>
>>>>>>>>>> On May 20, 2014, at 11:52 AM, alex kamil <alex.ka...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>> "with 3 VMWare "master" nodes"
>>>>>>>>>>>
>>>>>>>>>>> Can you try running HBase/Phoenix on physical nodes instead of using virtual machines?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry to follow up on my own message, but I was wondering if anyone had any ideas? Normal non-Phoenix scans don't cause this symptom, but a "select *" on the exact same table right afterwards will.
>>>>>>>>>>>>
>>>>>>>>>>>> If we export the table and then re-import it into a new table, the new table doesn't exhibit these symptoms, same as if we use an upsert..select to do a copy. It seems something happens to the last region to cause this, but it is not directly data dependent. Moving the region to another regionserver doesn't have any effect; it just moves where the problem happens. Major compactions get hung up by the running threads, as they probably hold a lock.
>>>>>>>>>>>>
>>>>>>>>>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>>>>>>>>>>
>>>>>>>>>>>> Figuring this out will allow this project to continue; as of now it is hung up on this issue.
>>>>>>>>>>>>
>>>>>>>>>>>> thank you,
>>>>>>>>>>>> -chris
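The "hfile tool" mentioned is presumably HBase's HFile pretty-printer. A typical invocation against one of the region's store files would look like the following; the path is a placeholder for a real store file:

    $ hbase org.apache.hadoop.hbase.io.hfile.HFile -v -m -k \
        -f /apps/hbase/data/data/default/MY_TABLE/<region>/0/<storefile>
    # -v verbose output, -m print the HFile meta block, -k check that row keys are in order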
>>>>>>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I did some "poor man's" profiling with multiple jstacks and came up with where the RpcServer.handler appears to be stuck: org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile. That is the deepest method in all of the traces, either HStore.java:1712 or HStore.java:1722. Here are two example traces for a thread (which has been running for the last couple of hours):
>>>>>>>>>>>>>
>>>>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>>>>>>>>         at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>>>>>
>>>>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>>>>>>>>         at java.lang.Thread.run(Thread.java:744)
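The "poor man's profiling" Chris describes is just repeated thread dumps; a sketch of the loop, assuming jps can find the regionserver process:

    # Sample the regionserver's stacks every 5 seconds, ten times, then look
    # for RpcServer.handler threads that stay RUNNABLE in the same frames.
    RS_PID=$(jps | grep HRegionServer | awk '{print $1}')
    for i in $(seq 1 10); do
      jstack "$RS_PID" >> rs-stacks.txt
      sleep 5
    done
    grep -A 12 'RpcServer.handler' rs-stacks.txt | less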
>>>>>>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Chris,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I used the performance.py tool, which created a table with 50K rows, then ran the following query from sqlline.py, and everything seems fine, without the CPU running hot:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> | 50000      |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> | 50000      |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there any way you could run a profiler to see where the CPU goes?
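For anyone repeating Jeffrey's test: performance.py ships in Phoenix's bin/ directory and takes the ZooKeeper quorum and a row count, so the 50K-row table above would have come from something like this (quorum host is a placeholder):

    $ bin/performance.py zkhost 50000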
>>>>>>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thanks again,
>>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Chris,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Attachments are filtered out by the mail server. Can you pastebin it some place?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We set the HBase regionserver handler count to 10 (it appears to have been set to 60 by Ambari during the install process). Now we have narrowed down what causes the CPU to increase, and have some detailed logs:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If we connect using sqlline.py and execute a select that fetches one row using the primary key, no increase in CPU is observed and the number of RPC threads in a RUNNABLE state remains the same.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If we execute a select that scans the table, such as "select count(*) from TABLE", or one whose "where" clause only limits on non-primary-key attributes, then the number of RUNNABLE RpcServer.handler threads increases and the CPU utilization of the regionserver increases by ~105%.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Disconnecting the client does not have an effect: the RpcServer.handler thread is left RUNNABLE and the CPU stays at the higher usage.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Checking the web console for the regionserver just shows 10 RpcServer.reader tasks, all in a WAITING state; no other monitored tasks are happening. The regionserver has a max heap of 10G and a used heap of 445.2M.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging turned on right when one of the Phoenix statements was executed (this statement actually used up the last available handler).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> From the stack, it seems you increased the default RPC handler count to about 60. All handlers are serving Get requests (you can search for org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You can check why there are so many Get requests by adding some log info or enabling the HBase RPC trace. I guess that if you decrease the number of RPC handlers per region server, it will mitigate your current issue.
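The handler pool Jeffrey refers to, which Chris then dropped from Ambari's 60 down to 10, is sized by this server-side hbase-site.xml property:

    <property>
      <name>hbase.regionserver.handler.count</name>
      <value>10</value>
    </property>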
>>>>>>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We have hit a problem with Phoenix where regionserver CPU usage spikes up to use all available CPU and the regionservers become unresponsive.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> After HDP 2.1 was released we set up a 4-compute-node cluster (with 3 VMware "master" nodes) to test out Phoenix on it. It is a plain Ambari 1.5/HDP 2.1 install; we added the HDP Phoenix RPM release and hand-linked the jar files into the Hadoop lib. Everything was going well and we were able to load ~30k records into several tables. What happened was that after about 3-4 days of being up, the regionservers became unresponsive and started to use most of the available CPU (12-core boxes). Nothing terribly informative was in the logs (initially we saw some flush messages that seemed excessive, but that was not all of the time, and we changed back to the standard HBase WAL codec). We are able to kill the unresponsive regionservers and then restart them; the cluster will be fine for a day or so but will start to lock up again.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We've dropped the entire HBase and ZooKeeper information and started from scratch, but that has not helped.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a jstack report of a locked-up regionserver in hopes that someone can shed some light.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>>>> -chris
If you > >>>>>>>>>>>> have > >>>>>>>>>>>> received this communication in error, please contact the sender > >>>>>>>>>> immediately > >>>>>>>>>>>> and delete it from your system. Thank You. > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> -- > >>>>> CONFIDENTIALITY NOTICE > >>>>> NOTICE: This message is intended for the use of the individual or > >>>>> entity to > >>>>> which it is addressed and may contain information that is > >>>>> confidential, > >>>>> privileged and exempt from disclosure under applicable law. If the > >>>>> reader > >>>>> of this message is not the intended recipient, you are hereby notified > >>>>> that > >>>>> any printing, copying, dissemination, distribution, disclosure or > >>>>> forwarding of this communication is strictly prohibited. If you have > >>>>> received this communication in error, please contact the sender > >>>>> immediately > >>>>> and delete it from your system. Thank You. > >>>> > >>> > >>> > >>> > >>> -- > >>> CONFIDENTIALITY NOTICE > >>> NOTICE: This message is intended for the use of the individual or > >>> entity to > >>> which it is addressed and may contain information that is confidential, > >>> privileged and exempt from disclosure under applicable law. If the > >>> reader > >>> of this message is not the intended recipient, you are hereby notified > >>> that > >>> any printing, copying, dissemination, distribution, disclosure or > >>> forwarding of this communication is strictly prohibited. If you have > >>> received this communication in error, please contact the sender > >>> immediately > >>> and delete it from your system. Thank You. > >> > > > > > > > > -- > > CONFIDENTIALITY NOTICE > > NOTICE: This message is intended for the use of the individual or entity to > > which it is addressed and may contain information that is confidential, > > privileged and exempt from disclosure under applicable law. If the reader > > of this message is not the intended recipient, you are hereby notified that > > any printing, copying, dissemination, distribution, disclosure or > > forwarding of this communication is strictly prohibited. If you have > > received this communication in error, please contact the sender immediately > > and delete it from your system. Thank You. > > > > > -- > > > CONFIDENTIALITY NOTICE > NOTICE: This message is intended for the use of the individual or entity to > which it is addressed and may contain information that is confidential, > privileged and exempt from disclosure under applicable law. If the reader of > this message is not the intended recipient, you are hereby notified that any > printing, copying, dissemination, distribution, disclosure or forwarding of > this communication is strictly prohibited. If you have received this > communication in error, please contact the sender immediately and delete it > from your system. Thank You.