I did some "poor man's" profiling with multiple jstack dumps and narrowed down where the RpcServer.handler threads appear to be stuck: org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile. That is the deepest HStore frame in all of the traces, at either HStore.java:1712 or HStore.java:1722. A rough sketch of the sampling approach is below, followed by two example traces for a handler thread that has been running for the last couple of hours.
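This is only a sketch of the idea, not the exact tooling I ran. It just shells out to jstack repeatedly and counts which HBase frames the RpcServer.handler threads are sitting in; the region server pid is passed as the first argument, and the sample count and frame filter are placeholders to adjust:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Poor man's profiler: take repeated jstack samples of the region server and
// count how often each HBase frame appears under RpcServer.handler threads.
// Frames that dominate across samples are where the handlers spend their time.
public class JstackSampler {
    public static void main(String[] args) throws Exception {
        String pid = args[0];          // region server pid (placeholder)
        int samples = 20;              // number of jstack samples to take

        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i < samples; i++) {
            Process p = new ProcessBuilder("jstack", pid)
                .redirectErrorStream(true).start();
            BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
            boolean inHandler = false;
            String line;
            while ((line = r.readLine()) != null) {
                line = line.trim();
                if (line.startsWith("\"")) {
                    // Thread header line: only count frames from RpcServer.handler threads.
                    inHandler = line.contains("RpcServer.handler");
                } else if (inHandler && line.startsWith("at org.apache.hadoop.hbase")) {
                    Integer c = counts.get(line);
                    counts.put(line, c == null ? 1 : c + 1);
                }
            }
            p.waitFor();
            Thread.sleep(1000);        // wait a second between samples
        }

        // Print the most frequently sampled frames first.
        List<Map.Entry<String, Integer>> sorted =
            new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(sorted, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue();
            }
        });
        for (Map.Entry<String, Integer> e : sorted) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}

And here are the two example traces: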
"RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540) at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063) at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776) at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722) at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655) at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826) at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110) at java.lang.Thread.run(Thread.java:744) "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944) at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622) at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392) at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209) at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548) at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712) at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655) at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826) at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110) at java.lang.Thread.run(Thread.java:744) On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote: > > Hey Chris, > > I used performance.py tool which created a table with 50K rows in one > table, run the following query from sqlline.py and everything seems fine > without seeing CPU running hot. 
>
> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
> +------------+
> |  COUNT(1)  |
> +------------+
> |   50000    |
> +------------+
> 1 row selected (0.166 seconds)
> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
> +------------+
> |  COUNT(1)  |
> +------------+
> |   50000    |
> +------------+
> 1 row selected (0.167 seconds)
>
> Is there any way you could run a profiler to see where the CPU goes?
>
>
> On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>
>> Ahh, yes. Here is a pastebin for it:
>>
>> http://pastebin.com/w6mtabag
>>
>> thanks again,
>> -chris
>>
>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>
>>> Hi Chris,
>>>
>>> Attachments are filtered out by the mail server. Can you pastebin it
>>> some place?
>>>
>>> Thanks,
>>> Nick
>>>
>>>
>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We set the HBase RegionServer Handler count to 10 (it appears to have
>>>> been set to 60 by Ambari during the install process). Now we have
>>>> narrowed down what causes the CPU to increase and have some detailed
>>>> logs:
>>>>
>>>> If we connect using sqlline.py and execute a select that selects one
>>>> row using the primary key, no increase in CPU is observed and the
>>>> number of RPC threads in a RUNNABLE state remains the same.
>>>>
>>>> If we execute a select that scans the table, such as "select count(*)
>>>> from TABLE", or where the "where" clause only limits on
>>>> non-primary-key attributes, then the number of RUNNABLE
>>>> RpcServer.handler threads increases and the CPU utilization of the
>>>> regionserver increases by ~105%.
>>>>
>>>> Disconnecting the client does not have an effect; the
>>>> RpcServer.handler thread is left RUNNABLE and the CPU stays at the
>>>> higher usage.
>>>>
>>>> Checking the Web Console for the regionserver just shows 10
>>>> RpcServer.reader tasks, all in a WAITING state; no other monitored
>>>> tasks are happening. The regionserver has a Max Heap of 10G and a
>>>> Used Heap of 445.2M.
>>>>
>>>> I've attached the regionserver log with IPC debug logging turned on
>>>> right when one of the Phoenix statements is executed (this statement
>>>> actually used up the last available handler).
>>>>
>>>> thanks,
>>>> -chris
>>>>
>>>>
>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>
>>>>> From the stack, it seems you increased the default rpc handler
>>>>> number to about 60. All handlers are serving Get requests (you can
>>>>> search for
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>
>>>>> You can check why there are so many get requests by adding some log
>>>>> info or enabling hbase rpc trace. I guess if you decrease the number
>>>>> of rpc handlers per region server, it will mitigate your current
>>>>> issue.
>>>>>
>>>>>
>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>
>>>>>> We have hit a problem with Phoenix and regionserver CPU usage
>>>>>> spiking up to use all available CPU and becoming unresponsive.
>>>>>>
>>>>>> After HDP 2.1 was released we set up a 4-compute-node cluster
>>>>>> (with 3 VMWare "master" nodes) to test out Phoenix on it.
>>>>>> It is a plain Ambari 1.5/HDP 2.1 install and we added the HDP
>>>>>> Phoenix RPM release and hand-linked the jar files into the hadoop
>>>>>> lib. Everything was going well and we were able to load ~30k
>>>>>> records into several tables. What happened was that after about
>>>>>> 3-4 days of being up, the regionservers became unresponsive and
>>>>>> started to use most of the available CPU (12-core boxes). Nothing
>>>>>> terribly informative was in the logs (initially we saw some flush
>>>>>> messages that seemed excessive, but that was not all of the time,
>>>>>> and we changed back to the standard HBase WAL codec). We are able
>>>>>> to kill the unresponsive regionservers and then restart them; the
>>>>>> cluster will be fine for a day or so but will start to lock up
>>>>>> again.
>>>>>>
>>>>>> We've dropped the entire HBase and ZooKeeper information and
>>>>>> started from scratch, but that has not helped.
>>>>>>
>>>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>>>> report of a locked-up regionserver in hopes that someone can shed
>>>>>> some light.
>>>>>>
>>>>>> thanks,
>>>>>> -chris
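PS: for anyone trying to reproduce this, the two statement shapes described above (a single-row select on the primary key versus a full-table count) can be driven from a small JDBC client like the sketch below. This is only an illustrative sketch, not our actual code: the ZooKeeper quorum, the MYTABLE table, and the PK column are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of the two query shapes discussed in the thread, via the Phoenix
// JDBC driver. "localhost" (the ZooKeeper quorum), MYTABLE and PK are
// placeholders to replace with your own connection string and schema.
public class QueryShapes {
    public static void main(String[] args) throws Exception {
        // Explicit driver load; with a JDBC 4 driver on the classpath this
        // is usually optional.
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");

        // 1) Point lookup on the primary key: this is the case where we see
        //    no CPU increase and no extra RUNNABLE handler threads.
        PreparedStatement ps =
            conn.prepareStatement("SELECT * FROM MYTABLE WHERE PK = ?");
        ps.setString(1, "some-key");
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString("PK"));
        }
        rs.close();
        ps.close();

        // 2) Full-table aggregate: this is the kind of statement after which
        //    handler threads stay RUNNABLE and CPU stays high for us.
        Statement st = conn.createStatement();
        ResultSet count = st.executeQuery("SELECT COUNT(*) FROM MYTABLE");
        if (count.next()) {
            System.out.println("rows = " + count.getLong(1));
        }
        count.close();
        st.close();

        conn.close();
    }
}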