Altering the table to not use DATA_BLOCK_ENCODING fixed the problem. Just to verify that it was the fix, I altered it back to FAST_DIFF and the problem re-appeared. Thank you!
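For reference, the change that fixed it amounts to roughly the following HBase shell commands; this is only a sketch, the table name is a placeholder, and the column family name '0' is taken from the default Phoenix family shown in the scan logs further down, so it may differ for other tables:

    # run from `hbase shell`: take the table offline, switch the family's encoding, then recompact
    disable 'MY_TABLE'
    alter 'MY_TABLE', { NAME => '0', DATA_BLOCK_ENCODING => 'NONE' }
    enable 'MY_TABLE'
    major_compact 'MY_TABLE'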
Adding the changes to the client hbase-site.xml did not help the issue, but it may be that I was not sourcing it correctly in sqlline. I made sure it was in the classpath being passed to java via -cp when launching it.

thanks again,
-chris

On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:

Hey Chris,

Sorry about the trace option; the right option is below. Where you were able to turn on DEBUG before, you just need to change it to TRACE:

log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE

Do you see the two read requests burning one of your CPUs to 100%? From the call stacks you posted today and before, it seems there might be an infinite loop inside HTable.getRowOrBefore(), which is deprecated and only supposed to work for the meta region (a point brought up by Enis). Fortunately there is only one place in Phoenix that calls this function, so could you try the following?

Set the following configuration in the client-side hbase-site.xml to disable stats updates:

phoenix.query.statsUpdateFrequency = 2147483647
phoenix.query.maxStatsAge = 0

After the configuration changes, could you verify whether the CPU is still spiking for the same query?

You can also do the following: alter your table's column family data block encoding to NONE (DATA_BLOCK_ENCODING => 'NONE') and do a major compaction after that through the HBase client.

We suspect that the combination of your data, FAST_DIFF data block encoding, and the deprecated HTable.getRowOrBefore() causes the issue.
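For reference, a minimal sketch of that client-side override as it would look in hbase-site.xml; the property names and values are exactly the ones suggested above, and the assumption is that this file sits on the sqlline client's classpath for it to take effect:

    <configuration>
      <!-- effectively disable Phoenix client stats updates, as suggested above -->
      <property>
        <name>phoenix.query.statsUpdateFrequency</name>
        <value>2147483647</value>
      </property>
      <property>
        <name>phoenix.query.maxStatsAge</name>
        <value>0</value>
      </property>
    </configuration>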
On 5/20/14 3:03 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:

Hi Jeffrey,

Thank you for taking a look. I added log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the Custom log4j.properties in Ambari and did a rolling restart of the regionservers (and verified it was pushed out to the nodes in /etc/hbase/conf/log4j.properties), then issued a "select count(*)" on the table. Looking at the logs I don't see any TRACE messages. The only two log messages that happened when issuing the select are:

2014-05-20 16:43:53,609 INFO [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
2014-05-20 16:43:54,309 INFO [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608 rows for ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}

The metrics for the regionserver show only 2 total read requests since it started, and 0 per second now. Those two requests are for the table I did the count on. The regionserver's UI says it has no active RPC calls. It currently has one stuck handler from running that command:

"RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000 nid=0x2582 runnable [0x00007fbd5ba6c000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2888)
        at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2880)
        at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
        at java.util.TreeMap.getEntry(TreeMap.java:340)
        at java.util.TreeMap.get(TreeMap.java:273)
        at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDeleted(GetClosestRowBeforeTracker.java:128)
        at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandidate(GetClosestRowBeforeTracker.java:107)
        at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(GetClosestRowBeforeTracker.java:202)
        at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1772)
        at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
        at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
        at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
        at java.lang.Thread.run(Thread.java:744)

It also has one listener running:

"RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000 nid=0x2579 runnable [0x00007fbd5c376000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
        - locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
        - locked <0x000000059ae289a0> (a java.util.Collections$UnmodifiableSet)
        - locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)

If there is anything else you need, please let me know.

thanks again,
-chris

On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:

I rechecked your RPC log and found that you didn't turn on the RPC trace level. You can set the following so that we can see the detailed requests & responses:

log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE

If you didn't see timeout exceptions in your region server logs, it means that something keeps issuing lots of Get requests. Have you used a secondary index for this table?
If that's the case, please disable/drop it to see if that helps.

On 5/20/14 1:21 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:

I will see how feasible a new, all bare metal cluster is, but we have 5 other similarly configured clusters with no troubles. We only have an issue when Phoenix does a full scan of some tables: a handler gets stuck in org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile after it has returned all of its data.

thank you,
-chris

On May 20, 2014, at 12:34 PM, alex kamil <alex.ka...@gmail.com> wrote:

I'm just guessing, but maybe the region servers are stuck because they can't communicate with ZooKeeper or the HMaster, or maybe the datanodes can't talk to the namenode because of some VMware networking quirk and some internal thread is hanging; who knows. It would take 30 minutes to set up a bare metal cluster vs. days debugging this hybrid setup. I'd also suggest installing directly from the hbase/hadoop websites to eliminate additional variables.

On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:

Only the master nodes (HMaster, ZK and NN) are VMs. The datanodes/regionservers with the stuck processes are all bare metal.

-chris

On May 20, 2014, at 11:52 AM, alex kamil <alex.ka...@gmail.com> wrote:

> "with 3 VMWare "master" nodes"

Can you try running hbase/phoenix on physical nodes instead of using virtual machines?

On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:

Sorry to follow up on my own message, but I was wondering if anyone had any ideas? Normal non-Phoenix scans don't cause this symptom, but a select * on the exact same table right afterwards will.

If we export the table and then re-import it into a new table, the new table doesn't exhibit these symptoms, same as if we use an upsert..select to do a copy. It seems something happens to the last region to cause this, but it is not directly data dependent. Moving the region to another regionserver doesn't have any effect; it just moves where the problem happens. Major compactions get hung up by the running threads, as they probably hold a lock.

I've run the hfile tool on the final region and nothing seems awry.

Figuring this out will allow this project to continue; as of now it is hung up on this issue.

thank you,
-chris

On May 15, 2014, at 8:47 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:

I did some "poor man's" profiling with multiple jstack runs and came up with where the RpcServer.handler appears to be stuck: org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile. That is the deepest method common to all of the traces, at either HStore.java:1712 or HStore.java:1722.
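As a rough sketch, that kind of repeated jstack sampling can be scripted along these lines; the regionserver PID, sample count, and interval below are placeholder values, not the ones actually used here:

    # take a handful of thread dumps and keep the RPC handler frames for comparison
    RS_PID=12345                          # pid of the stuck regionserver (placeholder)
    for i in $(seq 1 10); do
      jstack "$RS_PID" | grep -A 25 'RpcServer.handler' >> handler-stacks.txt
      sleep 5                             # sampling interval (placeholder)
    done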
Here are two example traces for a thread (which has been running for the last couple hours):

"RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
        at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
        at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
        at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
        at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
        at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
        at java.lang.Thread.run(Thread.java:744)
"RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
        at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
        at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
        at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
        at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
        at java.lang.Thread.run(Thread.java:744)

On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:

Hey Chris,

I used the performance.py tool, which created a table with 50K rows, and ran the following query from sqlline.py; everything seems fine and I don't see the CPU running hot.
0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
+------------+
|  COUNT(1)  |
+------------+
| 50000      |
+------------+
1 row selected (0.166 seconds)
0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
+------------+
|  COUNT(1)  |
+------------+
| 50000      |
+------------+
1 row selected (0.167 seconds)

Is there any way you could run a profiler to see where the CPU goes?

On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:

Ahh, yes. Here is a pastebin for it:

http://pastebin.com/w6mtabag

thanks again,
-chris

On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:

Hi Chris,

Attachments are filtered out by the mail server. Can you pastebin it someplace?

Thanks,
Nick

On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:

Hello,

We set the HBase RegionServer handler count to 10 (it appears to have been set to 60 by Ambari during the install process). Now we have narrowed down what causes the CPU to increase and have some detailed logs:

If we connect using sqlline.py and execute a select that fetches one row using the primary key, no increase in CPU is observed and the number of RPC threads in a RUNNABLE state remains the same.

If we execute a select that scans the table, such as "select count(*) from TABLE", or whose "where" clause only limits on non-primary-key attributes, then the number of RUNNABLE RpcServer.handler threads increases and the CPU utilization of the regionserver increases by ~105%.

Disconnecting the client does not have an effect: the RpcServer.handler thread is left RUNNABLE and the CPU stays at the higher usage.

Checking the web console for the regionserver just shows 10 RpcServer.reader tasks, all in a WAITING state; no other monitored tasks are happening. The regionserver has a max heap of 10G and a used heap of 445.2M.

I've attached the regionserver log with IPC debug logging turned on right when one of the Phoenix statements is executed (this statement actually used up the last available handler).
thanks,
-chris

On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:

From the stack, it seems you increased the default RPC handler number to about 60. All handlers are serving Get requests (you can search for org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).

You can check why there are so many Get requests by adding some log info or enabling HBase RPC tracing. I guess that if you decrease the number of RPC handlers per region server, it will mitigate your current issue.

On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:

We have hit a problem with Phoenix and regionserver CPU usage spiking up to use all available CPU and becoming unresponsive.

After HDP 2.1 was released we set up a 4-compute-node cluster (with 3 VMware "master" nodes) to test out Phoenix on it. It is a plain Ambari 1.5/HDP 2.1 install; we added the HDP Phoenix RPM release and hand-linked the jar files into the hadoop lib. Everything was going well and we were able to load ~30k records into several tables. What happened was that after about 3-4 days of being up the regionservers became unresponsive and started to use most of the available CPU (12-core boxes). Nothing terribly informative was in the logs (initially we saw some flush messages that seemed excessive, but that was not all of the time, and we changed back to the standard HBase WAL codec). We are able to kill the unresponsive regionservers and then restart them; the cluster will be fine for a day or so but will start to lock up again.

We've dropped the entire HBase and ZooKeeper information and started from scratch, but that has not helped.

James Taylor suggested I send this off here. I've attached a jstack report of a locked-up regionserver in hopes that someone can shed some light.
thanks,
-chris