I'll start building a patched version and try it out (now that it is working I'll have to wrangle the cluster away from the users). I cannot send the data; it is private data.

thanks,
-chris
On May 21, 2014, at 3:04 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>
> Thanks for reporting the status back. Would you mind trying the patch on
> phoenix-992, or attaching data that reproduces the issue, so that we can
> verify the current fix? Thanks.
>
> On 5/21/14 2:06 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>
>> Altering the table to not use a DATA_BLOCK_ENCODING fixed the problem.
>> Just to verify that it was the fix, I altered it back to FAST_DIFF and
>> the problem re-appeared. Thank you!
>>
>> Adding the changes to the client hbase-site.xml did not help the issue,
>> but it may be that I was not sourcing it correctly in sqlline. I made
>> sure it was in the classpath passed to java via -cp when launching it.
>>
>> thanks again,
>> -chris
>>
>> On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>
>>> Hey Chris,
>>>
>>> Sorry about the trace option; the right option is below. Where you
>>> previously turned on DEBUG, just change it to TRACE:
>>>
>>> log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE
>>>
>>> Do you see the two read requests burn one of your CPUs to 100%? From
>>> the call stacks you posted today and before, it seems there might be an
>>> infinite loop inside HTable.getRowOrBefore(), which is deprecated and
>>> supposed to work only for the meta region (brought up by Enis).
>>> Fortunately there is only one place in Phoenix that calls this
>>> function, so could you try the following things?
>>>
>>> Set the following configuration in the client-side hbase-site.xml to
>>> disable stats updates:
>>>
>>> phoenix.query.statsUpdateFrequency = 2147483647
>>> phoenix.query.maxStatsAge = 0
>>>
>>> After the configuration changes, could you verify whether the CPU still
>>> spikes for the same query?
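(If sourcing the client hbase-site.xml is in doubt, those two properties can also be handed straight to the Phoenix JDBC connection. A minimal sketch, assuming a Phoenix 4.x client; the class name and ZooKeeper host "zkhost" are hypothetical, and properties passed this way are merged over whatever hbase-site.xml is on the classpath:)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class DisableStatsUpdates {  // hypothetical class and host names
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
            Properties props = new Properties();
            // The two client-side settings suggested above to disable
            // stats updates.
            props.setProperty("phoenix.query.statsUpdateFrequency", "2147483647");
            props.setProperty("phoenix.query.maxStatsAge", "0");
            try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:zkhost", props)) {
                // Re-run the problem query here, e.g. SELECT COUNT(*) FROM T.
            }
        }
    }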
>>>
>>> You can also do the following: alter your table's column family data
>>> block encoding to NONE (DATA_BLOCK_ENCODING => 'NONE') and run a major
>>> compaction after that through the HBase client.
>>>
>>> We suspect that the combination of your data, the FAST_DIFF data block
>>> encoding and the deprecated HTable.getRowOrBefore() causes the issue.
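(A minimal sketch of that alter-plus-major-compaction through the 0.98-era Java admin API; the table name "MY_TABLE" is hypothetical, and the column family "0" is the one seen in the scan logs below. The HBase shell's alter and major_compact commands achieve the same:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DropBlockEncoding {  // hypothetical table name
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            TableName table = TableName.valueOf("MY_TABLE");
            // Switch the column family's encoding from FAST_DIFF to NONE.
            HColumnDescriptor cf =
                admin.getTableDescriptor(table).getFamily(Bytes.toBytes("0"));
            cf.setDataBlockEncoding(DataBlockEncoding.NONE);
            admin.disableTable(table);   // offline schema change, to be safe
            admin.modifyColumn(table, cf);
            admin.enableTable(table);
            // Rewrite the existing HFiles without the old encoding.
            admin.majorCompact("MY_TABLE");
            admin.close();
        }
    }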
>>>
>>> On 5/20/14 3:03 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>
>>>> Hi Jeffrey,
>>>>
>>>> Thank you for taking a look. I added
>>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the
>>>> Custom log4j.properties in Ambari and did a rolling restart of the
>>>> regionservers (and verified it was pushed out to the nodes in
>>>> /etc/hbase/conf/log4j.properties), then issued a "select count(*)" on
>>>> the table. Looking at the logs I don't see any TRACE messages. The
>>>> only two log messages that appeared when issuing the select are:
>>>>
>>>> 2014-05-20 16:43:53,609 INFO [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
>>>> 2014-05-20 16:43:54,309 INFO [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608 rows for ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
>>>>
>>>> The metrics for the regionserver show only 2 total read requests since
>>>> it started and 0 per second now. Those two requests are for the table
>>>> I did the count on. The regionserver's UI says it has no active RPC
>>>> calls. It currently has one stuck handler from running that command:
>>>>
>>>> "RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000 nid=0x2582 runnable [0x00007fbd5ba6c000]
>>>>    java.lang.Thread.State: RUNNABLE
>>>>    at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2888)
>>>>    at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2880)
>>>>    at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
>>>>    at java.util.TreeMap.getEntry(TreeMap.java:340)
>>>>    at java.util.TreeMap.get(TreeMap.java:273)
>>>>    at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDeleted(GetClosestRowBeforeTracker.java:128)
>>>>    at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandidate(GetClosestRowBeforeTracker.java:107)
>>>>    at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(GetClosestRowBeforeTracker.java:202)
>>>>    at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1772)
>>>>    at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>>    at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>    at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>    at java.lang.Thread.run(Thread.java:744)
>>>>
>>>> It also has one listener running:
>>>>
>>>> "RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000 nid=0x2579 runnable [0x00007fbd5c376000]
>>>>    java.lang.Thread.State: RUNNABLE
>>>>    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>>    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>>>    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>>>>    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>>>>    - locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
>>>>    - locked <0x000000059ae289a0> (a java.util.Collections$UnmodifiableSet)
>>>>    - locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
>>>>    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>>>>    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>>>>    at org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
>>>>
>>>> If there is anything else you need, please let me know.
>>>>
>>>> thanks again,
>>>> -chris
>>>>
>>>> On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>
>>>>> I rechecked your RPC log and found that you didn't turn on the RPC
>>>>> trace level. You can set the following so that we can see the
>>>>> detailed request & response:
>>>>>
>>>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
>>>>>
>>>>> If you didn't see a timeout exception in your region server logs, it
>>>>> means something keeps issuing lots of Get requests. Have you used a
>>>>> secondary index on this table? If that's the case, please
>>>>> disable/drop it to see if that helps.
>>>>>
>>>>> On 5/20/14 1:21 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>
>>>>>> I will see how possible a new, all bare metal cluster is, but we
>>>>>> have 5 other similarly configured clusters with no troubles. We only
>>>>>> have an issue when Phoenix does a full scan of some tables: a
>>>>>> handler gets stuck in
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>>> after it has returned all of its data.
>>>>>>
>>>>>> thank you,
>>>>>> -chris
>>>>>>
>>>>>> On May 20, 2014, at 12:34 PM, alex kamil <alex.ka...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm just guessing, but maybe the region servers are stuck because
>>>>>>> they can't communicate with ZooKeeper or the HMaster, or maybe the
>>>>>>> datanodes can't talk to the namenode because of some VMware
>>>>>>> networking quirk and some internal thread is hanging; who knows. It
>>>>>>> would take 30 minutes to set up a bare metal cluster vs. days
>>>>>>> debugging this hybrid setup. I'd also suggest installing directly
>>>>>>> from the hbase/hadoop websites to eliminate additional variables.
>>>>>>>
>>>>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>
>>>>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>>>>>>> datanodes/regionservers with the stuck processes are all bare
>>>>>>>> metal.
>>>>>>>>
>>>>>>>> -chris
>>>>>>>>
>>>>>>>> On May 20, 2014, at 11:52 AM, alex kamil <alex.ka...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>> "with 3 VMWare "master" nodes"
>>>>>>>>>
>>>>>>>>> can you try running hbase/phoenix on physical nodes instead of
>>>>>>>>> using virtual machines?
>>>>>>>>>
>>>>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry to follow up on my own message, but I was wondering if
>>>>>>>>>> anyone had any ideas? Normal non-Phoenix scans don't cause this
>>>>>>>>>> symptom, but a select * right afterwards on the exact same table
>>>>>>>>>> will.
>>>>>>>>>>
>>>>>>>>>> If we export the table and then re-import it into a new table,
>>>>>>>>>> the new table doesn't exhibit these symptoms, same as if we use
>>>>>>>>>> an upsert..select to do a copy. It seems something happens to
>>>>>>>>>> the last region to cause this, but it is not directly data
>>>>>>>>>> dependent. Moving the region to another regionserver doesn't
>>>>>>>>>> have any effect - it just moves where the problem happens. Major
>>>>>>>>>> compactions get hung up by the running threads, as they probably
>>>>>>>>>> hold a lock.
>>>>>>>>>>
>>>>>>>>>> I've run the hfile tool on the final region and nothing seems
>>>>>>>>>> awry.
>>>>>>>>>>
>>>>>>>>>> Figuring this out will allow this project to continue; as of now
>>>>>>>>>> it is hung up on this issue.
>>>>>>>>>>
>>>>>>>>>> thank you,
>>>>>>>>>> -chris
>>>>>>>>>>
>>>>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I did some "poor man's" profiling with multiple jstacks and
>>>>>>>>>>> came up with where the RpcServer.handler appears to be stuck:
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>>>>>>>>>>> That is the deepest method in all of the traces, either
>>>>>>>>>>> HStore.java:1712 or HStore.java:1722.
>>>>>>>>>>> Here are two example traces for a thread (which has been
>>>>>>>>>>> running for the last couple hours):
>>>>>>>>>>>
>>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>>>>>>    at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>>>>>>>>>>    at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>>>>>>    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>>>>>>    at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>>>
>>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>>>>>>    at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>>>>>>>>>>    at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>>>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>>>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>>>>>>    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>>>>>>    at java.lang.Thread.run(Thread.java:744)
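(Both traces bottom out under HRegion.getClosestRowBefore(), so one hedged way to test the theory outside Phoenix is to call the deprecated HTable.getRowOrBefore() directly against the affected table. A minimal sketch against the 0.98-era client API; the table name "MY_TABLE" and row key are hypothetical, while the column family "0" comes from the scan logs above. If the FAST_DIFF/getRowOrBefore combination is the culprit, this single call should pin a handler:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetRowOrBeforeRepro {  // hypothetical table name and row key
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, TableName.valueOf("MY_TABLE"));
            // Deprecated 0.94-era API (removed in HBase 1.0), intended only
            // for meta lookups, but it is the call Phoenix issues here.
            Result r = table.getRowOrBefore(Bytes.toBytes("some_row_key"),
                                            Bytes.toBytes("0"));
            System.out.println(r);
            table.close();
        }
    }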
>>>>>>>>>>>
>>>>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Chris,
>>>>>>>>>>>>
>>>>>>>>>>>> I used the performance.py tool, which created a table with 50K
>>>>>>>>>>>> rows, ran the following query from sqlline.py, and everything
>>>>>>>>>>>> seems fine without the CPU running hot.
>>>>>>>>>>>>
>>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> | 50000      |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> | 50000      |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>>>>>>
>>>>>>>>>>>> Is there any way you could run a profiler to see where the CPU
>>>>>>>>>>>> goes?
>>>>>>>>>>>>
>>>>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks again,
>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>
>>>>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Chris,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Attachments are filtered out by the mail server. Can you
>>>>>>>>>>>>>> pastebin it some place?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We set the HBase RegionServer handler count to 10 (it
>>>>>>>>>>>>>>> appears to have been set to 60 by Ambari during the install
>>>>>>>>>>>>>>> process). Now we have narrowed down what causes the CPU to
>>>>>>>>>>>>>>> increase and have some detailed logs:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If we connect using sqlline.py and execute a select that
>>>>>>>>>>>>>>> selects one row using the primary key, no increase in CPU is
>>>>>>>>>>>>>>> observed and the number of RPC threads in a RUNNABLE state
>>>>>>>>>>>>>>> remains the same.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If we execute a select that scans the table, such as "select
>>>>>>>>>>>>>>> count(*) from TABLE", or one whose "where" clause only
>>>>>>>>>>>>>>> limits on non-primary-key attributes, then the number of
>>>>>>>>>>>>>>> RUNNABLE RpcServer.handler threads increases and the CPU
>>>>>>>>>>>>>>> utilization of the regionserver increases by ~105%.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Disconnecting the client has no effect: the
>>>>>>>>>>>>>>> RpcServer.handler thread is left RUNNABLE and the CPU stays
>>>>>>>>>>>>>>> at the higher usage.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Checking the web console for the regionserver just shows 10
>>>>>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state; no other
>>>>>>>>>>>>>>> monitored tasks are happening. The regionserver has a max
>>>>>>>>>>>>>>> heap of 10G and a used heap of 445.2M.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging
>>>>>>>>>>>>>>> turned on right when one of the Phoenix statements is
>>>>>>>>>>>>>>> executed (this statement actually used up the last available
>>>>>>>>>>>>>>> handler).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From the stack, it seems you increased the default RPC
>>>>>>>>>>>>>>>> handler count to about 60. All handlers are serving Get
>>>>>>>>>>>>>>>> requests (you can search for
>>>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You can check why there are so many Get requests by adding
>>>>>>>>>>>>>>>> some log info or enabling hbase RPC trace. I guess that if
>>>>>>>>>>>>>>>> you decrease the number of RPC handlers per region server,
>>>>>>>>>>>>>>>> it will mitigate your current issue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We have hit a problem with Phoenix: regionserver CPU usage
>>>>>>>>>>>>>>>>> spikes up to use all available CPU and the regionservers
>>>>>>>>>>>>>>>>> become unresponsive.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> After HDP 2.1 was released we set up a 4 compute node
>>>>>>>>>>>>>>>>> cluster (with 3 VMWare "master" nodes) to test out
>>>>>>>>>>>>>>>>> Phoenix. It is a plain Ambari 1.5/HDP 2.1 install; we
>>>>>>>>>>>>>>>>> added the HDP Phoenix RPM release and hand linked the jar
>>>>>>>>>>>>>>>>> files into the hadoop lib. Everything was going well and
>>>>>>>>>>>>>>>>> we were able to load ~30k records into several tables.
>>>>>>>>>>>>>>>>> What happened was that after about 3-4 days of being up,
>>>>>>>>>>>>>>>>> the regionservers became unresponsive and started to use
>>>>>>>>>>>>>>>>> most of the available CPU (12 core boxes). Nothing
>>>>>>>>>>>>>>>>> terribly informative was in the logs (initially we saw
>>>>>>>>>>>>>>>>> some flush messages that seemed excessive, but that was
>>>>>>>>>>>>>>>>> not all of the time, and we changed back to the standard
>>>>>>>>>>>>>>>>> HBase WAL codec). We are able to kill the unresponsive
>>>>>>>>>>>>>>>>> regionservers and then restart them; the cluster will be
>>>>>>>>>>>>>>>>> fine for a day or so but will start to lock up again.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We've dropped the entire HBase and ZooKeeper information
>>>>>>>>>>>>>>>>> and started from scratch, but that has not helped.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached
>>>>>>>>>>>>>>>>> a jstack report of a locked-up regionserver in hopes that
>>>>>>>>>>>>>>>>> someone can shed some light.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>> -chris