I am trying to reproduce this issue on our sandbox cluster. No matter how I insert the same data (random, sorted, reverse sorted, bulk, with upserts, etc.) I don't see this problem. I'll be copying the actual HFiles of the problematic table over next to see if that can do it.

-chris
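One way to carry the exact HFiles to another cluster, rather than freshly inserted data, is a snapshot export. This is a sketch only; the snapshot name and the sandbox NameNode address are placeholders:

    # Take a snapshot of the problematic table (HBase shell)
    hbase> snapshot 'MY_TABLE', 'mytable-repro'

    # Copy the snapshot's HFiles to the sandbox cluster (command line)
    $ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot mytable-repro \
        -copy-to hdfs://sandbox-nn:8020/hbase -mappers 4

    # Re-materialize the table from the same HFiles on the sandbox (HBase shell)
    hbase> clone_snapshot 'mytable-repro', 'MY_TABLE'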
On May 21, 2014, at 3:58 PM, Enis Soztutar <e...@hortonworks.com> wrote:

> Hey Chris,
>
> Thanks for bearing with us. It seems that the FAST_DIFF encoding + your data schema + use of getRowOrBefore() caused the issue. That is why Jeffrey was asking for the data. It is fine. getRowOrBefore() will go away anyway (https://issues.apache.org/jira/browse/HBASE-11230). Jeffrey's patch changes Phoenix to use a "reverse scan", which should not have an issue with FAST_DIFF. It would be awesome if you could test it with your data so that we can see whether it is a problem with reverse lookups in FAST_DIFF encoding, or a specific problem with getRowOrBefore().
>
> Thanks for reporting.
> Enis
>
> On Wed, May 21, 2014 at 3:44 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>
>> I'll start on building a patched version and trying it out (now that it is working, I'll have to wrangle the cluster away from the users). I cannot send the data; it is private.
>>
>> thanks,
>> -chris
>>
>> On May 21, 2014, at 3:04 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>
>>> Thanks for reporting the status back. Would you mind trying the patch on PHOENIX-992, or attaching data that reproduces the issue, so that we can verify the current fix? Thanks.
>>>
>>> On 5/21/14 2:06 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>
>>>> Altering the table to not use DATA_BLOCK_ENCODING fixed the problem. Just to verify that it was the fix, I altered it back to FAST_DIFF and the problem reappeared. Thank you!
>>>>
>>>> Adding the changes to the client hbase-site.xml did not help the issue, but it may be that I was not sourcing it correctly in sqlline; I made sure it was in the classpath passed to java via -cp when launching it.
>>>>
>>>> thanks again,
>>>> -chris
>>>>
>>>> On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>
>>>>> Hey Chris,
>>>>>
>>>>> Sorry about the trace option. The right option is below; where you previously turned on DEBUG, you just need to change it to TRACE.
>>>>>
>>>>> log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE
>>>>>
>>>>> Do the two read requests burn one of your CPUs to 100%? From the call stacks you posted today and before, it seems there might be an infinite loop inside the function HTable.getRowOrBefore(), which is deprecated and only supposed to work on the meta region (as Enis pointed out). Fortunately, only one place in Phoenix calls this function, so could you try the following things?
>>>>>
>>>>> Set the following configuration in the client-side hbase-site.xml to disable stats updates:
>>>>>
>>>>> phoenix.query.statsUpdateFrequency = 2147483647
>>>>> phoenix.query.maxStatsAge = 0
>>>>>
>>>>> After the configuration changes, could you verify whether the CPU still spikes for the same query?
>>>>>
>>>>> You can also do the following: alter your table's column family data block encoding to NONE, as DATA_BLOCK_ENCODING => 'NONE', and do a major compaction after that through the HBase client.
>>>>>
>>>>> We suspect that the combination of your data, the FAST_DIFF data block encoding, and the deprecated HTable.getRowOrBefore() causes the issue.
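To make Jeffrey's two suggestions concrete: the stats overrides go in an hbase-site.xml on the client's classpath, with exactly the values given above:

    <property>
      <name>phoenix.query.statsUpdateFrequency</name>
      <value>2147483647</value> <!-- Integer.MAX_VALUE: effectively never refresh stats -->
    </property>
    <property>
      <name>phoenix.query.maxStatsAge</name>
      <value>0</value>
    </property>

and the encoding change can be applied from the HBase shell. The table name here is a placeholder; the family name '0' matches the one visible in the scan logs below (Phoenix's default column family):

    hbase> alter 'MY_TABLE', {NAME => '0', DATA_BLOCK_ENCODING => 'NONE'}
    hbase> major_compact 'MY_TABLE'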
>>>>> On 5/20/14 3:03 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>
>>>>>> Hi Jeffrey,
>>>>>>
>>>>>> Thank you for taking a look. I added log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the custom log4j.properties in Ambari and did a rolling restart of the regionservers (and verified it was pushed out to the nodes in /etc/hbase/conf/log4j.properties), then issued a "select count(*)" on the table. Looking at the logs I don't see any TRACE messages. The only two log messages produced when issuing the select are:
>>>>>>
>>>>>> 2014-05-20 16:43:53,609 INFO [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
>>>>>> 2014-05-20 16:43:54,309 INFO [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608 rows for ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
>>>>>>
>>>>>> The metrics for the regionserver show only 2 total read requests since it started, and 0 per second now. Those two requests are for the table I did the count on. The regionserver's UI says it has no active RPC calls. It currently has one stuck handler from running that command:
>>>>>>
>>>>>> "RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000 nid=0x2582 runnable [0x00007fbd5ba6c000]
>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>         at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2888)
>>>>>>         at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2880)
>>>>>>         at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
>>>>>>         at java.util.TreeMap.getEntry(TreeMap.java:340)
>>>>>>         at java.util.TreeMap.get(TreeMap.java:273)
>>>>>>         at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDeleted(GetClosestRowBeforeTracker.java:128)
>>>>>>         at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandidate(GetClosestRowBeforeTracker.java:107)
>>>>>>         at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(GetClosestRowBeforeTracker.java:202)
>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1772)
>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>         at java.lang.Thread.run(Thread.java:744)
>>>>>>
>>>>>> It also has one listener running:
>>>>>>
>>>>>> "RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000 nid=0x2579 runnable [0x00007fbd5c376000]
>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>>>>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>>>>>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>>>>>>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>>>>>>         - locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
>>>>>>         - locked <0x000000059ae289a0> (a java.util.Collections$UnmodifiableSet)
>>>>>>         - locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
>>>>>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>>>>>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>>>>>>         at org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
>>>>>>
>>>>>> If there is anything else you need, please let me know.
>>>>>>
>>>>>> thanks again,
>>>>>> -chris
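A note for readers following the logging back-and-forth: the logger name tried here appears to be the pre-0.96 one, and Jeffrey's correction upthread amounts to this log4j.properties change:

    # Has no effect on HBase 0.96+ (old class name):
    # log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
    # Works on HBase 0.96+ RPC tracing:
    log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE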
>>>>>> On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>
>>>>>>> I rechecked your RPC log and found that you didn't turn on the RPC trace level. You can set the following so that we can see the detailed requests & responses:
>>>>>>>
>>>>>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
>>>>>>>
>>>>>>> If you didn't see a timeout exception in your region server logs, it means that something keeps issuing lots of Get requests. Have you used a secondary index on this table? If that's the case, please disable/drop it to see if that helps.
>>>>>>>
>>>>>>> On 5/20/14 1:21 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>
>>>>>>>> I will see how possible a new, all-bare-metal cluster is, but we have 5 other similarly configured clusters with no troubles. We only have an issue when Phoenix does a full scan of some tables: a handler gets stuck in org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile after it has returned all of its data.
>>>>>>>>
>>>>>>>> thank you,
>>>>>>>> -chris
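For context, the server-side path Chris names here is reached from the deprecated client call that Enis and Jeffrey discuss above. A minimal sketch against the 0.96-era HBase client API; the table name, row key, and family are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetRowOrBeforeSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "MY_TABLE"); // placeholder table name
            try {
                // Deprecated, and only intended for meta lookups; this is the
                // client entry point that lands in HRegion.getClosestRowBefore()
                // and HStore.rowAtOrBeforeFromStoreFile() on the server.
                Result r = table.getRowOrBefore(Bytes.toBytes("row-key"), Bytes.toBytes("0"));
                System.out.println(r);
            } finally {
                table.close();
            }
        }
    }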
>>>>>>>> On May 20, 2014, at 12:34 PM, alex kamil <alex.ka...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'm just guessing, but maybe the region servers are stuck because they can't communicate with ZooKeeper or the HMaster, or maybe the datanodes can't talk to the namenode because of some VMware networking quirk and some internal thread is hanging; who knows. It would take 30 minutes to set up a bare-metal cluster vs. days debugging this hybrid setup. I'd also suggest installing directly from the HBase/Hadoop websites to eliminate additional variables.
>>>>>>>>>
>>>>>>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>>
>>>>>>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The datanodes/regionservers with the stuck processes are all bare metal.
>>>>>>>>>>
>>>>>>>>>> -chris
>>>>>>>>>>
>>>>>>>>>> On May 20, 2014, at 11:52 AM, alex kamil <alex.ka...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>> "with 3 VMWare "master" nodes"
>>>>>>>>>>>
>>>>>>>>>>> Can you try running HBase/Phoenix on physical nodes instead of using virtual machines?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry to follow up on my own message, but I was wondering if anyone had any ideas? Normal non-Phoenix scans don't cause this symptom, but a "select *" on the exact same table right afterwards will.
>>>>>>>>>>>>
>>>>>>>>>>>> If we export the table and then re-import it into a new table, the new table doesn't exhibit these symptoms, same as if we use an upsert..select to do a copy. It seems something happens to the last region to cause this, but it is not directly data dependent. Moving the region to another regionserver doesn't have any effect; it just moves where the problem happens. Major compactions get hung up by the running threads, as they probably hold a lock.
>>>>>>>>>>>>
>>>>>>>>>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>>>>>>>>>>
>>>>>>>>>>>> Figuring this out will allow this project to continue; as of now it is hung up on this issue.
>>>>>>>>>>>>
>>>>>>>>>>>> thank you,
>>>>>>>>>>>> -chris
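The "hfile tool" mentioned is presumably HBase's HFile pretty-printer. A typical invocation against one of the region's store files would look like the following; the path is a placeholder for a real store file:

    $ hbase org.apache.hadoop.hbase.io.hfile.HFile -v -m -k \
        -f /apps/hbase/data/data/default/MY_TABLE/<region>/0/<storefile>
    # -v verbose output, -m print the HFile meta block, -k check that row keys are in order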
>>>>>>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I did some "poor man's" profiling with multiple jstacks and came up with where the RpcServer.handler appears to be stuck: org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile. That is the deepest method in all of the traces, either HStore.java:1712 or HStore.java:1722. Here are two example traces for a thread (which has been running for the last couple of hours):
>>>>>>>>>>>>>
>>>>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>>>>>>>>         at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>>>>>
>>>>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>>>>>>>>         at java.lang.Thread.run(Thread.java:744)
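The "poor man's profiling" Chris describes is just repeated thread dumps; a sketch of the loop, assuming jps can find the regionserver process:

    # Sample the regionserver's stacks every 5 seconds, ten times, then look
    # for RpcServer.handler threads that stay RUNNABLE in the same frames.
    RS_PID=$(jps | grep HRegionServer | awk '{print $1}')
    for i in $(seq 1 10); do
      jstack "$RS_PID" >> rs-stacks.txt
      sleep 5
    done
    grep -A 12 'RpcServer.handler' rs-stacks.txt | less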
>>>>>>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Chris,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I used the performance.py tool, which created a table with 50K rows, then ran the following query from sqlline.py, and everything seems fine, without the CPU running hot:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> | 50000      |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> | 50000      |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there any way you could run a profiler to see where the CPU goes?
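For anyone repeating Jeffrey's test: performance.py ships in Phoenix's bin/ directory and takes the ZooKeeper quorum and a row count, so the 50K-row table above would have come from something like this (quorum host is a placeholder):

    $ bin/performance.py zkhost 50000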
>>>>>>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thanks again,
>>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Chris,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Attachments are filtered out by the mail server. Can you pastebin it some place?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We set the HBase regionserver handler count to 10 (it appears to have been set to 60 by Ambari during the install process). Now we have narrowed down what causes the CPU to increase, and have some detailed logs:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If we connect using sqlline.py and execute a select that fetches one row using the primary key, no increase in CPU is observed and the number of RPC threads in a RUNNABLE state remains the same.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If we execute a select that scans the table, such as "select count(*) from TABLE", or one whose "where" clause only limits on non-primary-key attributes, then the number of RUNNABLE RpcServer.handler threads increases and the CPU utilization of the regionserver increases by ~105%.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Disconnecting the client does not have an effect: the RpcServer.handler thread is left RUNNABLE and the CPU stays at the higher usage.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Checking the web console for the regionserver just shows 10 RpcServer.reader tasks, all in a WAITING state; no other monitored tasks are happening. The regionserver has a max heap of 10G and a used heap of 445.2M.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging turned on right when one of the Phoenix statements was executed (this statement actually used up the last available handler).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> From the stack, it seems you increased the default RPC handler count to about 60. All handlers are serving Get requests (you can search for org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You can check why there are so many Get requests by adding some log info or enabling the HBase RPC trace. I guess that if you decrease the number of RPC handlers per region server, it will mitigate your current issue.
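The handler pool Jeffrey refers to, which Chris then dropped from Ambari's 60 down to 10, is sized by this server-side hbase-site.xml property:

    <property>
      <name>hbase.regionserver.handler.count</name>
      <value>10</value>
    </property>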
>>>>>>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We have hit a problem with Phoenix where regionserver CPU usage spikes up to use all available CPU and the regionservers become unresponsive.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> After HDP 2.1 was released we set up a 4-compute-node cluster (with 3 VMware "master" nodes) to test out Phoenix on it. It is a plain Ambari 1.5/HDP 2.1 install; we added the HDP Phoenix RPM release and hand-linked the jar files into the Hadoop lib. Everything was going well and we were able to load ~30k records into several tables. What happened was that after about 3-4 days of being up, the regionservers became unresponsive and started to use most of the available CPU (12-core boxes). Nothing terribly informative was in the logs (initially we saw some flush messages that seemed excessive, but that was not all of the time, and we changed back to the standard HBase WAL codec). We are able to kill the unresponsive regionservers and then restart them; the cluster will be fine for a day or so but will start to lock up again.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We've dropped the entire HBase and ZooKeeper information and started from scratch, but that has not helped.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a jstack report of a locked-up regionserver in hopes that someone can shed some light.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>>>> -chris
If you > >>>>>>>>>>>> have > >>>>>>>>>>>> received this communication in error, please contact the sender > >>>>>>>>>> immediately > >>>>>>>>>>>> and delete it from your system. Thank You. > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> -- > >>>>> CONFIDENTIALITY NOTICE > >>>>> NOTICE: This message is intended for the use of the individual or > >>>>> entity to > >>>>> which it is addressed and may contain information that is > >>>>> confidential, > >>>>> privileged and exempt from disclosure under applicable law. If the > >>>>> reader > >>>>> of this message is not the intended recipient, you are hereby notified > >>>>> that > >>>>> any printing, copying, dissemination, distribution, disclosure or > >>>>> forwarding of this communication is strictly prohibited. If you have > >>>>> received this communication in error, please contact the sender > >>>>> immediately > >>>>> and delete it from your system. Thank You. > >>>> > >>> > >>> > >>> > >>> -- > >>> CONFIDENTIALITY NOTICE > >>> NOTICE: This message is intended for the use of the individual or > >>> entity to > >>> which it is addressed and may contain information that is confidential, > >>> privileged and exempt from disclosure under applicable law. If the > >>> reader > >>> of this message is not the intended recipient, you are hereby notified > >>> that > >>> any printing, copying, dissemination, distribution, disclosure or > >>> forwarding of this communication is strictly prohibited. If you have > >>> received this communication in error, please contact the sender > >>> immediately > >>> and delete it from your system. Thank You. > >> > > > > > > > > -- > > CONFIDENTIALITY NOTICE > > NOTICE: This message is intended for the use of the individual or entity to > > which it is addressed and may contain information that is confidential, > > privileged and exempt from disclosure under applicable law. If the reader > > of this message is not the intended recipient, you are hereby notified that > > any printing, copying, dissemination, distribution, disclosure or > > forwarding of this communication is strictly prohibited. If you have > > received this communication in error, please contact the sender immediately > > and delete it from your system. Thank You. > > > > > -- > > > CONFIDENTIALITY NOTICE > NOTICE: This message is intended for the use of the individual or entity to > which it is addressed and may contain information that is confidential, > privileged and exempt from disclosure under applicable law. If the reader of > this message is not the intended recipient, you are hereby notified that any > printing, copying, dissemination, distribution, disclosure or forwarding of > this communication is strictly prohibited. If you have received this > communication in error, please contact the sender immediately and delete it > from your system. Thank You.