Only the master nodes (HMaster, ZK and NN) are VMs. The datanodes/regionservers with the stuck processes are all bare metal.
-chris

On May 20, 2014, at 11:52 AM, alex kamil <alex.ka...@gmail.com> wrote:

> > "with 3 VMWare "master" nodes"
>
> can you try running hbase/phoenix on physical nodes instead of using
> virtual machines
>
>
> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>
>> Sorry to follow up on my own message, but I was wondering if anyone had
>> any ideas? Normal non-Phoenix scans don't cause this symptom, but it
>> appears right after a select * on the exact same table.
>>
>> If we export the table and then re-import it into a new table, the new
>> table doesn't exhibit these symptoms, same as if we use an upsert..select
>> to do a copy. It seems something happens to the last region to cause this,
>> but it is not directly data dependent. Moving the region to another
>> regionserver doesn't have any effect - it just moves where the problem
>> happens. Major compactions get hung up by the running threads, as they
>> probably hold a lock.
>>
>> I've run the hfile tool on the final region and nothing seems awry.
>>
>> Figuring this out will allow this project to continue; as of now it is
>> hung up on this issue.
>>
>> thank you,
>> -chris
>>
>>
>> On May 15, 2014, at 8:47 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>
>>> I did some "poor man's" profiling with multiple jstack runs and came up
>>> with where the RpcServer.handler threads appear to be stuck:
>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>>> That is the deepest method common to all of the traces, at either
>>> HStore.java:1712 or HStore.java:1722. Here are two example traces for a
>>> thread (which has been running for the last couple of hours):
>>>
>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>    java.lang.Thread.State: RUNNABLE
>>>    at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>>    at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>>    at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>>    at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>    at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>    at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>    at java.lang.Thread.run(Thread.java:744)
>>>
>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>    java.lang.Thread.State: RUNNABLE
>>>    at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>>    at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>>    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>>    at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>>    at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>    at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>    at java.lang.Thread.run(Thread.java:744)
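(A minimal sketch of that kind of jstack sampling, in case anyone wants to reproduce it; the PID lookup and output paths below are illustrative, not the exact commands run on this cluster:)

  # take 10 jstack samples of the regionserver JVM, 5 seconds apart
  RS_PID=$(pgrep -f proc_regionserver | head -1)
  for i in $(seq 1 10); do
    jstack "$RS_PID" > /tmp/rs-jstack.$i
    sleep 5
  done
  # tally which HBase frames the RpcServer.handler threads sit in across samples
  grep -h -A 20 'RpcServer.handler' /tmp/rs-jstack.* \
    | grep 'at org.apache.hadoop.hbase' | sort | uniq -c | sort -rn | head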
>>>
>>>
>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>
>>>> Hey Chris,
>>>>
>>>> I used the performance.py tool, which created a table with 50K rows, ran
>>>> the following query from sqlline.py, and everything seems fine without
>>>> seeing the CPU run hot.
>>>>
>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>> +------------+
>>>> |  COUNT(1)  |
>>>> +------------+
>>>> | 50000      |
>>>> +------------+
>>>> 1 row selected (0.166 seconds)
>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>> +------------+
>>>> |  COUNT(1)  |
>>>> +------------+
>>>> | 50000      |
>>>> +------------+
>>>> 1 row selected (0.167 seconds)
>>>>
>>>> Is there any way you could run a profiler to see where the CPU goes?
>>>>
>>>>
>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>
>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>
>>>>> http://pastebin.com/w6mtabag
>>>>>
>>>>> thanks again,
>>>>> -chris
>>>>>
>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>>
>>>>>> Hi Chris,
>>>>>>
>>>>>> Attachments are filtered out by the mail server. Can you pastebin it
>>>>>> some place?
>>>>>>
>>>>>> Thanks,
>>>>>> Nick
>>>>>>
>>>>>>
>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We set the HBase RegionServer Handler count to 10 (it appears to have
>>>>>>> been set to 60 by Ambari during the install process). Now we have
>>>>>>> narrowed down what causes the CPU to increase and have some detailed
>>>>>>> logs:
>>>>>>>
>>>>>>> If we connect using sqlline.py and execute a select that fetches one
>>>>>>> row by the primary key, no increase in CPU is observed and the number
>>>>>>> of RPC threads in a RUNNABLE state remains the same.
>>>>>>>
>>>>>>> If we execute a select that scans the table, such as "select count(*)
>>>>>>> from TABLE", or whose "where" clause only limits on non-primary-key
>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler threads
>>>>>>> increases and the CPU utilization of the regionserver increases by
>>>>>>> ~105%.
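(To make the two cases concrete, the statement shapes are roughly the following; the table and column names here are illustrative, not the real schema:)

  -- point lookup on the primary key: no change in CPU or in RUNNABLE handlers
  SELECT * FROM SAMPLE_TABLE WHERE ID = 'row-000123';

  -- full scan, or a filter on a non-primary-key column: one more handler
  -- goes RUNNABLE per statement and the regionserver CPU climbs
  SELECT COUNT(*) FROM SAMPLE_TABLE;
  SELECT * FROM SAMPLE_TABLE WHERE STATUS = 'active';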
>>>>>>>
>>>>>>> Disconnecting the client does not have an effect; the
>>>>>>> RpcServer.handler thread is left RUNNABLE and the CPU stays at the
>>>>>>> higher usage.
>>>>>>>
>>>>>>> Checking the Web Console for the regionserver just shows 10
>>>>>>> RpcServer.reader tasks, all in a WAITING state; no other monitored
>>>>>>> tasks are happening. The regionserver has a Max Heap of 10G and a
>>>>>>> Used Heap of 445.2M.
>>>>>>>
>>>>>>> I've attached the regionserver log with IPC debug logging turned on,
>>>>>>> captured right when one of the Phoenix statements is executed (this
>>>>>>> statement actually used up the last available handler).
>>>>>>>
>>>>>>> thanks,
>>>>>>> -chris
>>>>>>>
>>>>>>>
>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>>
>>>>>>>> From the stack, it seems you increased the default rpc handler
>>>>>>>> number to about 60. All handlers are serving Get requests (you can
>>>>>>>> search for
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>>
>>>>>>>> You can check why there are so many get requests by adding some log
>>>>>>>> info or enabling hbase rpc trace. I guess if you decrease the number
>>>>>>>> of rpc handlers per region server, it will mitigate your current
>>>>>>>> issue.
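(For reference, the two suggestions above correspond roughly to the following knobs; the logger name and values are illustrative of what was tried, not exact excerpts from this cluster's config:)

  # log4j.properties on the regionserver: trace the RPC layer seen in the stacks
  log4j.logger.org.apache.hadoop.hbase.ipc=TRACE

  # hbase-site.xml: lower the handler count from the Ambari-set 60
  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>10</value>
  </property>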
>>>>>>>>
>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>>
>>>>>>>>> We have hit a problem with Phoenix: regionserver CPU usage spikes
>>>>>>>>> up to use all available CPU and the regionservers become
>>>>>>>>> unresponsive.
>>>>>>>>>
>>>>>>>>> After HDP 2.1 was released we set up a 4-compute-node cluster (with
>>>>>>>>> 3 VMWare "master" nodes) to test out Phoenix on it. It is a plain
>>>>>>>>> Ambari 1.5/HDP 2.1 install; we added the HDP Phoenix RPM release
>>>>>>>>> and hand-linked the jar files into the hadoop lib. Everything was
>>>>>>>>> going well and we were able to load ~30k records into several
>>>>>>>>> tables. What happened was that after about 3-4 days of being up,
>>>>>>>>> the regionservers became unresponsive and started to use most of
>>>>>>>>> the available CPU (12-core boxes). Nothing terribly informative was
>>>>>>>>> in the logs (initially we saw some flush messages that seemed
>>>>>>>>> excessive, but that was not all of the time, and we changed back to
>>>>>>>>> the standard HBase WAL codec). We are able to kill the unresponsive
>>>>>>>>> regionservers and then restart them; the cluster will be fine for a
>>>>>>>>> day or so but will start to lock up again.
>>>>>>>>>
>>>>>>>>> We've dropped the entire HBase and zookeeper information and
>>>>>>>>> started from scratch, but that has not helped.
>>>>>>>>>
>>>>>>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>>>>>>> report of a locked-up regionserver in hopes that someone can shed
>>>>>>>>> some light.
>>>>>>>>>
>>>>>>>>> thanks,
>>>>>>>>> -chris