I will look into whether a new, all bare metal cluster is feasible, but we have 5 other similarly configured clusters with no trouble. The only issue we see is that when Phoenix does a full scan of certain tables, a handler gets stuck in org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile after the scan has returned all of its data.

thank you,
-chris
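For reference, the kind of full-table scan that triggers the symptom can be issued from a minimal Phoenix JDBC client. The sketch below is illustrative only and makes assumptions not taken from this thread: the ZooKeeper quorum, the table name MY_TABLE, and the exact query are placeholders for whatever your cluster and schema actually use.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PhoenixFullScanRepro {
        public static void main(String[] args) throws Exception {
            // The Phoenix client jar must be on the classpath.
            Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
            // Hypothetical ZooKeeper quorum; substitute your own.
            String url = "jdbc:phoenix:zk1,zk2,zk3:2181";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 // A full scan with no primary-key filter -- the query shape
                 // reported in this thread to leave handlers busy afterwards.
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM MY_TABLE")) {
                while (rs.next()) {
                    System.out.println("row count = " + rs.getLong(1));
                }
            }
            // After this returns, jstack the regionserver and look for
            // RpcServer.handler threads still RUNNABLE in
            // HStore.rowAtOrBeforeFromStoreFile.
        }
    }

Point queries on the primary key did not show the problem later in this thread, so the interesting case is the unfiltered aggregate above.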
On May 20, 2014, at 12:34 PM, alex kamil <alex.ka...@gmail.com> wrote:

> I'm just guessing, but maybe the region servers are stuck because they can't
> communicate with ZooKeeper or the HMaster, or maybe the datanodes can't talk
> to the namenode because of some VMware networking quirk and some internal
> thread is hanging - who knows. It would take 30 minutes to set up a bare
> metal cluster vs. days debugging this hybrid setup. I'd also suggest
> installing directly from the hbase/hadoop websites to eliminate additional
> variables.
>
> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>
>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>> datanodes/regionservers with the stuck processes are all bare metal.
>>
>> -chris
>>
>> On May 20, 2014, at 11:52 AM, alex kamil <alex.ka...@gmail.com> wrote:
>>
>>>> "with 3 VMWare "master" nodes"
>>>
>>> Can you try running hbase/phoenix on physical nodes instead of using
>>> virtual machines?
>>>
>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>
>>>> Sorry to follow up on my own message, but I was wondering if anyone had
>>>> any ideas? Normal non-Phoenix scans don't cause this symptom, but a
>>>> select * on the exact same table right afterwards will.
>>>>
>>>> If we export the table and then re-import it into a new table, the new
>>>> table doesn't exhibit these symptoms, same as if we use an upsert..select
>>>> to do a copy. It seems something happens to the last region to cause
>>>> this, but it is not directly data dependent. Moving the region to another
>>>> regionserver doesn't have any effect - it just moves where the problem
>>>> happens. Major compactions get hung up by the running threads, as they
>>>> probably hold a lock.
>>>>
>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>>
>>>> Figuring this out will allow this project to continue; as of now it is
>>>> hung up on this issue.
>>>>
>>>> thank you,
>>>> -chris
>>>>
>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>
>>>>> I did some "poor man's" profiling with multiple jstack runs and found
>>>>> where the RpcServer.handler threads appear to be stuck:
>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>>>>> That is the deepest method in all of the traces, either HStore.java:1712
>>>>> or HStore.java:1722.
>>>>> Here are two example traces for a thread (which has been running for the
>>>>> last couple of hours):
>>>>>
>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>     at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>>>>     at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>>
>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>     at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>>>>     at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>     at java.lang.Thread.run(Thread.java:744)
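The "poor man's profiling" quoted above comes down to taking several jstack snapshots and noting which frame the RpcServer.handler threads keep landing in. A rough sketch of that sampling loop follows; it assumes jstack is on the PATH and that the regionserver pid is passed as the first argument, and the sample count and interval are arbitrary placeholders.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class HandlerStackSampler {
        public static void main(String[] args) throws Exception {
            long pid = Long.parseLong(args[0]);   // regionserver pid, e.g. from `jps`
            for (int sample = 0; sample < 10; sample++) {
                Process p = new ProcessBuilder("jstack", Long.toString(pid)).start();
                boolean inHandler = false;
                try (BufferedReader r = new BufferedReader(
                        new InputStreamReader(p.getInputStream()))) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        // A thread entry starts with a quoted thread name;
                        // keep only RpcServer.handler threads.
                        if (line.startsWith("\"")) {
                            inHandler = line.contains("RpcServer.handler");
                        }
                        if (inHandler && (line.startsWith("\"")
                                || line.trim().startsWith("at "))) {
                            System.out.println("sample " + sample + ": " + line.trim());
                        }
                    }
                }
                p.waitFor();
                Thread.sleep(2000);   // space the samples out a bit
            }
        }
    }

If every sample keeps showing the same deepest frame (here rowAtOrBeforeFromStoreFile), that frame is where the handler is spending its time.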
>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>
>>>>>> Hey Chris,
>>>>>>
>>>>>> I used the performance.py tool, which created a table with 50K rows, ran
>>>>>> the following query from sqlline.py, and everything seems fine without
>>>>>> the CPU running hot.
>>>>>>
>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>> +------------+
>>>>>> |  COUNT(1)  |
>>>>>> +------------+
>>>>>> | 50000      |
>>>>>> +------------+
>>>>>> 1 row selected (0.166 seconds)
>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>> +------------+
>>>>>> |  COUNT(1)  |
>>>>>> +------------+
>>>>>> | 50000      |
>>>>>> +------------+
>>>>>> 1 row selected (0.167 seconds)
>>>>>>
>>>>>> Is there any way you could run a profiler to see where the CPU goes?
>>>>>>
>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>
>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>
>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>
>>>>>>> thanks again,
>>>>>>> -chris
>>>>>>>
>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Chris,
>>>>>>>>
>>>>>>>> Attachments are filtered out by the mail server. Can you pastebin it
>>>>>>>> some place?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We set the HBase regionserver handler count to 10 (it appears to have
>>>>>>>>> been set to 60 by Ambari during the install process). We have now
>>>>>>>>> narrowed down what causes the CPU to increase and have some detailed
>>>>>>>>> logs:
>>>>>>>>>
>>>>>>>>> If we connect using sqlline.py and execute a select that fetches one
>>>>>>>>> row using the primary key, no increase in CPU is observed and the
>>>>>>>>> number of RPC threads in a RUNNABLE state remains the same.
>>>>>>>>>
>>>>>>>>> If we execute a select that scans the table, such as "select count(*)
>>>>>>>>> from TABLE", or where the "where" clause only limits on
>>>>>>>>> non-primary-key attributes, then the number of RUNNABLE
>>>>>>>>> RpcServer.handler threads increases and the CPU utilization of the
>>>>>>>>> regionserver increases by ~105%.
>>>>>>>>>
>>>>>>>>> Disconnecting the client does not have an effect; the
>>>>>>>>> RpcServer.handler thread is left RUNNABLE and the CPU stays at the
>>>>>>>>> higher usage.
>>>>>>>>>
>>>>>>>>> Checking the web console for the regionserver just shows 10
>>>>>>>>> RpcServer.reader tasks, all in a WAITING state; no other monitored
>>>>>>>>> tasks are running. The regionserver has a max heap of 10G and a used
>>>>>>>>> heap of 445.2M.
>>>>>>>>>
>>>>>>>>> I've attached the regionserver log with IPC debug logging turned on
>>>>>>>>> right when one of the Phoenix statements is executed (this statement
>>>>>>>>> actually used up the last available handler).
>>>>>>>>>
>>>>>>>>> thanks,
>>>>>>>>> -chris
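One way to quantify the "RUNNABLE RpcServer.handler threads increase" observation above is to poll the regionserver's thread states over JMX instead of eyeballing full jstack dumps. The sketch below is only an illustration: it assumes the regionserver JVM was started with remote JMX enabled, and the host and port in the service URL are placeholders, not values taken from this thread.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class RunnableHandlerCount {
        public static void main(String[] args) throws Exception {
            // Placeholder host/port; requires com.sun.management.jmxremote
            // options on the regionserver JVM.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://regionserver-host:10102/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
                ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
                int runnableHandlers = 0;
                for (ThreadInfo info
                        : threads.getThreadInfo(threads.getAllThreadIds(), 8)) {
                    if (info != null
                            && info.getThreadName().startsWith("RpcServer.handler")
                            && info.getThreadState() == Thread.State.RUNNABLE) {
                        runnableHandlers++;
                        // Print the top frame so stuck handlers are obvious.
                        System.out.println(info.getThreadName() + " -> "
                            + (info.getStackTrace().length > 0
                                ? info.getStackTrace()[0] : "(no frames)"));
                    }
                }
                System.out.println("RUNNABLE RpcServer.handler threads: "
                    + runnableHandlers);
            } finally {
                jmxc.close();
            }
        }
    }

Running this before and after a full-scan query would make the handler counts directly comparable.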
>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>>>>
>>>>>>>>>> From the stack, it seems you increased the default RPC handler number
>>>>>>>>>> to about 60. All handlers are serving Get requests (you can search
>>>>>>>>>> for org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>>>>
>>>>>>>>>> You can check why there are so many Get requests by adding some log
>>>>>>>>>> info or enabling HBase RPC trace. I guess that if you decrease the
>>>>>>>>>> number of RPC handlers per region server, it will mitigate your
>>>>>>>>>> current issue.
>>>>>>>>>>
>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> We have hit a problem with Phoenix: regionserver CPU usage spikes up
>>>>>>>>>>> to use all available CPU and the regionservers become unresponsive.
>>>>>>>>>>>
>>>>>>>>>>> After HDP 2.1 was released we set up a 4 compute node cluster (with
>>>>>>>>>>> 3 VMWare "master" nodes) to test out Phoenix on it. It is a plain
>>>>>>>>>>> Ambari 1.5/HDP 2.1 install; we added the HDP Phoenix RPM release and
>>>>>>>>>>> hand linked in the jar files to the hadoop lib. Everything was going
>>>>>>>>>>> well and we were able to load ~30k records into several tables. What
>>>>>>>>>>> happened was that after about 3-4 days of being up the regionservers
>>>>>>>>>>> became unresponsive and started to use most of the available CPU (12
>>>>>>>>>>> core boxes). Nothing terribly informative was in the logs (initially
>>>>>>>>>>> we saw some flush messages that seemed excessive, but that was not
>>>>>>>>>>> all of the time and we changed back to the standard HBase WAL
>>>>>>>>>>> codec). We are able to kill the unresponsive regionservers and then
>>>>>>>>>>> restart them; the cluster will be fine for a day or so but will
>>>>>>>>>>> start to lock up again.
>>>>>>>>>>>
>>>>>>>>>>> We've dropped the entire HBase and ZooKeeper information and started
>>>>>>>>>>> from scratch, but that has not helped.
>>>>>>>>>>>
>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>>>>>>>>> report of a locked-up regionserver in hopes that someone can shed
>>>>>>>>>>> some light.
>>>>>>>>>>>
>>>>>>>>>>> thanks,
>>>>>>>>>>> -chris