> "with 3 VMWare "master" nodes"

Can you try running HBase/Phoenix on physical nodes instead of using virtual machines?
On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:

> Sorry to follow up on my own message, but I was wondering if anyone had
> any ideas? Normal non-Phoenix scans don't cause this symptom, but run
> right after a select * on the exact same table, they will.
>
> If we export the table and then re-import it into a new table, the new
> table doesn't exhibit these symptoms; the same is true if we use an
> upsert ... select to do a copy. It seems something happens to the last
> region to cause this, but it is not directly data dependent. Moving the
> region to another regionserver doesn't have any effect; it just moves
> where the problem happens. Major compactions get hung up by the running
> threads, as they probably hold a lock.
>
> I've run the hfile tool on the final region and nothing seems awry.
>
> Figuring this out will allow this project to continue; as of now it is
> hung up on this issue.
>
> thank you,
> -chris
>
> On May 15, 2014, at 8:47 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>
>> I did some "poor man's" profiling with multiple jstacks and came up
>> with where the RpcServer.handler threads appear to be stuck:
>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>> That is the deepest method in all of the traces, either
>> HStore.java:1712 or HStore.java:1722.
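The multiple-jstack profiling just described can be sketched as a small shell helper. The `top_frames` function, the sampling loop, and `RS_PID` are illustrative, not from the thread; the idea is to tally the deepest frame of every RUNNABLE thread across repeated dumps so the hot frame bubbles to the top:

```shell
# Tally the deepest frame of each RUNNABLE thread across one or more
# jstack dumps read from stdin; the most frequent frames sort first.
top_frames() {
  awk '/java.lang.Thread.State: RUNNABLE/ { grab = 1; next }
       grab && /^[[:space:]]*at / { print $2; grab = 0 }' |
  sort | uniq -c | sort -rn
}

# Hypothetical usage against a live regionserver (RS_PID is a placeholder):
#   for i in 1 2 3 4 5; do jstack "$RS_PID"; sleep 2; done | top_frames
```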
>> Here are two example traces for a thread (which has been running for
>> the last couple of hours):
>>
>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>    java.lang.Thread.State: RUNNABLE
>>         at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>         at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>         at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>         at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>         at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>         at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>         at java.lang.Thread.run(Thread.java:744)
>>
>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>    java.lang.Thread.State: RUNNABLE
>>         at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>         at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>         at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>         at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>         at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>         at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>         at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>         at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>         at java.lang.Thread.run(Thread.java:744)
>>
>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>
>>> Hey Chris,
>>>
>>> I used the performance.py tool, which created one table with 50K rows,
>>> ran the following query from sqlline.py, and everything seems fine,
>>> without seeing the CPU running hot.
>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>> +------------+
>>> |  COUNT(1)  |
>>> +------------+
>>> |   50000    |
>>> +------------+
>>> 1 row selected (0.166 seconds)
>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>> +------------+
>>> |  COUNT(1)  |
>>> +------------+
>>> |   50000    |
>>> +------------+
>>> 1 row selected (0.167 seconds)
>>>
>>> Is there any way you could run a profiler to see where the CPU goes?
>>>
>>> On 5/13/14 6:40 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>
>>>> Ahh, yes. Here is a pastebin for it:
>>>>
>>>> http://pastebin.com/w6mtabag
>>>>
>>>> thanks again,
>>>> -chris
>>>>
>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>
>>>>> Hi Chris,
>>>>>
>>>>> Attachments are filtered out by the mail server. Can you pastebin it
>>>>> some place?
>>>>>
>>>>> Thanks,
>>>>> Nick
>>>>>
>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <c...@biotiquesystems.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We set the HBase regionserver handler count to 10 (it appears to
>>>>>> have been set to 60 by Ambari during the install process). Now we
>>>>>> have narrowed down what causes the CPU usage to increase, and have
>>>>>> some detailed logs:
>>>>>>
>>>>>> If we connect using sqlline.py and execute a select that fetches one
>>>>>> row by primary key, no increase in CPU is observed and the number of
>>>>>> RPC threads in a RUNNABLE state remains the same.
>>>>>>
>>>>>> If we execute a select that scans the table, such as "select count(*)
>>>>>> from TABLE", or one whose "where" clause only limits on
>>>>>> non-primary-key attributes, then the number of RUNNABLE
>>>>>> RpcServer.handler threads increases and the CPU utilization of the
>>>>>> regionserver increases by ~105%.
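The symptom just described (RpcServer.handler threads stuck in RUNNABLE after each full-table scan) can be watched with a one-off jstack counter. This helper is a sketch, not from the thread; `RS_PID` is a placeholder:

```shell
# Count how many RpcServer.handler threads are in state RUNNABLE in a
# single jstack dump read from stdin. A count that grows after each
# full-table scan matches the handler leak described above.
count_runnable_handlers() {
  awk '/^"RpcServer.handler/     { handler = 1; next }
       handler && /Thread.State/ { if (/RUNNABLE/) n++; handler = 0 }
       END                       { print n + 0 }'
}

# Hypothetical usage:
#   jstack "$RS_PID" | count_runnable_handlers
```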
>>>>>>
>>>>>> Disconnecting the client does not have an effect; the
>>>>>> RpcServer.handler thread is left RUNNABLE and the CPU stays at the
>>>>>> higher usage.
>>>>>>
>>>>>> Checking the web console for the regionserver just shows 10
>>>>>> RpcServer.reader tasks, all in a WAITING state; no other monitored
>>>>>> tasks are happening. The regionserver has a max heap of 10G and a
>>>>>> used heap of 445.2M.
>>>>>>
>>>>>> I've attached the regionserver log with IPC debug logging turned on
>>>>>> right when one of the Phoenix statements was executed (this
>>>>>> statement actually used up the last available handler).
>>>>>>
>>>>>> thanks,
>>>>>> -chris
>>>>>>
>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzh...@hortonworks.com> wrote:
>>>>>>
>>>>>>> From the stack, it seems you increased the default RPC handler
>>>>>>> number to about 60. All handlers are serving Get requests (you can
>>>>>>> search for
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>
>>>>>>> You can check why there are so many Get requests by adding some log
>>>>>>> info or enabling HBase RPC trace. I guess that if you decrease the
>>>>>>> number of RPC handlers per regionserver, it will mitigate your
>>>>>>> current issue.
>>>>>>>
>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <c...@biotiquesystems.com> wrote:
>>>>>>>
>>>>>>>> We have hit a problem with Phoenix: regionserver CPU usage spikes
>>>>>>>> up to use all available CPU and the regionservers become
>>>>>>>> unresponsive.
>>>>>>>>
>>>>>>>> After HDP 2.1 was released we set up a 4-compute-node cluster
>>>>>>>> (with 3 VMWare "master" nodes) to test out Phoenix. It is a plain
>>>>>>>> Ambari 1.5/HDP 2.1 install; we added the HDP Phoenix RPM release
>>>>>>>> and hand-linked the jar files into the hadoop lib.
>>>>>>>> Everything was going well and we were able to load ~30k records
>>>>>>>> into several tables. What happened was that after about 3-4 days
>>>>>>>> of being up, the regionservers became unresponsive and started to
>>>>>>>> use most of the available CPU (12-core boxes). Nothing terribly
>>>>>>>> informative was in the logs (initially we saw some flush messages
>>>>>>>> that seemed excessive, but that was not all of the time, and we
>>>>>>>> changed back to the standard HBase WAL codec). We are able to kill
>>>>>>>> the unresponsive regionservers and then restart them; the cluster
>>>>>>>> will be fine for a day or so but will start to lock up again.
>>>>>>>>
>>>>>>>> We've dropped the entire HBase and ZooKeeper information and
>>>>>>>> started from scratch, but that has not helped.
>>>>>>>>
>>>>>>>> James Taylor suggested I send this off here. I've attached a
>>>>>>>> jstack report of a locked-up regionserver in hopes that someone
>>>>>>>> can shed some light.
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>> -chris
>>>>>>>
>>>>>>> --
>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>> NOTICE: This message is intended for the use of the individual or
>>>>>>> entity to which it is addressed and may contain information that is
>>>>>>> confidential, privileged and exempt from disclosure under
>>>>>>> applicable law. If the reader of this message is not the intended
>>>>>>> recipient, you are hereby notified that any printing, copying,
>>>>>>> dissemination, distribution, disclosure or forwarding of this
>>>>>>> communication is strictly prohibited. If you have received this
>>>>>>> communication in error, please contact the sender immediately and
>>>>>>> delete it from your system. Thank You.
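For reference, the handler-count change discussed in this thread (Ambari's 60 down to the 10 Chris settled on) lives in hbase-site.xml. A sketch, assuming the stock HBase property name:

```xml
<!-- hbase-site.xml: lower the RPC handler count per regionserver.
     10 is the value used in this thread; Ambari had set 60 here. -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>10</value>
</property>
```

The regionservers need a restart to pick this up. Jeffrey's RPC-trace suggestion is a log4j.properties change on the `org.apache.hadoop.hbase.ipc` logger (assumed logger name, based on the package in the stack traces).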