Thanks. I do not know if it was a burst. When I get a chance I'll try to reproduce this again with a profiler attached to the process.
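[Editor's note: the "spin" discussed in the quoted thread below is the master's ProcedureExecutor threads calling poll() in a loop and getting nothing runnable back, instead of parking on the executor's wait condition. A minimal sketch of the two behaviors, with hypothetical names; this is not the actual ProcedureExecutor or MasterProcedureQueue code:

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    // Toy run queue contrasting a busy-spin poll with a condition-await
    // poll. Illustrative only; not the HBase implementation.
    public class RunQueueSketch<T> {
        private final Queue<T> queue = new ConcurrentLinkedQueue<>();
        private final ReentrantLock lock = new ReentrantLock();
        private final Condition notEmpty = lock.newCondition();

        public void add(T item) {
            queue.add(item);
            lock.lock();
            try {
                notEmpty.signal(); // wake one waiting executor thread
            } finally {
                lock.unlock();
            }
        }

        // Busy-spin variant: if poll() keeps returning null (for example,
        // entries exist but are not yet runnable while server shutdown
        // handling is in flight), the executor loop pins a core and, with
        // a skip-list-backed queue, allocates on every probe.
        public T spinPoll() {
            T item;
            while ((item = queue.poll()) == null) {
                // hot spin: 100% CPU on this thread
            }
            return item;
        }

        // Await variant: the executor parks until work is signalled, which
        // is what the waitCond.await() mentioned below is meant to achieve.
        public T awaitPoll() throws InterruptedException {
            lock.lock();
            try {
                T item;
                while ((item = queue.poll()) == null) {
                    notEmpty.await();
                }
                return item;
            } finally {
                lock.unlock();
            }
        }
    }

The distinction matters here because a queue holding entries that are present but not yet pollable (parked procedures) makes the spin variant indistinguishable from the constant-poll behavior in the profile quoted below.]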
> On Nov 5, 2015, at 11:25 PM, Matteo Bertozzi <[email protected]> wrote:
>
> Do you know if that is just a burst on startup while the SSH is running,
> or if it goes on forever? In theory it should be just on startup while the
> SSH is running; that is by implementation, and I'm working on completing
> the procedure scheduler to remove that spin on startup.
>
> If it goes on forever, that is a bug that I'm still unable to reproduce.
>
>> On Thu, Nov 5, 2015 at 10:45 PM, Andrew Purtell <[email protected]> wrote:
>>
>> It was definitely something to do with SSH. I was able to reproduce, then
>> ran through this again but manually cleared away all files in WALs/* after
>> shutdown, and then the profile was what I'd expect.
>>
>>> On Thu, Nov 5, 2015 at 8:41 PM, Andrew Purtell <[email protected]> wrote:
>>>
>>>> On Thu, Nov 5, 2015 at 12:18 PM, Matteo Bertozzi <[email protected]> wrote:
>>>>
>>>> What is the shutdown and restart? A clean stop? A kill?
>>>
>>> The shutdown was a ^C of the minicluster as launched out of the dev
>>> tree. The termination looked normal to me, but I did see the SSH handler
>>> kick in upon restart.
>>>
>>>> Is the second YCSB job run on the cloned table?
>>>
>>> Yes.
>>>
>>>> The only master operations with procedures that I see here are:
>>>> - create meta & namespace table
>>>> - create the table for YCSB
>>>> - disable the table for snapshot
>>>> - maybe enable the table after restart?
>>>>
>>>> So, 5 procedures running, let's say 10 steps each (exaggerating): you'll
>>>> have 50 calls to poll(). Then we have the SSH procedure; how many
>>>> servers does the cluster have?
>>>
>>> The minicluster in master embeds a zk quorum peer, the master, and two
>>> regionservers.
>>>
>>>> Maybe we keep spinning on the server actions after restart...
>>>
>>> That was a thought I had as well: spinning on poll of the master
>>> procedure queue after restart.
>>>
>>>> Matteo
>>>>
>>>>> On Thu, Nov 5, 2015 at 11:47 AM, Andrew Purtell <[email protected]> wrote:
>>>>>
>>>>> Here's what I did:
>>>>>
>>>>> 0. Launch minicluster out of dev tree with ./bin/hbase master start
>>>>> 1. Load 10M rows using YCSB
>>>>> 2. Snapshot
>>>>> 3. Shutdown and restart
>>>>> 4. Clone a test table from the snapshot
>>>>> 5. Run YCSB workload A for 2M operations
>>>>>
>>>>> The profile was captured at step 5.
>>>>>
>>>>> HTH
>>>>>
>>>>>> On Nov 5, 2015, at 10:40 AM, Matteo Bertozzi <[email protected]> wrote:
>>>>>>
>>>>>> What operations do you have going on the HMaster? I did a simple test
>>>>>> where there are no operations going on, and poll doesn't get out of
>>>>>> the waitCond.await(), so unless you have operations on the master
>>>>>> you'll never call queue.poll().
>>>>>>
>>>>>> Or maybe there's a bug: the previous ops were hanging and triggered a
>>>>>> loop. What did you execute on the master before? Do you have a script
>>>>>> or something that I can use to reproduce this?
>>>>>>
>>>>>> Matteo
>>>>>>
>>>>>>> On Thu, Nov 5, 2015 at 10:25 AM, Matteo Bertozzi <[email protected]> wrote:
>>>>>>>
>>>>>>> Let me take a look, but branch-1 and master should be the same on
>>>>>>> the proc stuff.
>>>>>>>
>>>>>>> Matteo
>>>>>>>
>>>>>>>> On Thu, Nov 5, 2015 at 10:23 AM, Stack <[email protected]> wrote:
>>>>>>>>
>>>>>>>> You are looking at the master process only?
>>>>>>>>
>>>>>>>> I've been looking at branch-1 only of late.
>>>>>>>>
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>>> On Thu, Nov 5, 2015 at 12:06 AM, Andrew Purtell <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> I just profiled trunk running YCSB workload A. I was looking for
>>>>>>>>> something else but found completely unexpected results.
>>>>>>>>>
>>>>>>>>> I see 60% of all allocation pressure in TLABs coming from
>>>>>>>>> ProcedureExecutors, allocating 16 GB of
>>>>>>>>> java.util.AbstractMap$SimpleImmutableEntry in this code:
>>>>>>>>>
>>>>>>>>> Stack Trace | TLABs | Total TLAB Size (bytes) | Pressure (%)
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap$Node.createSnapshot() | 11,603 | 18,402,380,120 | 100
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.firstEntry() | 11,599 | 18,401,678,776 | 99.996
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.nextObject() | 11,599 | 18,401,678,776 | 99.996
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.poll() | 11,599 | 18,401,678,776 | 99.996
>>>>>>>>> org.apache.hadoop.hbase.master.procedure.MasterProcedureQueue.poll() | 11,599 | 18,401,678,776 | 99.996
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop() | 11,599 | 18,401,678,776 | 99.996
>>>>>>>>>
>>>>>>>>> The three ProcedureExecutor threads also consume about 60% of all
>>>>>>>>> CPU time, each about 20%. Here are the top hot code arcs from one
>>>>>>>>> of them:
>>>>>>>>>
>>>>>>>>> Stack Trace | Sample Count | Percentage (%)
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.poll() | 1,521 | 39.191
>>>>>>>>> org.apache.hadoop.hbase.master.procedure.MasterProcedureQueue.poll() | 1,521 | 39.191
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop() | 1,521 | 39.191
>>>>>>>>> java.lang.String$CaseInsensitiveComparator.compare(String, String) | 614 | 15.821
>>>>>>>>> java.lang.String$CaseInsensitiveComparator.compare(Object, Object) | 614 | 15.821
>>>>>>>>> java.lang.String.compareToIgnoreCase(String) | 614 | 15.821
>>>>>>>>> org.apache.hadoop.hbase.ServerName.compareTo(ServerName) | 614 | 15.821
>>>>>>>>> org.apache.hadoop.hbase.ServerName.compareTo(Object) | 614 | 15.821
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.cpr(Comparator, Object, Object) | 614 | 15.821
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.findNear(Object, int, Comparator) | 614 | 15.821
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.getNear(Object, int) | 614 | 15.821
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.higherEntry(Object) | 614 | 15.821
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.nextObject() | 614 | 15.821
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.poll() | 614 | 15.821
>>>>>>>>> org.apache.hadoop.hbase.master.procedure.MasterProcedureQueue.poll() | 614 | 15.821
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop() | 614 | 15.821
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap$Node.createSnapshot() | 468 | 12.059
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.getNear(Object, int) | 468 | 12.059
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.higherEntry(Object) | 468 | 12.059
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.nextObject() | 468 | 12.059
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.poll() | 468 | 12.059
>>>>>>>>> org.apache.hadoop.hbase.master.procedure.MasterProcedureQueue.poll() | 468 | 12.059
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop() | 468 | 12.059
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.cpr(Comparator, Object, Object) | 402 | 10.358
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.findNear(Object, int, Comparator) | 280 | 7.215
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.getNear(Object, int) | 280 | 7.215
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.higherEntry(Object) | 280 | 7.215
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.nextObject() | 280 | 7.215
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.poll() | 280 | 7.215
>>>>>>>>> org.apache.hadoop.hbase.master.procedure.MasterProcedureQueue.poll() | 280 | 7.215
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop() | 280 | 7.215
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.findPredecessor(Object, Comparator) | 122 | 3.144
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.findNear(Object, int, Comparator) | 122 | 3.144
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.getNear(Object, int) | 122 | 3.144
>>>>>>>>> java.util.concurrent.ConcurrentSkipListMap.higherEntry(Object) | 122 | 3.144
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.nextObject() | 122 | 3.144
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.poll() | 122 | 3.144
>>>>>>>>> org.apache.hadoop.hbase.master.procedure.MasterProcedureQueue.poll() | 122 | 3.144
>>>>>>>>> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop() | 122 | 3.144
>>>>>>>>>
>>>>>>>>> Just me? Someone working on ProcedureV2 should have a look at WTH
>>>>>>>>> the ProcedureExecutors are doing. They seem to be polling the run
>>>>>>>>> queue constantly and burning up CPU. I'll try running more
>>>>>>>>> workloads at the next opportunity and see if I can repeat this or
>>>>>>>>> gain more insight.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>>    - Andy
>>>>>>>>>
>>>>>>>>> Problems worthy of attack prove their worth by hitting back. - Piet
>>>>>>>>> Hein (via Tom White)
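[Editor's note on where the allocation numbers above come from: ConcurrentSkipListMap never hands out live entries. firstEntry() and higherEntry() build a fresh immutable snapshot (an AbstractMap.SimpleImmutableEntry, created by the Node.createSnapshot() frame at the top of the TLAB profile) on every call, so a poll loop that probes the map millions of times produces one short-lived entry object per probe. A standalone demonstration, independent of the HBase code:

    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Shows the per-call entry-snapshot allocation behind the profile
    // above, and a key-walking alternative that avoids it.
    public class SkipListEntryChurn {
        public static void main(String[] args) {
            ConcurrentSkipListMap<String, Integer> map = new ConcurrentSkipListMap<>();
            map.put("a", 1);
            map.put("b", 2);

            // Each call allocates a new AbstractMap.SimpleImmutableEntry
            // (Node.createSnapshot() in the TLAB profile). Run millions of
            // times in a poll loop, this is gigabytes of garbage.
            for (int i = 0; i < 3; i++) {
                Map.Entry<String, Integer> e = map.firstEntry();
                System.out.println(e.getKey() + "=" + e.getValue());
            }

            // Key-walking alternative: firstKey()/higherKey() return the
            // existing key objects and allocate no entry. A key may be
            // removed concurrently, so get() can return null and callers
            // must tolerate that.
            String k = map.firstKey();
            while (k != null) {
                Integer v = map.get(k);
                if (v != null) {
                    System.out.println(k + "=" + v);
                }
                k = map.higherKey(k);
            }
        }
    }

The trade-off: the key-walking variant does a second lookup per step and must handle concurrent removals, but it keeps a tight polling loop from generating entry garbage.]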

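[Editor's note: the arcs in the profile (higherEntry() to advance, firstEntry() to wrap around) suggest the fairness queue round-robins over a ConcurrentSkipListMap. What follows is a hedged reconstruction of that shape, not the actual ProcedureFairRunQueues source; the String keys with String.CASE_INSENSITIVE_ORDER stand in for the case-insensitive ServerName comparisons visible in the samples:

    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Guess at the round-robin pattern implied by the stack traces above;
    // synchronization is omitted, so treat this as single-threaded.
    public class FairRoundRobinSketch<Q> {
        private final ConcurrentSkipListMap<String, Q> queues =
            new ConcurrentSkipListMap<>(String.CASE_INSENSITIVE_ORDER);
        private String current; // key of the queue served last

        public void addQueue(String key, Q queue) {
            queues.put(key, queue);
        }

        public Q nextQueue() {
            // Advance past the current key; wrap to the smallest key at
            // the end. Both higherEntry() and firstEntry() allocate a new
            // SimpleImmutableEntry snapshot per call, which is the
            // allocation source in the TLAB profile when this runs in a
            // tight loop.
            Map.Entry<String, Q> next =
                (current != null) ? queues.higherEntry(current) : null;
            if (next == null) {
                next = queues.firstEntry(); // null only if no queues exist
            }
            if (next == null) {
                return null; // nothing registered; callers must not spin on this
            }
            current = next.getKey();
            return next.getValue();
        }
    }

If nextQueue() can return null, or return queues whose procedures are not actually runnable, while callers retry immediately, the combination reproduces both findings in this thread: near-constant CPU in poll() and per-iteration entry-snapshot garbage.]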