It was definitely something to do with SSH (the server shutdown handler). I was able to reproduce it, then ran through this again but manually cleared away all files in WALs/* after shutdown, and then the profile was what I'd expect.
On Thu, Nov 5, 2015 at 8:41 PM, Andrew Purtell <[email protected]> wrote:

> On Thu, Nov 5, 2015 at 12:18 PM, Matteo Bertozzi <[email protected]> wrote:
>
>> what is the shutdown and restart? a clean stop? kill?
>
> The shutdown was a ^C of the minicluster as launched out of the dev tree.
> The termination looked normal to me, but I did see the SSH handler kick in
> upon restart.
>
>> is the second YCSB job run on the cloned table?
>
> Yes
>
>> the only master operations with procedures that I see here are:
>> - create meta & namespace table
>> - create the table for ycsb
>> - disable the table for snapshot
>> - maybe enable the table after restart?
>>
>> so, 5 procedures running, let's say 10 steps each (exaggerating), and
>> you'll have 50 calls to poll(). Then we have the SSH procedure; how many
>> servers does the cluster have?
>
> The minicluster in master embeds a zk quorum peer, the master, and two
> regionservers.
>
>> maybe we keep spinning on the server actions after restart...
>
> That was a thought I had as well, spinning on poll of the master procedure
> queue after restart.
>
>> Matteo
>>
>> On Thu, Nov 5, 2015 at 11:47 AM, Andrew Purtell <[email protected]> wrote:
>>
>> > Here's what I did:
>> >
>> > 0. Launch minicluster out of dev tree with ./bin/hbase master start
>> > 1. Load 10M rows using YCSB
>> > 2. Snapshot
>> > 3. Shutdown and restart
>> > 4. Clone a test table from the snapshot
>> > 5. Run YCSB workload A for 2M operations
>> >
>> > The profile was captured at step 5.
>> >
>> > HTH
>> >
>> > > On Nov 5, 2015, at 10:40 AM, Matteo Bertozzi <[email protected]> wrote:
>> > >
>> > > what operations do you have going on the HMaster?
>> > > I did a simple test where there are no operations going on, and poll
>> > > doesn't get out of the waitCond.await(), so unless you have operations
>> > > on the master you'll never call queue.poll().
>> > >
>> > > Or maybe there's a bug and the previous ops were hanging and triggered
>> > > a loop. What did you execute on the master before?
>> > > Do you have a script or something that I can use to reproduce this?
>> > >
>> > > Matteo
>> > >
>> > > On Thu, Nov 5, 2015 at 10:25 AM, Matteo Bertozzi <[email protected]> wrote:
>> > >
>> > >> let me take a look, but branch-1 and master should be the same on the
>> > >> proc stuff
>> > >>
>> > >> Matteo
>> > >>
>> > >>> On Thu, Nov 5, 2015 at 10:23 AM, Stack <[email protected]> wrote:
>> > >>>
>> > >>> You are looking at the master process only?
>> > >>>
>> > >>> I've been looking at branch-1 only of late.
>> > >>>
>> > >>> St.Ack
>> > >>>
>> > >>> On Thu, Nov 5, 2015 at 12:06 AM, Andrew Purtell <[email protected]> wrote:
>> > >>>
>> > >>>> I just profiled trunk running YCSB workload A. I was looking for
>> > >>>> something else but found completely unexpected results.
>> > >>>>
>> > >>>> I see 60% of all allocation pressure in TLABs coming from
>> > >>>> ProcedureExecutors, allocating 16 GB of
>> > >>>> java.util.AbstractMap$SimpleImmutableEntry in this code:
>> > >>>>
>> > >>>> Stack Trace  TLABs  Total TLAB Size(bytes)  Pressure(%)
>> > >>>> java.util.concurrent.ConcurrentSkipListMap$Node.createSnapshot()  11,603  18,402,380,120  100
>> > >>>>   java.util.concurrent.ConcurrentSkipListMap.firstEntry()  11,599  18,401,678,776  99.996
>> > >>>>     org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.nextObject()  11,599  18,401,678,776  99.996
>> > >>>>       org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.poll()  11,599  18,401,678,776  99.996
>> > >>>>         org.apache.hadoop.hbase.master.procedure.MasterProcedureQueue.poll()  11,599  18,401,678,776  99.996
>> > >>>>           org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop()  11,599  18,401,678,776  99.996
>> > >>>>
>> > >>>> The three ProcedureExecutor threads also consume about 60% of all CPU
>> > >>>> time, each about 20%. Here are the top hot code arcs from one of them:
>> > >>>>
>> > >>>> Stack Trace  Sample Count  Percentage(%)
>> > >>>> org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.poll()  1,521  39.191
>> > >>>>   org.apache.hadoop.hbase.master.procedure.MasterProcedureQueue.poll()  1,521  39.191
>> > >>>>     org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop()  1,521  39.191
>> > >>>> java.lang.String$CaseInsensitiveComparator.compare(String, String)  614  15.821
>> > >>>>   java.lang.String$CaseInsensitiveComparator.compare(Object, Object)  614  15.821
>> > >>>>     java.lang.String.compareToIgnoreCase(String)  614  15.821
>> > >>>>       org.apache.hadoop.hbase.ServerName.compareTo(ServerName)  614  15.821
>> > >>>>         org.apache.hadoop.hbase.ServerName.compareTo(Object)  614  15.821
>> > >>>>           java.util.concurrent.ConcurrentSkipListMap.cpr(Comparator, Object, Object)  614  15.821
>> > >>>>             java.util.concurrent.ConcurrentSkipListMap.findNear(Object, int, Comparator)  614  15.821
>> > >>>>               java.util.concurrent.ConcurrentSkipListMap.getNear(Object, int)  614  15.821
>> > >>>>                 java.util.concurrent.ConcurrentSkipListMap.higherEntry(Object)  614  15.821
>> > >>>>                   org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.nextObject()  614  15.821
>> > >>>>                     org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.poll()  614  15.821
>> > >>>>                       org.apache.hadoop.hbase.master.procedure.MasterProcedureQueue.poll()  614  15.821
>> > >>>>                         org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop()  614  15.821
>> > >>>> java.util.concurrent.ConcurrentSkipListMap$Node.createSnapshot()  468  12.059
>> > >>>>   java.util.concurrent.ConcurrentSkipListMap.getNear(Object, int)  468  12.059
>> > >>>>     java.util.concurrent.ConcurrentSkipListMap.higherEntry(Object)  468  12.059
>> > >>>>       org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.nextObject()  468  12.059
>> > >>>>         org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.poll()  468  12.059
>> > >>>>           org.apache.hadoop.hbase.master.procedure.MasterProcedureQueue.poll()  468  12.059
>> > >>>>             org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop()  468  12.059
>> > >>>> java.util.concurrent.ConcurrentSkipListMap.cpr(Comparator, Object, Object)  402  10.358
>> > >>>> java.util.concurrent.ConcurrentSkipListMap.findNear(Object, int, Comparator)  280  7.215
>> > >>>>   java.util.concurrent.ConcurrentSkipListMap.getNear(Object, int)  280  7.215
>> > >>>>     java.util.concurrent.ConcurrentSkipListMap.higherEntry(Object)  280  7.215
>> > >>>>       org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.nextObject()  280  7.215
>> > >>>>         org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.poll()  280  7.215
>> > >>>>           org.apache.hadoop.hbase.master.procedure.MasterProcedureQueue.poll()  280  7.215
>> > >>>>             org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop()  280  7.215
>> > >>>> java.util.concurrent.ConcurrentSkipListMap.findPredecessor(Object, Comparator)  122  3.144
>> > >>>>   java.util.concurrent.ConcurrentSkipListMap.findNear(Object, int, Comparator)  122  3.144
>> > >>>>     java.util.concurrent.ConcurrentSkipListMap.getNear(Object, int)  122  3.144
>> > >>>>       java.util.concurrent.ConcurrentSkipListMap.higherEntry(Object)  122  3.144
>> > >>>>         org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.nextObject()  122  3.144
>> > >>>>           org.apache.hadoop.hbase.procedure2.ProcedureFairRunQueues.poll()  122  3.144
>> > >>>>             org.apache.hadoop.hbase.master.procedure.MasterProcedureQueue.poll()  122  3.144
>> > >>>>               org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop()  122  3.144
>> > >>>>
>> > >>>> Just me? Someone working on ProcedureV2 should have a look at WTH the
>> > >>>> ProcedureExecutors are doing. They seem to be polling the run queue
>> > >>>> constantly and burning up CPU. I'll try running more workloads at the
>> > >>>> next opportunity and see if I can repeat this or gain more insight.

--
Best regards,

  - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
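The profile above points at ConcurrentSkipListMap entry navigation inside
ProcedureFairRunQueues.poll(). As a rough illustration of why that shows up as
AbstractMap$SimpleImmutableEntry allocation, here is a minimal, self-contained
sketch; the map key, comparator, and round-robin loop are assumptions made for
the example, not the actual HBase code. Each firstEntry()/higherEntry() call
materializes a fresh immutable snapshot entry (the
ConcurrentSkipListMap$Node.createSnapshot() frame in the table), so a poll loop
that never blocks allocates at least one entry object per pass:

import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class SkipListPollAllocation {
  public static void main(String[] args) {
    // Hypothetical stand-in for a fair run queue map; the key strings and
    // values are illustrative only. CASE_INSENSITIVE_ORDER mirrors the
    // compareToIgnoreCase() calls visible in the CPU profile above.
    ConcurrentSkipListMap<String, String> queues =
        new ConcurrentSkipListMap<>(String.CASE_INSENSITIVE_ORDER);
    queues.put("server-1,16020,1", "runQueueA");
    queues.put("server-2,16020,1", "runQueueB");

    String current = null;
    for (int i = 0; i < 1_000_000; i++) {
      // Round-robin "next object": the first entry, or the entry after the
      // last key returned, wrapping at the end. Each firstEntry()/higherEntry()
      // call allocates a new AbstractMap.SimpleImmutableEntry snapshot, so
      // this loop allocates at least one entry object per iteration.
      Map.Entry<String, String> next =
          (current == null) ? queues.firstEntry() : queues.higherEntry(current);
      if (next == null) {
        next = queues.firstEntry();   // wrap around to the first queue
      }
      current = next.getKey();
      // Key-only navigation (firstKey()/higherKey()) would avoid the per-call
      // Map.Entry allocation, at the cost of a second lookup for the value.
    }
    System.out.println("last key polled: " + current);
  }
}

The allocation only dominates because the loop never blocks; that is the
separate question sketched below.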

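Matteo's description of the expected behavior (poll() stays parked in
waitCond.await() while the master has no operations) and the suspicion of
spinning on server actions after restart can be pictured with a second sketch.
The lock, condition, and method names here are hypothetical stand-ins, not the
real ProcedureExecutor or MasterProcedureQueue code: the worker blocks only
while the run queue is empty, so a procedure that is polled, cannot advance,
and is immediately re-queued keeps poll() returning on every pass and turns the
worker loop into a busy spin.

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class SpinOnPollSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition waitCond = lock.newCondition();
  private final Queue<Object> runQueue = new ArrayDeque<>();

  // Blocks only while the queue is empty; with nothing queued the worker
  // parks here, matching the idle-master behavior described above.
  Object poll() throws InterruptedException {
    lock.lock();
    try {
      while (runQueue.isEmpty()) {
        waitCond.await();
      }
      return runQueue.poll();
    } finally {
      lock.unlock();
    }
  }

  void add(Object proc) {
    lock.lock();
    try {
      runQueue.add(proc);
      waitCond.signal();
    } finally {
      lock.unlock();
    }
  }

  // Worker loop: if a procedure is polled, fails to advance, and is put
  // straight back, poll() returns immediately on every pass and the loop
  // spins at full CPU instead of waiting on the condition.
  void execLoop() throws InterruptedException {
    while (true) {
      Object proc = poll();
      boolean advanced = execOneStep(proc);
      if (!advanced) {
        add(proc);   // re-queue; the worker never blocks again
      }
    }
  }

  private boolean execOneStep(Object proc) {
    // Stand-in for running one procedure step; returns false when the
    // procedure cannot make progress (for example, a leftover server
    // shutdown procedure replayed from old WAL files after restart).
    return false;
  }
}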