Re: [gem5-users] Issues while Draining the CPUs

Srinivasan Narayanamoorthy Tue, 15 Apr 2014 18:12:12 -0700

Hi,

For the second issue: after the drain, I simply check whether all the 
instructions in the instsToExecute list are squashed. If they are, I clear 
the list; if not, I exit the simulation. (Since the CPU is drained and InstList 
is empty, all entries in the instsToExecute list must be squashed.)
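
A rough sketch of that check, written in Python rather than gem5's C++ so it 
can be run on its own. The names DynInst and clear_if_all_squashed are 
hypothetical stand-ins for illustration, not the actual gem5 classes or patch:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DynInst:
    """Hypothetical stand-in for an O3 dynamic instruction."""
    seq_num: int
    squashed: bool


def clear_if_all_squashed(insts_to_execute: List[DynInst]) -> bool:
    """Empty the list in place if every entry is squashed.

    Returns True if the list is empty afterwards (the drain sanity check
    would pass) and False if a live instruction remains, in which case the
    caller should stop the simulation instead of dropping real work.
    """
    if all(inst.squashed for inst in insts_to_execute):
        insts_to_execute.clear()
        return True
    return False


# Two squashed leftovers, e.g. from an op that completed after the drain.
leftovers = [DynInst(1, True), DynInst(2, True)]
ok = clear_if_all_squashed(leftovers)

# One live instruction: the list is left untouched.
mixed = [DynInst(3, True), DynInst(4, False)]
not_ok = clear_if_all_squashed(mixed)
```

In gem5 itself the equivalent check would have to run before 
drainSanityCheck() asserts that the list is empty.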



For the first issue, I am not familiar with how ruby handles drain. I am sure 
the seniors in this group will know the answer to your questions.


Thanks
Srini 

On 04/15/14, pushkar nandkar wrote:
> Hi,
> Srini, Thanks for that. Do you have any workaround for the second issue?
> 
> 
> About the first problem, I debugged a bit further. However, I am not able to 
> think of what can be done next to debug further.
> 
> 
> I looked into the source code and activated some debug flags 
> (SimpleCPU, Activity, O3CPU, Quiesce, Drain). 
> I have attached trace files for Bodytrack and Canneal (which runs well). The 
> difference can be clearly seen.
> 
> 
> 
> In the trace file I can see that the suspended processors are woken up for 
> CPU1 and CPU3 (TimingSimpleCPU::wakeup()). Meanwhile, CPU2 and CPU0 are being 
> drained. There are no wakeup calls for CPU2 and CPU0.
> This happens for both benchmarks, and therefore I don't think the issue is 
> with wakeup/quiesce.
> 
> 
> 
> 
> Whenever a drain is requested, TimingSimpleCPU::drain() schedules a fetch 
> event.
> During a fetch event, the CPU calls sendFetch(), which calls 
> icachePort.sendTimingReq().
> TimingSimpleCPU::IcachePort::recvTimingResp() then handles the response 
> packet via TimingSimpleCPU::completeIfetch(), which calls advanceInst(), 
> which in turn calls tryCompleteDrain(). 
> 
> That function checks whether the drain manager is null and, if it is not, 
> drains the CPU.
> 
> 
> For Canneal, this works well (see trace file).
> 
> 
> However, for Bodytrack, the drain starts for CPU1 and CPU3 quite late 
> compared to Canneal (see trace file). There are many fetch events before the 
> draining actually starts.
> When TimingSimpleCPU::drain() is finally called for CPU1, the fetch event is 
> not scheduled, since _status is BaseSimpleCPU::DcacheWaitResponse (see 
> TimingSimpleCPU::drain()).
> 
> 
> The same goes for CPU3. However, I found that the events 
> L1Cache_Controller::wakeup and PacketQueue::processSendEvent() drained the 
> CPU3_L1 Controller and CPU3 respectively. The same does not occur for CPU1, 
> and hence CPU1 never gets drained.
> 
> 
> The simulator gets stuck in the doSimLoop and never comes out. It keeps 
> spinning in the loop with no progress.
> 
> 
> Any help for debugging further will be great!
> 
> 
> Thanks
> -Pushkar
> 
> 
> 
> On Mon, Apr 14, 2014 at 7:27 PM, Srinivasan Narayanamoorthy 
> <[email protected]> wrote:
> 
> > Hi 
> > 
> > 1) For the first problem, I suspect that the thread context is suspended 
> > and for some reason never wakes up. You can look for any quiesce() 
> > instructions that are not followed by a corresponding wakeup().
> > 
> > 
> > 2) The basic problem here is that a drain is signalled while a pipelined 
> > op whose issue latency is > 1 (for example, a multiply) is in the execute 
> > pipeline. When the corresponding FUCompletion event is processed, the 
> > instsToExecute list is populated, and hence drainSanityCheck() fails.
> > I am currently checking whether all the instructions in the list are 
> > squashed after the drain is signalled, and clearing the list if so.
> > 
> > 
> > Thanks
> > Srini 
> > 
> > On 04/14/14, pushkar nandkar wrote:
> > > Hi All,
> > >
> > > There are three issues I am facing right now, and I could not figure out 
> > > a solution or workaround for them. Maybe with your help I can.
> > >
> > >
> > > 1. CPU does not get drained.
> > > Command line : build/ALPHA_MESI_CMP_directory/gem5.opt --debug-flag=Drain 
> > > --debug-file=trace.out -d m5out/OutPutDir configs/example/ruby_fs.py -n 4 
> > > --cpu-type=detailed --restore-with-cpu=timing 
> > > --checkpoint-dir=parsec/bodytrack/simsmall/roi-chk_4CPU -r 1 --caches 
> > > --l1i_size=64kB --l1i_assoc=2 --l1d_size=64kB --l1d_assoc=2 --l2cache 
> > > --l2_size=1MB --l2_assoc=8 --mem-size=1024MB --prog-interval=100Hz
> > >
> > >
> > > After execution starts, the simulation goes into doSimLoop and never 
> > > comes out. There is no exit event.
> > > This is what I can see in the output:
> > > 2417238392000: Event_196: system.cpu3 progress event, total committed:13, 
> > > progress insts committed: 13, IPC: 06.5e-07
> > > 2417238392000: Event_195: system.cpu2 progress event, total committed:1, 
> > > progress insts committed: 1, IPC: 0005e-08
> > > 2417238392000: Event_194: system.cpu1 progress event, total committed:12, 
> > > progress insts committed: 12, IPC: 0006e-07
> > > 2417238392000: Event_193: system.cpu0 progress event, total committed:1, 
> > > progress insts committed: 1, IPC: 0005e-08
> > > 2417238392000: Event_192: system.switch_cpus3 progress event, total 
> > > committed:0, progress insts committed: 0, IPC: 00000000
> > > 2417238392000: Event_191: system.switch_cpus2 progress event, total 
> > > committed:0, progress insts committed: 0, IPC: 00000000
> > > 2417238392000: Event_190: system.switch_cpus1 progress event, total 
> > > committed:0, progress insts committed: 0, IPC: 00000000
> > > 2417238392000: Event_189: system.switch_cpus0 progress event, total 
> > > committed:0, progress insts committed: 0, IPC: 00000000
> > > 2427238392000: Event_189: system.switch_cpus0 progress event, total 
> > > committed:0, progress insts committed: 0, IPC: 00000000
> > > 2427238392000: Event_190: system.switch_cpus1 progress event, total 
> > > committed:0, progress insts committed: 0, IPC: 00000000
> > > 2427238392000: Event_191: system.switch_cpus2 progress event, total 
> > > committed:0, progress insts committed: 0, IPC: 00000000
> > > 2427238392000: Event_192: system.switch_cpus3 progress event, total 
> > > committed:0, progress insts committed: 0, IPC: 00000000
> > > 2427238392000: Event_193: system.cpu0 progress event, total committed:1, 
> > > progress insts committed: 0, IPC: 00000000
> > > 2427238392000: Event_194: system.cpu1 progress event, total committed:12, 
> > > progress insts committed: 0, IPC: 00000000
> > > 2427238392000: Event_195: system.cpu2 progress event, total committed:1, 
> > > progress insts committed: 0, IPC: 00000000
> > > 2427238392000: Event_196: system.cpu3 progress event, total committed:13, 
> > > progress insts committed: 0, IPC: 00000000
> > > 2437238392000: Event_196: system.cpu3 progress event, total committed:13, 
> > > progress insts committed: 0, IPC: 00000000
> > > 2437238392000: Event_195: system.cpu2 progress event, total committed:1, 
> > > progress insts committed: 0, IPC: 00000000
> > > 2437238392000: Event_194: system.cpu1 progress event, total committed:12, 
> > > progress insts committed: 0, IPC: 00000000
> > > 2437238392000: Event_193: system.cpu0 progress event, total committed:1, 
> > > progress insts committed: 0, IPC: 00000000
> > >
> > >
> > >
> > > There is no progress at all for the switched CPUs.
> > > I debugged using the Drain debug flag.
> > > As the trace below shows, CPU1 (one of the 4 simulated CPUs) never gets 
> > > drained.
> > >
> > >
> > > 0: system.tsunami.io.rtc: Real-time clock set to Thu Jan 1 00:00:00 2009
> > > 2407238402000: system.ruby.l1_cntrl0.sequencer: RubyPort not drained
> > > 2407238402000: system.ruby.l1_cntrl2.sequencer: RubyPort not drained
> > > 2407238402000: system.cpu0: Requesting drain: 
> > > (0xfffffc0000319980=>0xfffffc0000319984)
> > > 2407238402000: system.cpu1: No need to drain.
> > > 2407238402000: system.cpu2: Requesting drain: (0x1200f82ec=>0x1200f82f0)
> > > 2407238402000: system.cpu3: No need to drain.
> > > 2408203163500: system.ruby.l1_cntrl2.sequencer: Drain count: 0
> > > 2408203163500: system.ruby.l1_cntrl2.sequencer: RubyPort done draining, 
> > > signaling drain done
> > > 2408203164000: system.cpu2: tryCompleteDrain: (0x1200f7ed4=>0x1200f7ed8)
> > > 2408203164000: system.cpu2: CPU done draining, processing drain event
> > > 2408203171000: system.ruby.l1_cntrl0.sequencer: Drain count: 0
> > > 2408203171000: system.ruby.l1_cntrl0.sequencer: RubyPort done draining, 
> > > signaling drain done
> > > 2408203209000: system.cpu0: tryCompleteDrain: 
> > > (0xfffffc0000319984=>0xfffffc0000319988)
> > > 2408203209000: system.cpu0: CPU done draining, processing drain event
> > > 2408203209000: system.ruby.l1_cntrl1.sequencer: RubyPort not drained
> > > 2408203209000: system.ruby.l1_cntrl3.sequencer: RubyPort not drained
> > > 2408203209000: system.cpu0: No need to drain.
> > > 2408203209000: system.cpu1: Requesting drain: (0x4139=>0x413d)
> > > 2408203209000: system.cpu2: No need to drain.
> > > 2408203209000: system.cpu3: Requesting drain: (0x4139=>0x413d)
> > > 2408203228000: system.ruby.l1_cntrl3.sequencer: Drain count: 0
> > > 2408203228000: system.ruby.l1_cntrl3.sequencer: RubyPort done draining, 
> > > signaling drain done
> > > 2408203228500: system.cpu3: tryCompleteDrain: (0x413d=>0x4141)
> > > 2408203228500: system.cpu3: CPU done draining, processing drain event
> > > 2417238392000: Event_196: system.cpu3 progress event, total committed:13, 
> > > progress insts committed: 13, IPC: 06.5e-07
> > >
> > > ....(as above)
> > >
> > >
> > > When I kill the run, I get the following error :
> > 
> > 
> > > gem5.opt: build/ALPHA_MESI_CMP_directory/python/swig/drain_wrap.cc:3233: 
> > > void cleanupDrainManager(DrainManager*): Assertion 
> > > `drain_manager->getCount() == 0' failed.
> > >
> > >
> > >
> > > To debug further, I used gdb with a breakpoint at the start of 
> > > cleanupDrainManager in 
> > > build/ALPHA_MESI_CMP_directory/python/swig/drain_wrap.cc. However, on the 
> > > first cleanup call after tick 2408203209000, drain_manager->getCount() 
> > > returns 0 and the function returns normally. cleanupDrainManager is never 
> > > called again; after continuing, execution goes back into the same 
> > > simulation loop and never comes out.
> > >
> > >
> > > (gdb) c
> > > Continuing.
> > >
> > >
> > > Breakpoint 4, cleanupDrainManager (drain_manager=0xd4e9a50)
> > > at build/ALPHA_MESI_CMP_directory/python/swig/drain_wrap.cc:3232
> > > 3232 assert(drain_manager);
> > > (gdb) s
> > > 3233 assert(drain_manager->getCount() == 0);
> > > (gdb) p drain_manager->getCount()
> > > $1 = 0
> > > (gdb) s
> > > DrainManager::getCount (this=0xd4e9a50)
> > > at build/ALPHA_MESI_CMP_directory/sim/drain.hh:78
> > > 78 unsigned int getCount() const { return _count; }
> > > (gdb)
> > > cleanupDrainManager (drain_manager=0xd4e9a50)
> > > at build/ALPHA_MESI_CMP_directory/python/swig/drain_wrap.cc:3234
> > > 3234 delete drain_manager;
> > > (gdb)
> > > DrainManager::~DrainManager (this=0xd4e9a50, __in_chrg=<optimized out>)
> > > at build/ALPHA_MESI_CMP_directory/sim/drain.cc:50
> > > 50 }
> > > (gdb) n
> > > cleanupDrainManager (drain_manager=0xd4e9a50)
> > > at build/ALPHA_MESI_CMP_directory/python/swig/drain_wrap.cc:3235
> > > 3235 }
> > > (gdb) s
> > > _wrap_cleanupDrainManager (args=0x3804190)
> > > at build/ALPHA_MESI_CMP_directory/python/swig/drain_wrap.cc:3524
> > > 3524 resultobj = SWIG_Py_Void();
> > > (gdb) n
> > > 3525 return resultobj;
> > > (gdb)
> > > 3528 }
> > > (gdb)
> > > 0x00007ffff775c5d5 in PyEval_EvalFrameEx () from 
> > > /usr/lib/libpython2.7.so.1.0
> > >
> > >
> > >
> > > (gdb) c
> > > Continuing.
> > > info: Entering event queue @ 2408203209000. Starting simulation...
> > >
> > > (It is now in the loop and does not come out.)
> > >
> > >
> > > The -maxtick flag doesn't work to stop the simulation.
> > >
> > >
> > > I am running PARSEC benchmarks. This does not happen for every benchmark. 
> > > For example, Bodytrack shows this behavior, whereas Canneal runs fine.
> > >
> > >
> > >
> > >
> > >
> > >
> > > 2. Assertion Fails
> > >
> > >
> > > --debug-flags=Drain --trace-file=trace.out -d m5out/OutputDIR 
> > > configs/example/ruby_fs.py --ruby -n 4 --repeat-switch=20 
> > > --repeat-time-simple=10000000000 --repeat-time-detailed=1000000000 
> > > --cpu-type=detailed --restore-with-cpu=timing 
> > > --checkpoint-dir=parsec/swaptions/simmedium/multi-chk -r 6 --caches 
> > > --l1i_size=64kB --l1i_assoc=2 --l1d-cachebank --l1d-data-write-latency=1 
> > > --l1d_size=64kB --l1d_assoc=2 --l2cache --l2_size=4MB --l2_assoc=8 
> > > --mem-size=1024MB --prog-interval=10Hz
> > >
> > >
> > > Here I get the following assertion
> > 
> > 
> > > gem5.debug: build/ALPHA_MESI_CMP_directory/cpu/o3/inst_queue_impl.hh:443: 
> > > void InstructionQueue<Impl>::drainSanityCheck() const [with Impl = 
> > > O3CPUImpl]: Assertion `instsToExecute.empty()' failed
> > >
> > >
> > >
> > > Using gdb, I can confirm that instsToExecute is indeed not empty just 
> > > before the assertion fires:
> > >
> > >
> > > (gdb) plist instsToExecute DynInstPtr
> > > elem[0]: $129 = {
> > > data = 0x4cf0e00
> > > }
> > > elem[1]: $130 = {
> > > data = 0x4d24100
> > > }
> > > List size = 2
> > > (gdb) p name()
> > 
> > > $131 = "system.repeat_switch_cpus1.iq"
> > > (gdb) backtrace
> > > #0 InstructionQueue<O3CPUImpl>::drainSanityCheck (this=0x354e660)
> > > at build/ALPHA_MESI_CMP_directory/cpu/o3/inst_queue_impl.hh:443
> > > #1 0x000000000097ea7f in DefaultIEW<O3CPUImpl>::drainSanityCheck (
> > > this=0x354e2c0) at build/ALPHA_MESI_CMP_directory/cpu/o3/iew_impl.hh:396
> > > #2 0x000000000091592f in FullO3CPU<O3CPUImpl>::drainSanityCheck (
> > > this=0x354d410) at build/ALPHA_MESI_CMP_directory/cpu/o3/cpu.cc:1187
> > > #3 0x0000000000929cd4 in FullO3CPU<O3CPUImpl>::drain (this=0x354d410,
> > > drain_manager=0x2eb00f0)
> > > at build/ALPHA_MESI_CMP_directory/cpu/o3/cpu.cc:1157
> > > #4 0x0000000000b912bd in _wrap_Drainable_drain (args=0x28aaa70)
> > > at build/ALPHA_MESI_CMP_directory/python/swig/drain_wrap.cc:3142
> > > #5 0x0000003958ed55c6 in PyEval_EvalFrameEx ()
> > > from /usr/lib64/libpython2.6.so.1.0
> > > #6 0x0000003958ed7657 in PyEval_EvalCodeEx ()
> > > from /usr/lib64/libpython2.6.so.1.0
> > > #7 0x0000003958ed5aa4 in PyEval_EvalFrameEx ()
> > > from /usr/lib64/libpython2.6.so.1.0
> > > #8 0x0000003958e60917 in ?? () from /usr/lib64/libpython2.6.so.1.0
> > > #9 0x0000003958e42b4b in PyIter_Next () from 
> > > /usr/lib64/libpython2.6.so.1.0
> > > #10 0x0000003958ecb036 in ?? () from /usr/lib64/libpython2.6.so.1.0
> > > #11 0x0000003958ed59e4 in PyEval_EvalFrameEx ()
> > > from /usr/lib64/libpython2.6.so.1.0
> > > #12 0x0000003958ed7657 in PyEval_EvalCodeEx ()
> > >
> > > ....
> > >
> > >
> > > In cpu/o3/iew_impl.hh, executeInsts() takes the instructions to execute 
> > > from the instsToExecute list.
> > > However, the number of instructions it executes is taken from the issue 
> > > stage, fromIssue->size (cpu/o3/iew_impl.hh:1221), rather than from the 
> > > number of instructions waiting to be executed. As a result, the 
> > > instsToExecute list is not emptied, and the assertion fires.
> > >
> > >
> > >
> > > In cpu/o3/inst_queue_impl.hh there are three push backs to the 
> > > instsToExecute list and only one pop, which happens when it gets the 
> > > instruction for execution.
> > >
> > >
> > >
> > >
> > >
> > > 3. GDB
> > > Using gdb, once execution enters any of the *_wrap.cc files or the 
> > > python/swig directory, stepping does not return to where it left off or 
> > > to where other functions are called, unless there is a breakpoint 
> > > somewhere in the code.
> > > It shows the following:
> > >
> > >
> > > Single stepping until exit from function PyEval_EvalFrameEx,
> > > which has no line number information.
> > > 0x00007ffff771c6b5 in PyEval_EvalCodeEx () from 
> > > /usr/lib/libpython2.7.so.1.0
> > >
> > > (gdb)
> > > Single stepping until exit from function PyEval_EvalCodeEx,
> > > which has no line number information.
> > > 0x00007ffff775c650 in PyEval_EvalFrameEx () from 
> > > /usr/lib/libpython2.7.so.1.0
> > >
> > >
> > > The simulation continues in the background and there is no way to get 
> > > inside the simulator code.
> > >
> > >
> > > Could anyone explain this behavior? or guide me to some useful documents?
> > >
> > >
> > >
> > >
> > > Thanks,
> > >
> > > -Pushkar
> > 
> > 
> > _______________________________________________
> > gem5-users mailing list
> > [email protected]
> > http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
> >
