Hi 

1) For the first problem, I suspect that the ThreadContext is suspended and 
for some reason never wakes up. You can look for any quiesce() call that is 
not accompanied by a corresponding wakeup(). 


2) The basic problem here is that a drain is signalled while a pipelined op 
whose issue latency is greater than 1 (for example, a multiply) is in the 
execute pipeline. When the corresponding FUCompletion event is processed, the 
instsToExecute list is repopulated, and hence the drainSanityCheck() 
assertion fails. 
I am currently checking whether all the instructions in the list are squashed 
after the drain is signalled and, if so, clearing the list.


Thanks
Srini 

On 04/14/14, pushkar nandkar 
 wrote:
> Hi All,
> 
> There are three issues I am facing right now for which I could not figure 
> out a solution or workaround. Maybe with your help I can.
> 
> 
> 1. CPU does not get drained. 
> Command line : build/ALPHA_MESI_CMP_directory/gem5.opt --debug-flag=Drain 
> --debug-file=trace.out -d m5out/OutPutDir configs/example/ruby_fs.py -n 4 
> --cpu-type=detailed --restore-with-cpu=timing 
> --checkpoint-dir=parsec/bodytrack/simsmall/roi-chk_4CPU -r 1 --caches 
> --l1i_size=64kB --l1i_assoc=2 --l1d_size=64kB --l1d_assoc=2 --l2cache 
> --l2_size=1MB --l2_assoc=8 --mem-size=1024MB --prog-interval=100Hz
> 
> 
> After execution, the simulation goes into doSimLoop() and never comes out. 
> There is no exit event.
> This is what I see in the output:
> 2417238392000: Event_196: system.cpu3 progress event, total committed:13, 
> progress insts committed: 13, IPC: 06.5e-07
> 2417238392000: Event_195: system.cpu2 progress event, total committed:1, 
> progress insts committed: 1, IPC: 0005e-08
> 2417238392000: Event_194: system.cpu1 progress event, total committed:12, 
> progress insts committed: 12, IPC: 0006e-07
> 2417238392000: Event_193: system.cpu0 progress event, total committed:1, 
> progress insts committed: 1, IPC: 0005e-08
> 2417238392000: Event_192: system.switch_cpus3 progress event, total 
> committed:0, progress insts committed: 0, IPC: 00000000
> 2417238392000: Event_191: system.switch_cpus2 progress event, total 
> committed:0, progress insts committed: 0, IPC: 00000000
> 2417238392000: Event_190: system.switch_cpus1 progress event, total 
> committed:0, progress insts committed: 0, IPC: 00000000
> 2417238392000: Event_189: system.switch_cpus0 progress event, total 
> committed:0, progress insts committed: 0, IPC: 00000000
> 2427238392000: Event_189: system.switch_cpus0 progress event, total 
> committed:0, progress insts committed: 0, IPC: 00000000
> 2427238392000: Event_190: system.switch_cpus1 progress event, total 
> committed:0, progress insts committed: 0, IPC: 00000000
> 2427238392000: Event_191: system.switch_cpus2 progress event, total 
> committed:0, progress insts committed: 0, IPC: 00000000
> 2427238392000: Event_192: system.switch_cpus3 progress event, total 
> committed:0, progress insts committed: 0, IPC: 00000000
> 2427238392000: Event_193: system.cpu0 progress event, total committed:1, 
> progress insts committed: 0, IPC: 00000000
> 2427238392000: Event_194: system.cpu1 progress event, total committed:12, 
> progress insts committed: 0, IPC: 00000000
> 2427238392000: Event_195: system.cpu2 progress event, total committed:1, 
> progress insts committed: 0, IPC: 00000000
> 2427238392000: Event_196: system.cpu3 progress event, total committed:13, 
> progress insts committed: 0, IPC: 00000000
> 2437238392000: Event_196: system.cpu3 progress event, total committed:13, 
> progress insts committed: 0, IPC: 00000000
> 2437238392000: Event_195: system.cpu2 progress event, total committed:1, 
> progress insts committed: 0, IPC: 00000000
> 2437238392000: Event_194: system.cpu1 progress event, total committed:12, 
> progress insts committed: 0, IPC: 00000000
> 2437238392000: Event_193: system.cpu0 progress event, total committed:1, 
> progress insts committed: 0, IPC: 00000000
> 
> 
> 
> There is no progress at all for the switched CPUs.
> I debugged using the Drain debug flags.
> As the trace below shows, CPU1 never gets drained out of the 4 CPUs 
> simulated.
> 
> 
> 0: system.tsunami.io.rtc: Real-time clock set to Thu Jan 1 00:00:00 2009
> 2407238402000: system.ruby.l1_cntrl0.sequencer: RubyPort not drained
> 2407238402000: system.ruby.l1_cntrl2.sequencer: RubyPort not drained
> 2407238402000: system.cpu0: Requesting drain: 
> (0xfffffc0000319980=>0xfffffc0000319984)
> 2407238402000: system.cpu1: No need to drain.
> 2407238402000: system.cpu2: Requesting drain: (0x1200f82ec=>0x1200f82f0)
> 2407238402000: system.cpu3: No need to drain.
> 2408203163500: system.ruby.l1_cntrl2.sequencer: Drain count: 0
> 2408203163500: system.ruby.l1_cntrl2.sequencer: RubyPort done draining, 
> signaling drain done
> 2408203164000: system.cpu2: tryCompleteDrain: (0x1200f7ed4=>0x1200f7ed8)
> 2408203164000: system.cpu2: CPU done draining, processing drain event
> 2408203171000: system.ruby.l1_cntrl0.sequencer: Drain count: 0
> 2408203171000: system.ruby.l1_cntrl0.sequencer: RubyPort done draining, 
> signaling drain done
> 2408203209000: system.cpu0: tryCompleteDrain: 
> (0xfffffc0000319984=>0xfffffc0000319988)
> 2408203209000: system.cpu0: CPU done draining, processing drain event
> 2408203209000: system.ruby.l1_cntrl1.sequencer: RubyPort not drained
> 2408203209000: system.ruby.l1_cntrl3.sequencer: RubyPort not drained
> 2408203209000: system.cpu0: No need to drain.
> 2408203209000: system.cpu1: Requesting drain: (0x4139=>0x413d)
> 2408203209000: system.cpu2: No need to drain.
> 2408203209000: system.cpu3: Requesting drain: (0x4139=>0x413d)
> 2408203228000: system.ruby.l1_cntrl3.sequencer: Drain count: 0
> 2408203228000: system.ruby.l1_cntrl3.sequencer: RubyPort done draining, 
> signaling drain done
> 2408203228500: system.cpu3: tryCompleteDrain: (0x413d=>0x4141)
> 2408203228500: system.cpu3: CPU done draining, processing drain event
> 2417238392000: Event_196: system.cpu3 progress event, total committed:13, 
> progress insts committed: 13, IPC: 06.5e-07
> 
> ....(as above)
> 
> 
> When I kill the run, I get the following error: 
> gem5.opt: build/ALPHA_MESI_CMP_directory/python/swig/drain_wrap.cc:3233: void 
> cleanupDrainManager(DrainManager*): Assertion `drain_manager->getCount() == 
> 0' failed.
> 
> 
> 
> To debug further, I used gdb with a breakpoint at the start of 
> cleanupDrainManager() in 
> build/ALPHA_MESI_CMP_directory/python/swig/drain_wrap.cc. On the first call 
> after tick 2408203209000, drain_manager->getCount() returns 0 and the 
> function returns normally; cleanupDrainManager() is never called again. 
> After continuing, the run enters the same simulation loop and never comes 
> out.
> 
> 
> (gdb) c
> Continuing.
> 
> 
> Breakpoint 4, cleanupDrainManager (drain_manager=0xd4e9a50)
> at build/ALPHA_MESI_CMP_directory/python/swig/drain_wrap.cc:3232
> 3232 assert(drain_manager);
> (gdb) s
> 3233 assert(drain_manager->getCount() == 0);
> (gdb) p drain_manager->getCount()
> $1 = 0
> (gdb) s
> DrainManager::getCount (this=0xd4e9a50)
> at build/ALPHA_MESI_CMP_directory/sim/drain.hh:78
> 78 unsigned int getCount() const { return _count; }
> (gdb) 
> cleanupDrainManager (drain_manager=0xd4e9a50)
> at build/ALPHA_MESI_CMP_directory/python/swig/drain_wrap.cc:3234
> 3234 delete drain_manager;
> (gdb) 
> DrainManager::~DrainManager (this=0xd4e9a50, __in_chrg=<optimized out>)
> at build/ALPHA_MESI_CMP_directory/sim/drain.cc:50
> 50 }
> (gdb) n
> cleanupDrainManager (drain_manager=0xd4e9a50)
> at build/ALPHA_MESI_CMP_directory/python/swig/drain_wrap.cc:3235
> 3235 }
> (gdb) s
> _wrap_cleanupDrainManager (args=0x3804190)
> at build/ALPHA_MESI_CMP_directory/python/swig/drain_wrap.cc:3524
> 3524 resultobj = SWIG_Py_Void();
> (gdb) n
> 3525 return resultobj;
> (gdb) 
> 3528 }
> (gdb) 
> 0x00007ffff775c5d5 in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
> 
> 
> 
> (gdb) c
> Continuing.
> info: Entering event queue @ 2408203209000. Starting simulation...
> 
> (it is now in the loop and does not come out)
> 
> 
> The -maxtick flag does not work to stop the simulation.
> 
> 
> I am running PARSEC benchmarks. This does not happen for every benchmark: 
> for example, Bodytrack shows this behavior while Canneal runs fine.
> 
> 
> 
> 
> 
> 
> 2. Assertion Fails
> 
> 
> --debug-flags=Drain --trace-file=trace.out -d m5out/OutputDIR 
> configs/example/ruby_fs.py --ruby -n 4 --repeat-switch=20 
> --repeat-time-simple=10000000000 --repeat-time-detailed=1000000000 
> --cpu-type=detailed --restore-with-cpu=timing 
> --checkpoint-dir=parsec/swaptions/simmedium/multi-chk -r 6 --caches 
> --l1i_size=64kB --l1i_assoc=2 --l1d-cachebank --l1d-data-write-latency=1 
> --l1d_size=64kB --l1d_assoc=2 --l2cache --l2_size=4MB --l2_assoc=8 
> --mem-size=1024MB --prog-interval=10Hz 
> 
> 
> Here I get the following assertion failure:
> gem5.debug: build/ALPHA_MESI_CMP_directory/cpu/o3/inst_queue_impl.hh:443: 
> void InstructionQueue<Impl>::drainSanityCheck() const [with Impl = 
> O3CPUImpl]: Assertion `instsToExecute.empty()' failed.
> 
> 
> 
> Using gdb, it indeed shows that instsToExecute is not empty just before 
> the assertion fires:
> 
> 
> (gdb) plist instsToExecute DynInstPtr
> elem[0]: $129 = {
> data = 0x4cf0e00
> }
> elem[1]: $130 = {
> data = 0x4d24100
> }
> List size = 2 
> (gdb) p name()
> $131 = "system.repeat_switch_cpus1.iq"
> (gdb) backtrace
> #0 InstructionQueue<O3CPUImpl>::drainSanityCheck (this=0x354e660)
> at build/ALPHA_MESI_CMP_directory/cpu/o3/inst_queue_impl.hh:443
> #1 0x000000000097ea7f in DefaultIEW<O3CPUImpl>::drainSanityCheck (
> this=0x354e2c0) at build/ALPHA_MESI_CMP_directory/cpu/o3/iew_impl.hh:396
> #2 0x000000000091592f in FullO3CPU<O3CPUImpl>::drainSanityCheck (
> this=0x354d410) at build/ALPHA_MESI_CMP_directory/cpu/o3/cpu.cc:1187
> #3 0x0000000000929cd4 in FullO3CPU<O3CPUImpl>::drain (this=0x354d410, 
> drain_manager=0x2eb00f0)
> at build/ALPHA_MESI_CMP_directory/cpu/o3/cpu.cc:1157
> #4 0x0000000000b912bd in _wrap_Drainable_drain (args=0x28aaa70)
> at build/ALPHA_MESI_CMP_directory/python/swig/drain_wrap.cc:3142
> #5 0x0000003958ed55c6 in PyEval_EvalFrameEx ()
> from /usr/lib64/libpython2.6.so.1.0
> #6 0x0000003958ed7657 in PyEval_EvalCodeEx ()
> from /usr/lib64/libpython2.6.so.1.0
> #7 0x0000003958ed5aa4 in PyEval_EvalFrameEx ()
> from /usr/lib64/libpython2.6.so.1.0
> #8 0x0000003958e60917 in ?? () from /usr/lib64/libpython2.6.so.1.0
> #9 0x0000003958e42b4b in PyIter_Next () from /usr/lib64/libpython2.6.so.1.0
> #10 0x0000003958ecb036 in ?? () from /usr/lib64/libpython2.6.so.1.0
> #11 0x0000003958ed59e4 in PyEval_EvalFrameEx ()
> from /usr/lib64/libpython2.6.so.1.0
> #12 0x0000003958ed7657 in PyEval_EvalCodeEx ()
> 
> ....
> 
> 
> In cpu/o3/iew_impl.hh, in executeInsts(), the instructions to execute are 
> taken from the instsToExecute list. 
> However, the number of instructions it wants to execute is taken from the 
> issue stage via fromIssue->size (cpu/o3/iew_impl.hh:1221), instead of from 
> the number of instructions actually waiting to be executed. Therefore the 
> instsToExecute list does not become empty, and the assertion fires.
> 
> 
> 
> In cpu/o3/inst_queue_impl.hh there are three push_backs onto the 
> instsToExecute list but only one pop, when it fetches an instruction for 
> execution.
> 
> 
> 
> 
> 
> 3. GDB 
> Using gdb, once the run gets into any of the *_wrap.cc files in the 
> python/swig directory, the run does not come back to where it left off, or 
> to where other functions are called, unless there is a breakpoint somewhere 
> in the code.
> It shows the following: 
> 
> 
> Single stepping until exit from function PyEval_EvalFrameEx,
> which has no line number information.
> 0x00007ffff771c6b5 in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
> 
> (gdb) 
> Single stepping until exit from function PyEval_EvalCodeEx,
> which has no line number information.
> 0x00007ffff775c650 in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
> 
> 
> The simulation continues in the background and there is no way to get back 
> inside the simulator code. 
> 
> 
> Could anyone explain this behavior, or point me to some useful 
> documentation?
> 
> 
> 
> 
> Thanks,
> 
> -Pushkar
_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
