Hi George,

The system you're simulating is quite a stress test for the Ruby protocol
you're using! What protocol have you compiled?

The problem you're running into could be very simple. It's possible that
due to the high bandwidth of the system, some of the queues in Ruby are
filling up and causing the average memory access latency to skyrocket due
to queuing delays. If this happens, the protocol could be "correct" but
still trip the deadlock detector. In this case, you may be able to increase
the Sequencer's deadlock threshold and see the application start to work again. We
often see this with GPU workloads.
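
As a rough sanity check, you can turn the numbers in the panic message into
cycles to see how long the request had been outstanding. This is only a
sketch: it assumes gem5's default resolution of 1 tick = 1 ps and uses the
1 GHz value from --cpu-clock, and the clock that actually drives the
Sequencer may be different. The helper name is made up for illustration.

```python
# Values copied from the panic message in the original post.
CURRENT_TICK = 5677053833000
ISSUE_TICK = 5676227351000

# Assumption: gem5's default tick resolution of 1 ps (1e12 ticks per second).
TICKS_PER_SECOND = 10**12


def ticks_to_cycles(ticks, clock_hz):
    """Convert a tick count to whole clock cycles at the given frequency."""
    ticks_per_cycle = TICKS_PER_SECOND // clock_hz
    return ticks // ticks_per_cycle


outstanding_ticks = CURRENT_TICK - ISSUE_TICK
# Assumption: 1 GHz, taken from --cpu-clock; the Ruby clock may differ.
outstanding_cycles = ticks_to_cycles(outstanding_ticks, 10**9)

print("outstanding ticks: ", outstanding_ticks)
print("outstanding cycles:", outstanding_cycles)
```

The tick count matches the "difference" field in the panic message. If the
request is merely slow rather than stuck, the threshold would need to sit
comfortably above the resulting cycle count. I believe the parameter is
deadlock_threshold on the sequencer, set in the protocol's config script,
but please check that against your tree.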

However, it's more likely a bug somewhere in the protocol you're using. To
debug this, you'll need to dig into the protocol. The debug flag
"ProtocolTrace" is useful here. With this debug flag you'll see every
transition in Ruby. With this information you should be able to trace back
and find the memory operation that's causing the deadlock. I would also
suggest using "--debug-start=<tick>" with the highest tick value you can
pick that is still before the offending operation was issued (e.g., a little
less than 5676227351000, the issue_time in your panic message). Otherwise the
trace may be 10s-100s of GB (and take days to generate).
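
Here is one way you might assemble the debug command line. It is a sketch,
not a recipe: the margin of 1,000,000 ticks and the trace file name are
arbitrary choices of mine, and the base command is abbreviated, so append the
rest of your own fs.py options. The gem5 options (--debug-flags,
--debug-start, --debug-file) go between the binary and the config script.

```python
import shlex

ISSUE_TICK = 5676227351000  # issue_time from the panic message

# Hypothetical margin: start tracing this many ticks before the request issues.
MARGIN_TICKS = 10**6


def trace_command(base_args, issue_tick, margin=MARGIN_TICKS):
    """Return a gem5 command line with ProtocolTrace enabled shortly before issue_tick."""
    start = max(issue_tick - margin, 0)
    gem5_opts = [
        "--debug-flags=ProtocolTrace",
        "--debug-start=%d" % start,
        "--debug-file=protocol_trace.out",  # hypothetical file name
    ]
    # gem5 options are placed after the binary and before the config script.
    return [base_args[0]] + gem5_opts + base_args[1:]


# Abbreviated base command; the remaining fs.py options are omitted here.
base = ["./build/X86/gem5.opt", "configs/example/fs.py", "--ruby"]
cmd = trace_command(base, ISSUE_TICK)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

If the trace does not reach far enough back to show where the line first got
into a bad state, increase the margin and rerun.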

Hopefully this helps you get on the right track. Good luck!

Jason

On Wed, Aug 10, 2016 at 6:50 PM George Mappouras <
georgios.mappou...@duke.edu> wrote:

> Hi all,
>
> I had some trouble while running Parsec benchmarks with gem5 + Ruby (using
> MESI two level protocol). I found out that some of the benchmarks will
> cause gem5 to crash because a deadlock was detected. The configuration I
> use is the following:
>
> I have 8 nodes connected to a ring. Each node is a core connected with a
> private 64KB L1 cache and one channel of High Bandwidth Memory (HBM). Also
> each core has one out of 8 banks of the shared 8MB L2 cache connected to
> them. The command I run looks like this:
>
> * ./build/X86/gem5.opt configs/example/fs.py
> --disk-image=x86root-parsec.img --kernel=x86_64-vmlinux-2.6.22.9.smp -n 8
> --cpu-type=detailed --cpu-clock=1GHz --caches --l1d_size=64kB
> --num-l2caches=8 --l2_size=8MB --mem-type=HBM_1000_4H_x128 --mem-channels=8
> --mem-size=2GB --ruby --num-dirs=8 --topology=Torus --mesh-rows=1
> --access-backing-store --script=a_parsec_script.sh*
>
> I use the latest version of gem5 and I have no problem booting or running
> commands on the simulated machine. However, as I mentioned above, some
> benchmarks cause gem5 to crash with a message like this:
>
> *panic: Possible Deadlock detected. Aborting!*
> *version: 1 request.paddr: 0x53d6a000 m_writeRequestTable: 1 current time:
> 5677053833000 issue_time: 5676227351000 difference: 826482000*
> * @ tick 5677053833000*
> *[wakeup:build/X86/mem/ruby/system/Sequencer.cc, line 119]*
> *Memory Usage: 5799824 KBytes*
> *Program aborted at tick 5677053833000*
> *--- BEGIN LIBC BACKTRACE ---*
> *./build/X86/gem5.opt(_Z15print_backtracev+0x15)[0x9f8275]*
> *./build/X86/gem5.opt(_Z12abortHandleri+0x36)[0xa09536]*
> */lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f3dd5cdf330]*
> */lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f3dd4530c37]*
> */lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f3dd4534028]*
> *./build/X86/gem5.opt(_Z15__exit_epilogueiPKcS0_iS0_+0x1ec)[0x9bba6c]*
>
> *./build/X86/gem5.opt(_Z14__exit_messageIIjmmmmmEEvPKciS1_S1_iS1_DpRKT_+0xc7)[0x93ca27]*
> *./build/X86/gem5.opt(_ZN9Sequencer6wakeupEv+0x266)[0x93a936]*
> *./build/X86/gem5.opt(_ZN10EventQueue10serviceOneEv+0xb1)[0xa018e1]*
> *./build/X86/gem5.opt(_Z9doSimLoopP10EventQueue+0x38)[0xa22938]*
> *./build/X86/gem5.opt(_Z8simulatem+0x1fb)[0xa22ebb]*
> *./build/X86/gem5.opt[0x969d7c]*
>
> */usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x45f7)[0x7f3dd58f7af7]*
>
> */usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f3dd58f954d]*
>
> */usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x48d8)[0x7f3dd58f7dd8]*
>
> */usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4b59)[0x7f3dd58f8059]*
>
> */usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4b59)[0x7f3dd58f8059]*
>
> */usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f3dd58f954d]*
>
> */usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCode+0x32)[0x7f3dd58f9682]*
>
> */usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x563e)[0x7f3dd58f8b3e]*
>
> */usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f3dd58f954d]*
>
> */usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x48d8)[0x7f3dd58f7dd8]*
>
> */usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f3dd58f954d]*
>
> */usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCode+0x32)[0x7f3dd58f9682]*
>
> */usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyRun_StringFlags+0x79)[0x7f3dd58f34b9]*
> *./build/X86/gem5.opt(_Z6m5MainiPPc+0x5f)[0xa08caf]*
> *./build/X86/gem5.opt(main+0x33)[0x701933]*
> */lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f3dd451bf45]*
> *./build/X86/gem5.opt[0x725e83]*
> *--- END LIBC BACKTRACE ---*
>
> Can anyone help me figure out what the problem is? Am I missing something?
> Does my system configuration match the command I run? I would appreciate any
> help!
>
> Thanks,
> George
>
> _______________________________________________
> gem5-users mailing list
> gem5-users@gem5.org
> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users