Hi Jiuyue, Thanks for sharing that information. I applied your patch and the problem appears to be fixed! I will therefore continue to use the classic memory model with X86 FS detailed CPUs and will post if I run into any further issues. So it appears that this was a kernel bug and not a GEM5 bug.
Ivan

On Mon, Jun 9, 2014 at 4:37 AM, 马久跃 (Jiuyue Ma) via gem5-users <gem5-users@gem5.org> wrote:

> Hi all,
>
> I found that someone in the linux-rt community had reported the same uart
> deadlock problem (http://www.spinics.net/lists/linux-rt-users/msg09246.html).
> They point out a recursive deadlock in the uart driver; the call stack is
> shown below:
>
> ------------------------------------------------------------
> mpc52xx_uart_int()
>   lock(port->lock);
>   mpc52xx_psc_handle_irq()
>     mpc52xx_uart_int_tx_chars()
>       uart_write_wakeup()
>         tty_wakeup()
>           hci_uart_tx_wakeup()
>             len = tty->ops->write(tty, skb->data, skb->len);
>
> The associated write function is uart_write:
>
>             uart_write()
>               lock(port->lock)   --> deadlock
> ------------------------------------------------------------
>
> It seems the uart driver uses a single lock (port->lock) for both "struct
> uart_port" and "struct uart_info", which causes the recursive deadlock.
>
> I have made a patch (based on the linux-2.6.22 kernel) for this problem.
> It does the following to avoid the deadlock:
> - add an extra lock for struct uart_info
> - protect struct uart_info with port->info->lock instead of port->lock
>
> It works fine for me, but I'm not sure whether it really solves this
> system-hang problem. You guys can try this patch.
>
> Jiuyue Ma
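For readers who want the shape of the fix without the patch itself, here is a
minimal user-space C sketch of the locking change Jiuyue describes. All names
are illustrative, and pthread mutexes stand in for the kernel's spinlock_t and
spin_lock_irqsave; the point is only that the write path reached from the
interrupt handler must take a different lock than the one the handler already
holds.

#include <pthread.h>

/* Illustrative sketch of the two-lock fix; not the actual 2.6.22 patch.
 * Mutex initialization is omitted for brevity. */
struct uart_info_sketch {
    pthread_mutex_t lock;        /* new: separate lock for uart_info state */
    /* ... transmit circular buffer, flags, ... */
};

struct uart_port_sketch {
    pthread_mutex_t lock;        /* the original port->lock */
    struct uart_info_sketch *info;
};

/* Before the patch this took port->lock, so calling it from the interrupt
 * handler (which already holds port->lock) self-deadlocked. */
static void uart_write_sketch(struct uart_port_sketch *port,
                              const char *buf, int len)
{
    pthread_mutex_lock(&port->info->lock);   /* was: &port->lock */
    /* ... copy buf into the transmit buffer ... */
    (void)buf; (void)len;
    pthread_mutex_unlock(&port->info->lock);
}

static void uart_irq_sketch(struct uart_port_sketch *port)
{
    pthread_mutex_lock(&port->lock);
    /* tx ready -> uart_write_wakeup() -> tty_wakeup() -> ...
     * -> uart_write_sketch(): safe now, because it takes
     * info->lock rather than port->lock a second time. */
    uart_write_sketch(port, "x", 1);
    pthread_mutex_unlock(&port->lock);
}

The trade-off is the usual one when splitting a lock: every place that touched
uart_info under port->lock has to be audited, which is presumably why Jiuyue
hedges about whether the patch fully solves the hang.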
> ------------------------------
> To: ids...@psu.edu; gem5-users@gem5.org; emilio.casti...@unican.es
> Date: Mon, 9 Jun 2014 08:49:22 +0100
> Subject: Re: [gem5-users] System Hangs
> From: gem5-users@gem5.org
>
> Hi all,
>
> It would be valuable if someone actually dug in and fixed the issue so that
> the non-Ruby memory system works for X86. I would imagine that the
> speed/fidelity trade-off of the classic memory system appeals to a wider
> audience. If not, there is always ARM :-).
>
> Andreas
>
> From: Ivan Stalev via gem5-users <gem5-users@gem5.org>
> Reply-To: Ivan Stalev <ids...@psu.edu>, gem5 users mailing list <gem5-users@gem5.org>
> Date: Monday, 9 June 2014 06:29
> To: "Castillo Villar, Emilio" <emilio.casti...@unican.es>
> Cc: gem5 users mailing list <gem5-users@gem5.org>
> Subject: Re: [gem5-users] System Hangs
>
> Hi,
>
> Thank you for verifying and reproducing the bug. I am now attempting to
> run with Ruby as you suggested; however, I am getting a segfault during
> boot-up. I compile like this:
>
> scons build/X86/gem5.fast -j 12 PROTOCOL=MOESI_hammer
>
> and then boot like this:
>
> build/X86/gem5.fast -d m5out/test_run configs/example/fs.py \
>   --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp \
>   --ruby -n 2 --mem-size=4GB --cpu-type=detailed --cpu-clock=2GHz \
>   --script=rcs_scripts/run.rcS --caches --l2cache --num-l2caches=1 \
>   --l1d_size=32kB --l1i_size=32kB --l1d_assoc=4 --l1i_assoc=4 \
>   --l2_size=4MB --l2_assoc=8 --cacheline_size=64 --max-checkpoints=1
>
> The last line in system.pc.com_1.terminal is "Kernel command line:
> earlyprintk=ttyS0 console=ttyS0 lpj=7999923 root=/dev/hda1". Comparing with
> the system.pc.com_1.terminal of a successful boot, the next line should be
> "Initializing CPU#0".
>
> Reading through the forums, there seem to have been some previous issues
> with Ruby X86 FS, specific protocols, and checkpointing (e.g. checkpointing
> with atomic CPUs and restoring with detailed). Can you suggest a working
> configuration (i.e. which protocol, and how to checkpoint)? Essentially, I
> need to run the same setup as the one I tried with the classic model, but
> with minimal simulation-time overhead from using Ruby.
>
> Thanks,
>
> Ivan
>
> On Sun, Jun 8, 2014 at 6:44 PM, Castillo Villar, Emilio
> <emilio.casti...@unican.es> wrote:
>
> Hello,
>
> Mr. Hestness is right: simulations do not make progress once the output
> has hung. The CPUs keep executing some code and committing instructions,
> but in almost all cases they are spinning on locks. So although the CPUs
> are actually executing, the simulation is completely frozen. That's what I
> meant by "instructions are still being committed"; sorry for not being
> clear enough.
>
> Just running FS with the classic memory system and doing a
> "cat /proc/cpuinfo" will crash the system.
>
> Kind regards.
>
> ------------------------------
> From: Joel Hestness [jthestn...@gmail.com]
> Sent: Sunday, 8 June 2014 23:27
> To: Castillo Villar, Emilio
> CC: Ivan Stalev; gem5 users mailing list
> Subject: Re: [gem5-users] System Hangs
>
> Hi guys,
> I've been able to reproduce Ivan's issue in the latest gem5 (rev. 10235).
> It seems this may be the same bug as a report that I filed about a year ago
> <http://flyspray.gem5.org/index.php?do=details&task_id=7>. Previously, I
> had overlooked that Ivan's tests were using the classic memory system (and
> frankly, I had forgotten I had submitted that bug report). I'll second
> Emilio and recommend using Ruby for now.
>
> For anyone's future reference: contrary to Emilio's statement, I'm not
> sure that simulations necessarily make progress after terminal output
> interrupts are lost with the classic memory model. It is possible that
> unimplemented x86 atomics in the classic memory hierarchy are the problem,
> and if so, many other problems besides hung terminal output could arise.
>
> Joel
>
> On Fri, Jun 6, 2014 at 10:49 AM, Castillo Villar, Emilio
> <emilio.casti...@unican.es> wrote:
>
> Hello,
>
> I have seen similar issues when running X86 timing and detailed CPUs with
> the classic memory system, mostly due to X86 atomic memory accesses not
> being implemented. The stdout freezes but instructions are still being
> committed.
>
> If you want to run with timing or detailed CPUs in X86 FS multi-core, I am
> afraid you will need to use Ruby.
>
> Emilio
> ------------------------------
> From: gem5-users [gem5-users-boun...@gem5.org] on behalf of Ivan Stalev
> via gem5-users [gem5-users@gem5.org]
> Sent: Friday, 6 June 2014 1:14
> To: Joel Hestness
> CC: gem5 users mailing list
> Subject: Re: [gem5-users] System Hangs
>
> Hi Joel,
>
> Thanks for getting back to me.
>
> I ran it again with the ProtocolTrace flag, and the only output there is:
> 0: rtc: Real-time clock set to Sun Jan 1 00:00:00 2012
>
> With the Exec flag, I do see spinlock output on and off in the beginning
> during regular execution, so that is normal as you said. But once the
> "problem" occurs shortly after, the Exec output is just continuous
> spinlock forever, as I posted previously.
>
> The exact gem5 command lines I use are in my previous post. The kernel and
> disk image are simply the default ones from the GEM5 downloads page:
> http://www.m5sim.org/dist/current/x86/x86-system.tar.bz2
>
> I have attached a zip file containing the following files:
>
> BOOT-config.ini - The config.ini from the first run, i.e. booting in
>   atomic mode, creating a checkpoint, and exiting
> BOOT-system.pc.com_1.terminal - The terminal output from the first run
> CPT-config.ini - The config.ini when restoring from the checkpoint in
>   detailed mode
> CPT-system.pc.com_1.terminal - The system output after restoring from the
>   checkpoint
> run.c - The dummy program started by the run script
> run.rcS - The run script
> flag-exec-partial.out - The Exec-flag output from right before the
>   "problem" occurs; the infinite spin lock starts at tick 5268700121500
>
> Again, this problem occurs even without checkpointing. I have also tried a
> few different kernels and disk images. I ran the same test with both alpha
> and arm64 and it works, so it appears to be an issue with x86 only.
>
> Thank you,
>
> Ivan
>
> On Tue, Jun 3, 2014 at 7:53 PM, Joel Hestness <jthestn...@gmail.com> wrote:
>
> Hi Ivan,
> Sorry for the delay on this.
>
> I haven't had an opportunity to try to reproduce your problem, though the
> traces you've supplied here help a bit. Specifically, the stalled
> LocalApics (plural, because there are 2 CPU cores) are fishy, because we'd
> expect periodic interrupts to continue. However, the last interrupt on
> CPU 1 appears to get cleared, which looks fine. The CPU spin lock is
> normal for threads that don't have any work to complete, but it's
> confusing why they wouldn't be doing something.
>
> The next thing to dig into would be figuring out what the CPUs were doing
> just before they entered the spin loop. For this we may need to trace a
> bit earlier in time using the Exec flags, and since it is likely that
> messages/responses are getting lost in the memory hierarchy or devices,
> we'll need the ProtocolTrace flag to see what is being communicated. You
> could try playing around with these as a start.
>
> I may also have time to try to reproduce this over the next week, so I'm
> hoping you could give me some more information: can you send me your gem5
> command line, config.ini, and system.pc.com_1.terminal output from your
> simulation, and details on the kernel and disk image that you're trying
> to use?
>
> Thanks!
> Joel
>
> On Sat, May 24, 2014 at 7:27 PM, Ivan Stalev <ids...@psu.edu> wrote:
>
> Hi,
>
> Has anyone been able to reproduce this issue?
>
> Thanks,
>
> Ivan
>
> On Sat, May 17, 2014 at 1:50 AM, Ivan Stalev <ids...@psu.edu> wrote:
>
> Hi Joel,
>
> I am using revision 10124. I removed all of my own modifications just to
> be safe.
> Running with gem5.opt and restoring from a boot-up checkpoint with
> --debug-flag=Exec, it appears that the CPUs are stuck in an infinite
> loop, executing this continuously:
>
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
> 5268959012000: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
> 5268959013000: system.switch_cpus1 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :
>
> ...and so on, repeating without ever stopping.
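To help decode that trace: the repeating CMP_M_I/JLE_I micro-op pairs are the
wait loop of the kernel's spinlock slow path, re-reading the lock word at
0xffffffff80822400 and branching back while it is non-positive. The
D=0x00000000fffffffe value is the lock word at -2, consistent with a lock that
is held, has waiters, and is never released. A rough user-space C sketch of
the pattern (illustrative only; the 2.6.22 kernel does this in inline
assembly):

#include <stdatomic.h>

/* Sketch of the classic x86 spinlock protocol seen in the trace.
 * A free lock holds 1; acquiring atomically decrements it and checks
 * the old value; waiters spin on a plain read ("cmp [lock],0 / jle
 * back"), which is exactly the repeating CMP/JLE loop above. */
static void spin_lock_sketch(atomic_int *lock)     /* initialize to 1 */
{
    while (atomic_fetch_sub(lock, 1) != 1) {       /* lock; decl */
        while (atomic_load(lock) <= 0)             /* cmp / jle */
            ;                                      /* busy-wait */
    }
}

static void spin_unlock_sketch(atomic_int *lock)
{
    atomic_store(lock, 1);                         /* movl $1, lock */
}

If the holder never reaches spin_unlock_sketch(), for instance because it is
waiting on an interrupt or a memory response that was lost, every other core
loops forever in the inner while. That matches the
committed-instructions-but-no-progress behavior described earlier in the
thread.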
> Using --debug-flag=LocalApic, the output does indeed stop shortly after
> restoring from the checkpoint. The last output is:
>
> ...
> 5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
> 5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
> 5269570990500: system.cpu1.interrupts: Generated regular interrupt fault object.
> 5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
> 5269570990500: system.cpu1.interrupts: Interrupt 239 sent to core.
> 5269571169000: system.cpu1.interrupts: Writing Local APIC register 5 at offset 0xb0 as 0.
>
> ...and no more output from this point on.
>
> I appreciate your help tremendously.
>
> Ivan
>
> On Fri, May 16, 2014 at 11:18 AM, Joel Hestness <jthestn...@gmail.com> wrote:
>
> Hi Ivan,
> I believe that the email thread you previously referenced was related to a
> bug that we identified and fixed with changeset 9624
> <http://permalink.gmane.org/gmane.comp.emulators.m5.devel/19326>. That bug
> was causing interrupts to be dropped in x86 when running with the O3 CPU.
> Are you using a version of gem5 after that changeset? If not, I'd
> recommend updating to a more recent version and trying to replicate this
> issue again.
>
> If you are using a more recent version of gem5, first, please let us know
> which changeset and whether you've made any changes. Then, I'd recommend
> compiling gem5.opt and using the DPRINTF tracing functionality to see if
> you can zero in on what is happening. Specifically, first try passing the
> flag --debug-flag=Exec to look at what the CPU cores are executing (you
> may also want to pass --trace-start=<tick> with a simulator tick time
> close to when the hang happens). This trace will include Linux kernel
> symbols for at least some of the lines if executing in the kernel (e.g.
> handling an interrupt). If you've compiled your benchmark without
> debugging symbols, it may just show the memory addresses of instructions
> being executed within the application. I would guess that you'll see
> kernel symbols for at least some of the executed instructions for
> interrupts.
>
> If it appears that the CPUs are continuing to execute, try running with
> --debug-flag=LocalApic. This will print the interrupts that each core is
> receiving, and if it stops printing at any point, something has gone
> wrong and we'll have to do some deeper digging.
>
> Keep us posted on what you find,
> Joel
>
> On Thu, May 15, 2014 at 11:16 PM, Ivan Stalev <ids...@psu.edu> wrote:
>
> Hi Joel,
>
> I have tried several different kernels and disk images, including the
> default ones provided on the GEM5 website in the x86-system.tar.bz2
> download. I run with these commands:
>
> build/X86/gem5.fast -d m5out/test_run configs/example/fs.py \
>   --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp \
>   -n 2 --mem-size=4GB --cpu-type=atomic --cpu-clock=2GHz \
>   --script=rcs_scripts/run.rcS --max-checkpoints=1
>
> My run.rcS script simply contains:
>
> #!/bin/sh
> /sbin/m5 resetstats
> /sbin/m5 checkpoint
> echo 'booted'
> /extras/run
> /sbin/m5 exit
>
> where /extras/run is a C program with an infinite loop that prints a
> counter.
>
> I then restore:
>
> build/X86/gem5.fast -d m5out/test_run configs/example/fs.py \
>   --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp \
>   -r 1 -n 2 --mem-size=4GB --cpu-type=detailed --cpu-clock=2GHz --caches \
>   --l2cache --num-l2caches=1 --l1d_size=32kB --l1i_size=32kB \
>   --l1d_assoc=4 --l1i_assoc=4 --l2_size=4MB --l2_assoc=8 --cacheline_size=64
>
> I specified the disk image file in Benchmarks.py. Restoring from the same
> checkpoint and running in atomic mode works fine. I also tried booting the
> system in detailed mode and letting it run for a while, but once it boots,
> there is no more output. So it seems that checkpointing is not the issue.
> The "run" program is just a dummy, and the same issue also persists when
> running SPEC benchmarks or any other program.
> My dummy program is simply:
>
> #include <stdio.h>
> int main(void) {
>     int count = 0;
>     printf("**************************** HEYY \n");
>     while (1)
>         printf("\n %d \n", count++);
> }
>
> Letting it run for a while, the only output is exactly this:
>
> booted
> *******
>
> It doesn't even finish printing the first printf.
>
> Another thing to add: in another scenario, I modified the kernel to call
> an m5 pseudo instruction on every context switch, and then GEM5 prints
> that a context switch occurred. Once again, in atomic mode this worked as
> expected. However, in detailed mode, even the GEM5 output (a printf inside
> GEM5 itself) stopped along with the system output in the terminal.
>
> Thank you for your help.
>
> Ivan
>
> On Thu, May 15, 2014 at 10:51 PM, Joel Hestness <jthestn...@gmail.com> wrote:
>
> Hi Ivan,
> Can you please give more detail on what you're running? Specifically, can
> you give your command line, and say which kernel and disk image you're
> using? Are you using checkpointing?
>
> Joel
>
> On Mon, May 12, 2014 at 10:52 AM, Ivan Stalev via gem5-users
> <gem5-users@gem5.org> wrote:
>
> Hello,
>
> I am running X86 in full system mode. When running just 1 CPU, both
> atomic and detailed mode work fine. However, with more than 1 CPU, atomic
> works fine, but in detailed mode the system appears to hang shortly after
> boot-up. GEM5 doesn't crash, but the system stops producing any output.
> Looking at the stats, it appears that instructions are still being
> committed, but the actual applications/benchmarks are not making
> progress. The issue persists with the latest version of GEM5. I also
> tried two different kernel versions and several different disk images.
>
> I might be experiencing what seems to be the same issue that was found
> about a year ago but appears not to have been fixed:
> https://www.mail-archive.com/gem5-dev@gem5.org/msg08839.html
>
> Can anyone reproduce this or know of a solution?
>
> Thank you,
>
> Ivan
>
> --
> Joel Hestness
> PhD Student, Computer Architecture
> Dept. of Computer Science, University of Wisconsin - Madison
> http://pages.cs.wisc.edu/~hestness/
_______________________________________________
gem5-users mailing list
gem5-users@gem5.org
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users