Hi Jiuyue, Thanks for sharing that information. I applied your patch and the problem appears to be fixed! I will therefore continue to use the classic memory model with X86 FS detailed CPUs and will post if I run into any further issues. So it appears that this was a kernel bug and not a GEM5 bug.
Ivan

On Mon, Jun 9, 2014 at 4:37 AM, 马久跃 (Jiuyue Ma) via gem5-users <gem5-users@gem5.org> wrote:

> Hi all,
>
> I found that someone in the linux-rt community had reported the same uart
> deadlock problem (http://www.spinics.net/lists/linux-rt-users/msg09246.html).
> They point out a recursive deadlock in the uart driver; the call stack is
> shown below:
>
> ------------------------------------------------------------
> mpc52xx_uart_int()
>   lock(port->lock);
>   mpc52xx_psc_handle_irq()
>     mpc52xx_uart_int_tx_chars()
>       uart_write_wakeup()
>         tty_wakeup()
>           hci_uart_tx_wakeup()
>             len = tty->ops->write(tty, skb->data, skb->len);
>
> The associated write function is uart_write:
>
>             uart_write()
>               lock(port->lock)   --> deadlock
> ------------------------------------------------------------
>
> It seems the uart driver uses a single lock (port->lock) for both "struct
> uart_port" and "struct uart_info", which causes the recursive deadlock.
>
> I have made a patch (based on the linux-2.6.22 kernel) for this problem.
> It does the following to avoid the deadlock:
> - add an extra lock for struct uart_info
> - protect struct uart_info with port->info->lock instead of port->lock
>
> It works fine for me, but I'm not sure whether it really solves this
> system-hang problem. You guys can try this patch.
>
> Jiuyue Ma
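For readers who want the shape of the fix without the patch itself, here is a
minimal user-space C sketch of the locking change Jiuyue describes. All names
are illustrative, and pthread mutexes stand in for the kernel's spinlock_t and
spin_lock_irqsave; the point is only that the write path reached from the
interrupt handler must take a different lock than the one the handler already
holds.

#include <pthread.h>

/* Illustrative sketch of the two-lock fix; not the actual 2.6.22 patch.
 * Mutex initialization is omitted for brevity. */
struct uart_info_sketch {
    pthread_mutex_t lock;        /* new: separate lock for uart_info state */
    /* ... transmit circular buffer, flags, ... */
};

struct uart_port_sketch {
    pthread_mutex_t lock;        /* the original port->lock */
    struct uart_info_sketch *info;
};

/* Before the patch this took port->lock, so calling it from the interrupt
 * handler (which already holds port->lock) self-deadlocked. */
static void uart_write_sketch(struct uart_port_sketch *port,
                              const char *buf, int len)
{
    pthread_mutex_lock(&port->info->lock);   /* was: &port->lock */
    /* ... copy buf into the transmit buffer ... */
    (void)buf; (void)len;
    pthread_mutex_unlock(&port->info->lock);
}

static void uart_irq_sketch(struct uart_port_sketch *port)
{
    pthread_mutex_lock(&port->lock);
    /* tx ready -> uart_write_wakeup() -> tty_wakeup() -> ...
     * -> uart_write_sketch(): safe now, because it takes
     * info->lock rather than port->lock a second time. */
    uart_write_sketch(port, "x", 1);
    pthread_mutex_unlock(&port->lock);
}

The trade-off is the usual one when splitting a lock: every place that touched
uart_info under port->lock has to be audited, which is presumably why Jiuyue
hedges about whether the patch fully solves the hang.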
> ------------------------------
> To: ids...@psu.edu; gem5-users@gem5.org; emilio.casti...@unican.es
> Date: Mon, 9 Jun 2014 08:49:22 +0100
> Subject: Re: [gem5-users] System Hangs
> From: gem5-users@gem5.org
>
> Hi all,
>
> It would be valuable if someone actually dug in and fixed the issue so that
> the non-Ruby memory system works for X86. I would imagine that the
> speed/fidelity trade-off of the classic memory system appeals to a wider
> audience. If not, there is always ARM :-).
>
> Andreas
>
> From: Ivan Stalev via gem5-users <gem5-users@gem5.org>
> Reply-To: Ivan Stalev <ids...@psu.edu>, gem5 users mailing list <gem5-users@gem5.org>
> Date: Monday, 9 June 2014 06:29
> To: "Castillo Villar, Emilio" <emilio.casti...@unican.es>
> Cc: gem5 users mailing list <gem5-users@gem5.org>
> Subject: Re: [gem5-users] System Hangs
>
> Hi,
>
> Thank you for verifying and reproducing the bug. I am now attempting to
> run with Ruby as you suggested; however, I am getting a segfault during
> boot-up. I compile like this:
>
> scons build/X86/gem5.fast -j 12 PROTOCOL=MOESI_hammer
>
> and then boot like this:
>
> build/X86/gem5.fast -d m5out/test_run configs/example/fs.py \
>   --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp \
>   --ruby -n 2 --mem-size=4GB --cpu-type=detailed --cpu-clock=2GHz \
>   --script=rcs_scripts/run.rcS --caches --l2cache --num-l2caches=1 \
>   --l1d_size=32kB --l1i_size=32kB --l1d_assoc=4 --l1i_assoc=4 \
>   --l2_size=4MB --l2_assoc=8 --cacheline_size=64 --max-checkpoints=1
>
> The last line in system.pc.com_1.terminal is "Kernel command line:
> earlyprintk=ttyS0 console=ttyS0 lpj=7999923 root=/dev/hda1". Comparing with
> the system.pc.com_1.terminal of a successful boot, the next line should be
> "Initializing CPU#0".
>
> Reading through the forums, there seem to have been some previous issues
> with Ruby X86 FS, specific protocols, and checkpointing (e.g. checkpointing
> with atomic CPUs and restoring with detailed). Can you suggest a working
> configuration (i.e. which protocol, and how to checkpoint)? Essentially, I
> need to run the same setup as the one I tried with the classic model, but
> with minimal simulation-time overhead from using Ruby.
>
> Thanks,
>
> Ivan
>
> On Sun, Jun 8, 2014 at 6:44 PM, Castillo Villar, Emilio
> <emilio.casti...@unican.es> wrote:
>
> Hello,
>
> Mr. Hestness is right: simulations do not make progress once the output
> has hung. The CPUs keep executing some code and committing instructions,
> but in almost all cases they are spinning on locks. So although the CPUs
> are actually executing, the simulation is completely frozen. That's what I
> meant by "instructions are still being committed"; sorry for not being
> clear enough.
>
> Just running FS with the classic memory system and doing a
> "cat /proc/cpuinfo" will crash the system.
>
> Kind regards.
>
> ------------------------------
> From: Joel Hestness [jthestn...@gmail.com]
> Sent: Sunday, 8 June 2014 23:27
> To: Castillo Villar, Emilio
> CC: Ivan Stalev; gem5 users mailing list
> Subject: Re: [gem5-users] System Hangs
>
> Hi guys,
> I've been able to reproduce Ivan's issue in the latest gem5 (rev. 10235).
> It seems this may be the same bug as a report that I filed about a year ago
> <http://flyspray.gem5.org/index.php?do=details&task_id=7>. Previously, I
> had overlooked that Ivan's tests were using the classic memory system (and
> frankly, I had forgotten I had submitted that bug report). I'll second
> Emilio and recommend using Ruby for now.
>
> For anyone's future reference: contrary to Emilio's statement, I'm not
> sure that simulations necessarily make progress after terminal output
> interrupts are lost with the classic memory model. It is possible that
> unimplemented x86 atomics in the classic memory hierarchy are the problem,
> and if so, many other problems besides hung terminal output could arise.
>
> Joel
>
> On Fri, Jun 6, 2014 at 10:49 AM, Castillo Villar, Emilio
> <emilio.casti...@unican.es> wrote:
>
> Hello,
>
> I have seen similar issues when running X86 timing and detailed CPUs with
> the classic memory system, mostly due to X86 atomic memory accesses not
> being implemented. The stdout freezes but instructions are still being
> committed.
>
> If you want to run with timing or detailed CPUs in X86 FS multi-core, I am
> afraid you will need to use Ruby.
>
> Emilio
> ------------------------------
> From: gem5-users [gem5-users-boun...@gem5.org] on behalf of Ivan Stalev
> via gem5-users [gem5-users@gem5.org]
> Sent: Friday, 6 June 2014 1:14
> To: Joel Hestness
> CC: gem5 users mailing list
> Subject: Re: [gem5-users] System Hangs
>
> Hi Joel,
>
> Thanks for getting back to me.
>
> I ran it again with the ProtocolTrace flag, and the only output there is:
> 0: rtc: Real-time clock set to Sun Jan 1 00:00:00 2012
>
> With the Exec flag, I do see spinlock output on and off in the beginning
> during regular execution, so that is normal as you said. But once the
> "problem" occurs shortly after, the Exec output is just continuous
> spinlock forever, as I posted previously.
>
> The exact gem5 command lines I use are in my previous post. The kernel and
> disk image are simply the default ones from the GEM5 downloads page:
> http://www.m5sim.org/dist/current/x86/x86-system.tar.bz2
>
> I have attached a zip file containing the following files:
>
> BOOT-config.ini - The config.ini from the first run, i.e. booting in
>   atomic mode, creating a checkpoint, and exiting
> BOOT-system.pc.com_1.terminal - The terminal output from the first run
> CPT-config.ini - The config.ini when restoring from the checkpoint in
>   detailed mode
> CPT-system.pc.com_1.terminal - The system output after restoring from the
>   checkpoint
> run.c - The dummy program started by the run script
> run.rcS - The run script
> flag-exec-partial.out - The Exec-flag output from right before the
>   "problem" occurs; the infinite spin lock starts at tick 5268700121500
>
> Again, this problem occurs even without checkpointing. I have also tried a
> few different kernels and disk images. I ran the same test with both alpha
> and arm64 and it works, so it appears to be an issue with x86 only.
>
> Thank you,
>
> Ivan
>
> On Tue, Jun 3, 2014 at 7:53 PM, Joel Hestness <jthestn...@gmail.com> wrote:
>
> Hi Ivan,
> Sorry for the delay on this.
>
> I haven't had an opportunity to try to reproduce your problem, though the
> traces you've supplied here help a bit. Specifically, the stalled
> LocalApics (plural, because there are 2 CPU cores) are fishy, because we'd
> expect periodic interrupts to continue. However, the last interrupt on
> CPU 1 appears to get cleared, which looks fine. The CPU spin lock is
> normal for threads that don't have any work to complete, but it's
> confusing why they wouldn't be doing something.
>
> The next thing to dig into would be figuring out what the CPUs were doing
> just before they entered the spin loop. For this we may need to trace a
> bit earlier in time using the Exec flags, and since it is likely that
> messages/responses are getting lost in the memory hierarchy or devices,
> we'll need the ProtocolTrace flag to see what is being communicated. You
> could try playing around with these as a start.
>
> I may also have time to try to reproduce this over the next week, so I'm
> hoping you could give me some more information: can you send me your gem5
> command line, config.ini, and system.pc.com_1.terminal output from your
> simulation, and details on the kernel and disk image that you're trying
> to use?
>
> Thanks!
> Joel
>
> On Sat, May 24, 2014 at 7:27 PM, Ivan Stalev <ids...@psu.edu> wrote:
>
> Hi,
>
> Has anyone been able to reproduce this issue?
>
> Thanks,
>
> Ivan
>
> On Sat, May 17, 2014 at 1:50 AM, Ivan Stalev <ids...@psu.edu> wrote:
>
> Hi Joel,
>
> I am using revision 10124. I removed all of my own modifications just to
> be safe.
> Running with gem5.opt and restoring from a boot-up checkpoint with
> --debug-flag=Exec, it appears that the CPUs are stuck in an infinite
> loop, executing this continuously:
>
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
> 5268959012000: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
> 5268959013000: system.switch_cpus1 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :
>
> ...and so on, repeating without ever stopping.
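To help decode that trace: the repeating CMP_M_I/JLE_I micro-op pairs are the
wait loop of the kernel's spinlock slow path, re-reading the lock word at
0xffffffff80822400 and branching back while it is non-positive. The
D=0x00000000fffffffe value is the lock word at -2, consistent with a lock that
is held, has waiters, and is never released. A rough user-space C sketch of
the pattern (illustrative only; the 2.6.22 kernel does this in inline
assembly):

#include <stdatomic.h>

/* Sketch of the classic x86 spinlock protocol seen in the trace.
 * A free lock holds 1; acquiring atomically decrements it and checks
 * the old value; waiters spin on a plain read ("cmp [lock],0 / jle
 * back"), which is exactly the repeating CMP/JLE loop above. */
static void spin_lock_sketch(atomic_int *lock)     /* initialize to 1 */
{
    while (atomic_fetch_sub(lock, 1) != 1) {       /* lock; decl */
        while (atomic_load(lock) <= 0)             /* cmp / jle */
            ;                                      /* busy-wait */
    }
}

static void spin_unlock_sketch(atomic_int *lock)
{
    atomic_store(lock, 1);                         /* movl $1, lock */
}

If the holder never reaches spin_unlock_sketch(), for instance because it is
waiting on an interrupt or a memory response that was lost, every other core
loops forever in the inner while. That matches the
committed-instructions-but-no-progress behavior described earlier in the
thread.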
> Using --debug-flag=LocalApic, the output does indeed stop shortly after
> restoring from the checkpoint. The last output is:
>
> ...
> 5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
> 5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
> 5269570990500: system.cpu1.interrupts: Generated regular interrupt fault object.
> 5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
> 5269570990500: system.cpu1.interrupts: Interrupt 239 sent to core.
> 5269571169000: system.cpu1.interrupts: Writing Local APIC register 5 at offset 0xb0 as 0.
>
> ...and no more output from this point on.
>
> I appreciate your help tremendously.
>
> Ivan
>
> On Fri, May 16, 2014 at 11:18 AM, Joel Hestness <jthestn...@gmail.com> wrote:
>
> Hi Ivan,
> I believe that the email thread you previously referenced was related to a
> bug that we identified and fixed with changeset 9624
> <http://permalink.gmane.org/gmane.comp.emulators.m5.devel/19326>. That bug
> was causing interrupts to be dropped in x86 when running with the O3 CPU.
> Are you using a version of gem5 after that changeset? If not, I'd
> recommend updating to a more recent version and trying to replicate this
> issue again.
>
> If you are using a more recent version of gem5, first, please let us know
> which changeset and whether you've made any changes. Then, I'd recommend
> compiling gem5.opt and using the DPRINTF tracing functionality to see if
> you can zero in on what is happening. Specifically, first try passing the
> flag --debug-flag=Exec to look at what the CPU cores are executing (you
> may also want to pass --trace-start=<tick> with a simulator tick time
> close to when the hang happens). This trace will include Linux kernel
> symbols for at least some of the lines if executing in the kernel (e.g.
> handling an interrupt). If you've compiled your benchmark without
> debugging symbols, it may just show the memory addresses of instructions
> being executed within the application. I would guess that you'll see
> kernel symbols for at least some of the executed instructions for
> interrupts.
>
> If it appears that the CPUs are continuing to execute, try running with
> --debug-flag=LocalApic. This will print the interrupts that each core is
> receiving, and if it stops printing at any point, something has gone
> wrong and we'll have to do some deeper digging.
>
> Keep us posted on what you find,
> Joel
>
> On Thu, May 15, 2014 at 11:16 PM, Ivan Stalev <ids...@psu.edu> wrote:
>
> Hi Joel,
>
> I have tried several different kernels and disk images, including the
> default ones provided on the GEM5 website in the x86-system.tar.bz2
> download. I run with these commands:
>
> build/X86/gem5.fast -d m5out/test_run configs/example/fs.py \
>   --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp \
>   -n 2 --mem-size=4GB --cpu-type=atomic --cpu-clock=2GHz \
>   --script=rcs_scripts/run.rcS --max-checkpoints=1
>
> My run.rcS script simply contains:
>
> #!/bin/sh
> /sbin/m5 resetstats
> /sbin/m5 checkpoint
> echo 'booted'
> /extras/run
> /sbin/m5 exit
>
> where /extras/run is a C program with an infinite loop that prints a
> counter.
>
> I then restore:
>
> build/X86/gem5.fast -d m5out/test_run configs/example/fs.py \
>   --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp \
>   -r 1 -n 2 --mem-size=4GB --cpu-type=detailed --cpu-clock=2GHz --caches \
>   --l2cache --num-l2caches=1 --l1d_size=32kB --l1i_size=32kB \
>   --l1d_assoc=4 --l1i_assoc=4 --l2_size=4MB --l2_assoc=8 --cacheline_size=64
>
> I specified the disk image file in Benchmarks.py. Restoring from the same
> checkpoint and running in atomic mode works fine. I also tried booting the
> system in detailed mode and letting it run for a while, but once it boots,
> there is no more output. So it seems that checkpointing is not the issue.
> The "run" program is just a dummy, and the same issue also persists when
> running SPEC benchmarks or any other program.
> My dummy program is simply:
>
> #include <stdio.h>
> int main(void) {
>     int count = 0;
>     printf("**************************** HEYY \n");
>     while (1)
>         printf("\n %d \n", count++);
> }
>
> Letting it run for a while, the only output is exactly this:
>
> booted
> *******
>
> It doesn't even finish printing the first printf.
>
> Another thing to add: in another scenario, I modified the kernel to call
> an m5 pseudo instruction on every context switch, and then GEM5 prints
> that a context switch occurred. Once again, in atomic mode this worked as
> expected. However, in detailed mode, even the GEM5 output (a printf inside
> GEM5 itself) stopped along with the system output in the terminal.
>
> Thank you for your help.
>
> Ivan
>
> On Thu, May 15, 2014 at 10:51 PM, Joel Hestness <jthestn...@gmail.com> wrote:
>
> Hi Ivan,
> Can you please give more detail on what you're running? Specifically, can
> you give your command line, and say which kernel and disk image you're
> using? Are you using checkpointing?
>
> Joel
>
> On Mon, May 12, 2014 at 10:52 AM, Ivan Stalev via gem5-users
> <gem5-users@gem5.org> wrote:
>
> Hello,
>
> I am running X86 in full system mode. When running just 1 CPU, both
> atomic and detailed mode work fine. However, with more than 1 CPU, atomic
> works fine, but in detailed mode the system appears to hang shortly after
> boot-up. GEM5 doesn't crash, but the system stops producing any output.
> Looking at the stats, it appears that instructions are still being
> committed, but the actual applications/benchmarks are not making
> progress. The issue persists with the latest version of GEM5. I also
> tried two different kernel versions and several different disk images.
>
> I might be experiencing what seems to be the same issue that was found
> about a year ago but appears not to have been fixed:
> https://www.mail-archive.com/gem5-dev@gem5.org/msg08839.html
>
> Can anyone reproduce this or know of a solution?
>
> Thank you,
>
> Ivan
>
> --
> Joel Hestness
> PhD Student, Computer Architecture
> Dept. of Computer Science, University of Wisconsin - Madison
> http://pages.cs.wisc.edu/~hestness/
_______________________________________________
gem5-users mailing list
gem5-users@gem5.org
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users