Hi Joel,

I am using revision 10124. I removed all of my own modifications just to be
safe.

Running with gem5.opt and restoring from a boot-up checkpoint
with --debug-flag=Exec, it appears that the CPU is stuck in an
infinite loop, executing this sequence continuously:

5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0  :
CMP_M_I : limm   t2d, 0  : IntAlu :  D=0x0000000000000000
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1  :
CMP_M_I : ld   t1d, DS:[rdi] : MemRead :  D=0x00000000fffffffe
A=0xffffffff80822400
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2  :
CMP_M_I : sub   t0d, t1d, t2d : IntAlu :  D=0x0000000000000000
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0  :   JLE_I
: rdip   t1, %ctrl153,  : IntAlu :  D=0xffffffff80596897
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1  :   JLE_I
: limm   t2, 0xfffffffffffffff9 : IntAlu :  D=0xfffffffffffffff9
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.2  :   JLE_I
: wrip   , t1, t2  : IntAlu :
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+16    :   NOP
                   : IntAlu :
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0  :
CMP_M_I : limm   t2d, 0  : IntAlu :  D=0x0000000000000000
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1  :
CMP_M_I : ld   t1d, DS:[rdi] : MemRead :  D=0x00000000fffffffe
A=0xffffffff80822400
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2  :
CMP_M_I : sub   t0d, t1d, t2d : IntAlu :  D=0x0000000000000000
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0  :   JLE_I
: rdip   t1, %ctrl153,  : IntAlu :  D=0xffffffff80596897
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1  :   JLE_I
: limm   t2, 0xfffffffffffffff9 : IntAlu :  D=0xfffffffffffffff9
5268959012000: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2  :   JLE_I
: wrip   , t1, t2  : IntAlu :
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+16    :   NOP
                   : IntAlu :
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.0  :
CMP_M_I : limm   t2d, 0  : IntAlu :  D=0x0000000000000000
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.1  :
CMP_M_I : ld   t1d, DS:[rdi] : MemRead :  D=0x00000000fffffffe
A=0xffffffff80822400
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.2  :
CMP_M_I : sub   t0d, t1d, t2d : IntAlu :  D=0x0000000000000000
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.0  :   JLE_I
: rdip   t1, %ctrl153,  : IntAlu :  D=0xffffffff80596897
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.1  :   JLE_I
: limm   t2, 0xfffffffffffffff9 : IntAlu :  D=0xfffffffffffffff9
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2  :   JLE_I
: wrip   , t1, t2  : IntAlu :
5268959013000: system.switch_cpus1 T0 : @_spin_lock_irqsave+16    :   NOP
                   : IntAlu :

...and so on, repeating without stopping.

Using --debug-flag=LocalApic, the output does indeed stop shortly after
restoring from the checkpoint. The last output is:
..
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Generated regular interrupt fault
object.
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Interrupt 239 sent to core.
5269571169000: system.cpu1.interrupts: Writing Local APIC register 5 at
offset 0xb0 as 0.

...and no more output from this point on.

I appreciate your help tremendously.

Ivan



On Fri, May 16, 2014 at 11:18 AM, Joel Hestness <jthestn...@gmail.com> wrote:

> Hi Ivan,
>   I believe that the email thread you previously referenced was related to
> a bug that we identified and fixed with changeset 9624
> <http://permalink.gmane.org/gmane.comp.emulators.m5.devel/19326>.
>  That bug was causing interrupts to be dropped in x86 when running with the
> O3 CPU.  Are you using a version of gem5 after that changeset?  If not, I'd
> recommend updating to a more recent version and trying to replicate this
> issue again.
>
>   If you are using a more recent version of gem5, first, please let us
> know which changeset and whether you've made any changes.  Then, I'd
> recommend compiling gem5.opt and using the DPRINTF tracing functionality to
> see if you can zero in on what is happening.  Specifically, first try
> passing the flag --debug-flag=Exec to look at what the CPU cores are
> executing (you may also want to pass --trace-start=<<tick>> with a
> simulator tick time close to when the hang happens).  This trace will
> include Linux kernel symbols for at least some of the lines if executing in
> the kernel (e.g. handling an interrupt).  If you've compiled your benchmark
> without debugging symbols, it may just show the memory addresses of
> instructions being executed within the application.  I will guess that
> you'll see kernel symbols for at least some of the executed instructions
> for interrupts.
>
>   If it appears that the CPUs are continuing to execute, try running with
> --debug-flag=LocalApic.  This will print the interrupts that each core is
> receiving, and if it stops printing at any point, it means something has
> gone wrong and we'd have to do some deeper digging.
>
>   Keep us posted on what you find,
>   Joel
>
>
>
> On Thu, May 15, 2014 at 11:16 PM, Ivan Stalev <ids...@psu.edu> wrote:
>
>> Hi Joel,
>>
>> I have tried several different kernels and disk images, including the
>> default ones provided on the GEM5 website in the x86-system.tar.bz2
>> download. I run with these commands:
>>
>> build/X86/gem5.fast -d m5out/test_run configs/example/fs.py
>> --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp
>> -n 2 --mem-size=4GB --cpu-type=atomic --cpu-clock=2GHz
>> --script=rcs_scripts/run.rcS --max-checkpoints=1
>>
>> My run.rcS script simply contains:
>>
>> #!/bin/sh
>> /sbin/m5 resetstats
>> /sbin/m5 checkpoint
>> echo 'booted'
>> /extras/run
>> /sbin/m5 exit
>>
>> where "/extras/run" is simply a C program with an infinite loop that
>> prints a counter.
>>
>> I then restore:
>>
>> build/X86/gem5.fast -d m5out/test_run configs/example/fs.py
>> --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp
>> -r 1 -n 2 --mem-size=4GB --cpu-type=detailed --cpu-clock=2GHz --caches
>> --l2cache --num-l2caches=1 --l1d_size=32kB --l1i_size=32kB --l1d_assoc=4
>> --l1i_assoc=4 --l2_size=4MB --l2_assoc=8 --cacheline_size=64
>>
>> I specified the disk image file in Benchmarks.py. Restoring from the same
>> checkpoint and running in atomic mode works fine. I also tried booting the
>> system in detailed and letting it run for a while, but once it boots, there
>> is no more output. So it seems that checkpointing is not the issue. The
>> "run" program is just a dummy, and the same issue also persists when
>> running SPEC benchmarks or any other program.
>>
>> My dummy program is simply:
>>
>>     #include <stdio.h>
>>     int main(void) {
>>         int count = 0;
>>         printf("**************************** HEYY \n");
>>         while (1)
>>             printf("\n %d \n", count++);
>>     }
>>
>> Letting it run for a while, the only output is exactly this:
>>
>> booted
>> *******
>>
>> It doesn't even finish printing the output of the first printf.
>>
>> Another thing to add: in another scenario, I modified the kernel to call
>> an m5 pseudo instruction on every context switch, so that GEM5 prints a
>> line whenever a context switch occurs. Once again, in atomic mode this
>> worked as expected. However, in detailed mode, even GEM5's own output (a
>> printf inside GEM5 itself) stopped along with the system output in the
>> terminal.
>>
>> Thank you for your help.
>>
>> Ivan
>>
>>
>> On Thu, May 15, 2014 at 10:51 PM, Joel Hestness <jthestn...@gmail.com> wrote:
>>
>>> Hi Ivan,
>>>   Can you please give more detail on what you're running?  Specifically,
>>> can you give your command line, and which kernel, disk image you're using?
>>>  Are you using checkpointing?
>>>
>>>   Joel
>>>
>>>
>>> On Mon, May 12, 2014 at 10:52 AM, Ivan Stalev via gem5-users <
>>> gem5-users@gem5.org> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am running X86 in full system mode. When running just 1 CPU, both
>>>> atomic and detailed mode work fine. However, with more than 1 CPU, atomic
>>>> works fine, but in detailed mode the system appears to hang shortly after
>>>> boot-up. GEM5 doesn't crash, but the system stops producing any output.
>>>> Looking at the stats, it appears that instructions are still being
>>>> committed, but the actual applications/benchmarks are not making progress.
>>>> The issue persists with the latest version of GEM5. I also tried two
>>>> different kernel versions and several different disk images.
>>>>
>>>> I might be experiencing the same issue that was reported about a year
>>>> ago but appears not to have been fixed:
>>>> https://www.mail-archive.com/gem5-dev@gem5.org/msg08839.html
>>>>
>>>> Can anyone reproduce this or know of a solution?
>>>>
>>>> Thank you,
>>>>
>>>> Ivan
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> gem5-users mailing list
>>>> gem5-users@gem5.org
>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>>>
>>>
>>>
>>>
>>> --
>>>   Joel Hestness
>>>   PhD Student, Computer Architecture
>>>   Dept. of Computer Science, University of Wisconsin - Madison
>>>   http://pages.cs.wisc.edu/~hestness/
>>>
>>
>>
>
>
> --
>   Joel Hestness
>   PhD Student, Computer Architecture
>   Dept. of Computer Science, University of Wisconsin - Madison
>   http://pages.cs.wisc.edu/~hestness/
>
