Hi all,

I'm trying to simulate an x86 multicore system.
Currently, the simulator can boot with 16 timing cores, take a checkpoint,
and reload it with 16 detailed cores.
To test it, I ran `parsec -a run -p blackscholes -i simsmall -n 4`,
but it had not finished after 10 hours. It seems to make no actual
progress.
So I tried several configurations with different numbers of cores (3, 4, 8,
10, 12, 14, ...). None of them completed the boot process; each got stuck
at a different phase of it.

Further investigation revealed the problem. (Configuration:
kernel=linux-2.6.22.9 num_cores=4)

From the instruction trace near the point where it gets stuck, I found that
only one core is still alive, spinning on this code:

// (__smp_call_function: arch/x86_64/kernel/smp.c)

// 'data.started' is initialized to 0 before entering the loop, and
// 'cpus == 3'.

// Wait until data.started == cpus (3). atomic_read() is no different
// from an ordinary load.
while (atomic_read(&data.started) != cpus)
    cpu_relax();  // pause: nop with a spin-loop hint

The trace showed that 'data.started' had increased to 2 but not to 3.

I also inserted 'printk' calls at the 'atomic_read' loop and at
'atomic_inc'. Output on m5term:

read:0
inc:2
inc:1
inc:2
read:2
read:2
read:2
read:2
read:2
and so on...

instruction trace (format: CPUID:0xADDR:DISASSEMBLY):

2:0xffffffff80215de9:   INC_LOCKED_M.mfence
3:0xffffffff80215de2:  MOV_R_P : rdip   t7, %ctrl153,
0:0xffffffff80215ddf:  MFENCE
3:0xffffffff80215de2:  MOV_R_P : ld   rax, DS:[t7 + 0x5fcb2f]
2:0xffffffff80215de9:  INC_LOCKED_M : ldstl   t1d, DS:[rax + 0x10]:N
2:0xffffffff80215de9:  INC_LOCKED_M : addi   t1d, t1d, 0x1
1:0xffffffff802159e2:  CMP_M_R : ld   t1d, DS:[rsp + 0x10]:N
1:0xffffffff802159e2:  CMP_M_R : sub   t0d, t1d, ebx
3:0xffffffff80215de9:   INC_LOCKED_M.mfence
0:0xffffffff80215de2:  MOV_R_P : rdip   t7, %ctrl153,
0:0xffffffff80215de2:  MOV_R_P : ld   rax, DS:[t7 + 0x5fcb2f]
1:0xffffffff802159e6:  JNZ_I : rdip   t1, %ctrl153,
1:0xffffffff802159e6:  JNZ_I : limm   t2, 0xfffffffffffffff8
1:0xffffffff802159e6:  JNZ_I : wrip   , t1, t2
1:0xffffffff802159e0:  NOP
0:0xffffffff80215de9:   INC_LOCKED_M.mfence
2:0xffffffff80215de9:  INC_LOCKED_M : stul   t1d, DS:[rax + 0x10]:N
2:0xffffffff80215de9:   INC_LOCKED_M.mfence
3:0xffffffff80215de9:  INC_LOCKED_M : ldstl   t1d, DS:[rax + 0x10]:N
3:0xffffffff80215de9:  INC_LOCKED_M : addi   t1d, t1d, 0x1
2:0xffffffff80215ded:  CALL_NEAR_I : limm   t1, 0xffffffffffff243e
2:0xffffffff80215ded:  CALL_NEAR_I : rdip   t7, %ctrl153,
2:0xffffffff80215ded:  CALL_NEAR_I : st   t7, SS:[rsp + 0xfffffffffffffff8]
2:0xffffffff80215ded:  CALL_NEAR_I : subi   rsp, rsp, 0x8
2:0xffffffff80215ded:  CALL_NEAR_I : wrip   , t7, t1
0:0xffffffff80215de9:  INC_LOCKED_M : ldstl   t1d, DS:[rax + 0x10]:N
0:0xffffffff80215de9:  INC_LOCKED_M : addi   t1d, t1d, 0x1
1:0xffffffff802159e2:  CMP_M_R : ld   t1d, DS:[rsp + 0x10]:N
1:0xffffffff802159e2:  CMP_M_R : sub   t0d, t1d, ebx
1:0xffffffff802159e6:  JNZ_I : rdip   t1, %ctrl153,
1:0xffffffff802159e6:  JNZ_I : limm   t2, 0xfffffffffffffff8
1:0xffffffff802159e6:  JNZ_I : wrip   , t1, t2
2:0xffffffff80208230:  MOV_R_M : ld   rax, GS:[0]
3:0xffffffff80215de9:  INC_LOCKED_M : stul   t1d, DS:[rax + 0x10]:N
3:0xffffffff80215de9:   INC_LOCKED_M.mfence
1:0xffffffff802159e0:  NOP
0:0xffffffff80215de9:  INC_LOCKED_M : stul   t1d, DS:[rax + 0x10]:N
0:0xffffffff80215de9:   INC_LOCKED_M.mfence

disassembly of vmlinux:

ffffffff80215de9:   f0 ff 40 10             lock incl 0x10(%rax)


As you can see, core 2 performed ldstl (loads 0), addi (computes 1), and
stul (stores 1) with no interference.
However, core 0 executed its ldstl (loads 1) before core 3 executed its
stul (stores 2), so core 0 also stored 2 and one increment was lost. The
mfence microops did not make the instruction atomic at all. (In fact,
mFence::execute(...) in timing_simple_cpu_exec.cc does nothing.)
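
To make the interleaving concrete, here is a small C sketch that replays
the accesses to data.started in the order they appear in the trace. It is
not gem5 code: the names (struct step, replay, replay_traced_order,
replay_serialized_order) are made up for illustration, and the t1[] array
only models each core's private t1d register.

```c
#include <assert.h>

/* The three microops of "lock incl 0x10(%rax)" seen in the trace. */
enum uop { LDSTL, ADDI, STUL };

/* One replay step: which core executes which microop. */
struct step { int core; enum uop op; };

/* Replay the steps against one shared counter (standing in for
 * data.started, initially 0) and return its final value.
 * Nothing here enforces atomicity between LDSTL and STUL. */
static int replay(const struct step *steps, int n)
{
    int counter = 0;
    int t1[4] = {0};            /* per-core private t1d register */

    for (int i = 0; i < n; i++) {
        int c = steps[i].core;
        assert(c >= 0 && c < 4);
        switch (steps[i].op) {
        case LDSTL: t1[c] = counter; break;  /* load shared value   */
        case ADDI:  t1[c] += 1;      break;  /* increment privately */
        case STUL:  counter = t1[c]; break;  /* store back          */
        }
    }
    return counter;
}

/* Order of the data.started accesses as they appear in the trace. */
static int replay_traced_order(void)
{
    static const struct step s[] = {
        {2, LDSTL}, {2, ADDI}, {2, STUL},   /* core 2: 0 -> 1          */
        {3, LDSTL}, {3, ADDI},              /* core 3 loads 1          */
        {0, LDSTL}, {0, ADDI},              /* core 0 also loads 1     */
        {3, STUL},                          /* core 3 stores 2         */
        {0, STUL},                          /* core 0 stores 2 again   */
    };
    return replay(s, (int)(sizeof s / sizeof s[0]));
}

/* Same increments, but each core finishes load/add/store before the
 * next one starts, which is what a working lock prefix would give. */
static int replay_serialized_order(void)
{
    static const struct step s[] = {
        {2, LDSTL}, {2, ADDI}, {2, STUL},
        {3, LDSTL}, {3, ADDI}, {3, STUL},
        {0, LDSTL}, {0, ADDI}, {0, STUL},
    };
    return replay(s, (int)(sizeof s / sizeof s[0]));
}
```

replay_traced_order() returns 2, which matches the value the spinning core
keeps reading, while replay_serialized_order() returns 3, the value the
loop is waiting for.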

Is there any problem with my explanation? If not, I'll try to fix it, even
though it does not look easy to me. Any advice is welcome.
Thanks,
Jae-eon Jo.