Hi all,
I'm trying to simulate an x86 multicore system.
Currently, the simulator can boot with 16 timing cores, take a checkpoint,
and restore it with 16 detailed cores.
To test it, I ran `parsec -a run -p blackscholes -i simsmall -n 4`,
but it had not finished after 10 hours. It seems to make no actual
progress.
So I tried several configurations with different numbers of cores (3, 4, 8,
10, 12, 14, ...). None of them completed the boot process; each got stuck at
a different phase of it.
Further investigation revealed the problem. (Configuration:
kernel=linux-2.6.22.9 num_cores=4)
From the instruction trace near the point where it gets stuck, I found that
only one core is alive, spinning on this code:
// (__smp_call_function: arch/x86_64/kernel/smp.c)
// 'data.started' is initialized to 0 before entering the loop, and 'cpus == 3'.
// Wait until data.started == cpus (3). atomic_read is no different from an
// ordinary load.
while (atomic_read(&data.started) != cpus)
        cpu_relax(); // pause: nop with spinning hint
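
To spell out what this loop relies on, here is a minimal user-space sketch in C11. It is not the kernel code: `struct call_data`, `responder_ack` and `all_started` are hypothetical names made up for illustration, and `<stdatomic.h>` stands in for the kernel's `atomic_t` helpers. The point is only that the initiator's exit condition becomes true if and only if every responder's increment is applied as one indivisible read-modify-write.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's call data; only the counter is modeled. */
struct call_data {
    atomic_int started;
};

/* What each responding CPU is expected to do: one indivisible increment
 * (the role played by "lock incl 0x10(%rax)" in the kernel). */
void responder_ack(struct call_data *d)
{
    assert(d != NULL);
    atomic_fetch_add(&d->started, 1);
}

/* The initiator's exit condition: true once 'cpus' responders have acked. */
bool all_started(struct call_data *d, int cpus)
{
    assert(d != NULL);
    return atomic_load(&d->started) == cpus;
}
```

With three responders, `all_started(&d, 3)` stays false after two calls to `responder_ack` and becomes true after the third. If even one increment is lost, the condition can never become true and the initiator spins forever.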
The trace showed that 'data.started' had increased to 2 but never to 3.
I also inserted 'printk' calls in the 'atomic_read' loop and at 'atomic_inc'.
m5term:
read:0
inc:2
inc:1
inc:2
read:2
read:2
read:2
read:2
read:2
and so on...
instruction trace (format: CPUID:0xADDR:DISASSEMBLY):
2:0xffffffff80215de9: INC_LOCKED_M.mfence
3:0xffffffff80215de2: MOV_R_P : rdip t7, %ctrl153,
0:0xffffffff80215ddf: MFENCE
3:0xffffffff80215de2: MOV_R_P : ld rax, DS:[t7 + 0x5fcb2f]
2:0xffffffff80215de9: INC_LOCKED_M : ldstl t1d, DS:[rax + 0x10]:N
2:0xffffffff80215de9: INC_LOCKED_M : addi t1d, t1d, 0x1
1:0xffffffff802159e2: CMP_M_R : ld t1d, DS:[rsp + 0x10]:N
1:0xffffffff802159e2: CMP_M_R : sub t0d, t1d, ebx
3:0xffffffff80215de9: INC_LOCKED_M.mfence
0:0xffffffff80215de2: MOV_R_P : rdip t7, %ctrl153,
0:0xffffffff80215de2: MOV_R_P : ld rax, DS:[t7 + 0x5fcb2f]
1:0xffffffff802159e6: JNZ_I : rdip t1, %ctrl153,
1:0xffffffff802159e6: JNZ_I : limm t2, 0xfffffffffffffff8
1:0xffffffff802159e6: JNZ_I : wrip , t1, t2
1:0xffffffff802159e0: NOP
0:0xffffffff80215de9: INC_LOCKED_M.mfence
2:0xffffffff80215de9: INC_LOCKED_M : stul t1d, DS:[rax + 0x10]:N
2:0xffffffff80215de9: INC_LOCKED_M.mfence
3:0xffffffff80215de9: INC_LOCKED_M : ldstl t1d, DS:[rax + 0x10]:N
3:0xffffffff80215de9: INC_LOCKED_M : addi t1d, t1d, 0x1
2:0xffffffff80215ded: CALL_NEAR_I : limm t1, 0xffffffffffff243e
2:0xffffffff80215ded: CALL_NEAR_I : rdip t7, %ctrl153,
2:0xffffffff80215ded: CALL_NEAR_I : st t7, SS:[rsp + 0xfffffffffffffff8]
2:0xffffffff80215ded: CALL_NEAR_I : subi rsp, rsp, 0x8
2:0xffffffff80215ded: CALL_NEAR_I : wrip , t7, t1
0:0xffffffff80215de9: INC_LOCKED_M : ldstl t1d, DS:[rax + 0x10]:N
0:0xffffffff80215de9: INC_LOCKED_M : addi t1d, t1d, 0x1
1:0xffffffff802159e2: CMP_M_R : ld t1d, DS:[rsp + 0x10]:N
1:0xffffffff802159e2: CMP_M_R : sub t0d, t1d, ebx
1:0xffffffff802159e6: JNZ_I : rdip t1, %ctrl153,
1:0xffffffff802159e6: JNZ_I : limm t2, 0xfffffffffffffff8
1:0xffffffff802159e6: JNZ_I : wrip , t1, t2
2:0xffffffff80208230: MOV_R_M : ld rax, GS:[0]
3:0xffffffff80215de9: INC_LOCKED_M : stul t1d, DS:[rax + 0x10]:N
3:0xffffffff80215de9: INC_LOCKED_M.mfence
1:0xffffffff802159e0: NOP
0:0xffffffff80215de9: INC_LOCKED_M : stul t1d, DS:[rax + 0x10]:N
0:0xffffffff80215de9: INC_LOCKED_M.mfence
disassembly of vmlinux:
ffffffff80215de9: f0 ff 40 10 lock incl 0x10(%rax)
As you can see, core 2 did ldstl (loads 0), addi (computes 1) and stul
(stores 1) with no interference.
However, before core 3 did its stul (stores 2), core 0 did its ldstl
(loads 1), so core 0's stul also stored 2 and one increment was lost.
The INC_LOCKED_M.mfence microops did not provide atomicity for the
instruction at all. (Actually, mFence::execute(...) in
timing_simple_cpu_exec.cc does nothing.)
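
To make the interleaving concrete, here is a small sketch in C that replays the ldstl/addi/stul order of cores 2, 3 and 0 from the trace against a single counter. It is only a toy model, not gem5 code: `replay`, `trace_order`, the `honor_lock` flag and the "stalled core redoes its increment after the unlock" rule are all assumptions made for illustration. With the lock ignored, the replay ends at 2, as observed; with the lock honored, it ends at 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCPUS 4

/* The three microops of "lock incl" as they appear in the trace. */
enum uop { LDSTL, ADDI, STUL };
enum core_state { RUNNING, STALLED, DONE };

struct step { int cpu; enum uop op; };

/* Order of the microops touching data.started, copied from the trace. */
static const struct step trace_order[] = {
    {2, LDSTL}, {2, ADDI}, {2, STUL},
    {3, LDSTL}, {3, ADDI},
    {0, LDSTL}, {0, ADDI},
    {3, STUL},
    {0, STUL},
};
static const size_t trace_len = sizeof trace_order / sizeof trace_order[0];

/* Replays a schedule against one shared counter and returns its final value.
 * honor_lock == false: ldstl/stul behave like a plain load/store.
 * honor_lock == true:  (assumed model) an ldstl that finds the location
 * locked stalls its core; the core's whole increment runs after the unlock. */
int replay(const struct step *s, size_t n, bool honor_lock)
{
    int counter = 0;                           /* models data.started */
    int t1[NCPUS] = {0};                       /* per-core temporary t1d */
    enum core_state state[NCPUS] = {RUNNING};  /* all cores start RUNNING */
    int owner = -1;                            /* lock holder, -1 if free */

    for (size_t i = 0; i < n; i++) {
        int c = s[i].cpu;
        assert(c >= 0 && c < NCPUS);
        if (state[c] != RUNNING)
            continue;  /* stalled or finished: skip its remaining microops */
        switch (s[i].op) {
        case LDSTL:
            if (honor_lock && owner != -1) {
                state[c] = STALLED;
            } else {
                owner = c;
                t1[c] = counter;
            }
            break;
        case ADDI:
            t1[c] += 1;
            break;
        case STUL:
            counter = t1[c];
            owner = -1;
            /* Unlock: every stalled core now performs its increment. */
            for (int k = 0; k < NCPUS; k++) {
                if (state[k] == STALLED) {
                    counter += 1;
                    state[k] = DONE;
                }
            }
            break;
        }
    }
    return counter;
}
```

Calling `replay(trace_order, trace_len, false)` reproduces the lost update (final value 2), while `replay(trace_order, trace_len, true)` yields 3, the value the spin loop is waiting for.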
Is there any problem with my explanation? If not, I'll try to fix it,
although it does not look easy to me. Any advice is welcome.
Thanks,
Jae-eon Jo.