Re: [gem5-dev] Atomic operations in X86 multicore FS has a bug

Jae-eon Jo Thu, 07 Feb 2013 23:29:27 -0800

Thanks, Steve.
I now see that the gem5 classic memory system has special treatment for
LL, SC, and Xchg, which makes some ISAs work nicely but not others.
Thanks for your suggestion, but switching from classic memory to Ruby is
not an option for me, because I am not familiar with the overall gem5 system.
Using LL/SC inside the X86 atomic operations is the easiest solution, but it
would require a retry loop and could result in livelock.
Implementing a generic atomic fetch-op may be the best approach, because I
could use the implementation of those three RMW instructions as a reference.
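
To make the trade-off concrete, here is a minimal sketch in plain C++
(std::atomic), not gem5 code. All function names are hypothetical.
replay_lost_update() replays the core2/core3/core0 interleaving from the
trace quoted below using ordinary loads and stores; inc_llsc_style() uses
compare_exchange_weak as a stand-in for an LL/SC pair, which is an assumption
about how such a loop would look; inc_fetch_op() is the single fetch-op
alternative.

```cpp
#include <atomic>
#include <cassert>

// Replays the interleaving seen in the trace with plain, non-atomic steps.
// Each "core" does load / add 1 / store on the shared counter.
int replay_lost_update() {
    int started = 0;
    int t2 = started; t2 += 1; started = t2;  // core2: ldstl(0), addi, stul(1)
    int t3 = started; t3 += 1;                // core3: ldstl(1), addi -> 2
    int t0 = started; t0 += 1;                // core0: ldstl(1) before core3 stores
    started = t3;                             // core3: stul(2)
    started = t0;                             // core0: stul(2), one update is lost
    return started;                           // 2 instead of 3
}

// LL/SC-style increment: read, compute, then try to publish the result.
// compare_exchange_weak may fail (also spuriously), so a retry loop is needed;
// under heavy contention the loop may spin for a long time.
int inc_llsc_style(std::atomic<int>& counter) {
    int seen = counter.load();
    while (!counter.compare_exchange_weak(seen, seen + 1)) {
        // On failure 'seen' now holds the current value; retry with it.
    }
    return seen;  // value before the increment
}

// Fetch-op increment: one indivisible read-modify-write, no retry loop.
int inc_fetch_op(std::atomic<int>& counter) {
    return counter.fetch_add(1);  // value before the increment
}

// Single-threaded sanity check of the three variants.
void check_demo() {
    assert(replay_lost_update() == 2);
    std::atomic<int> started{0};
    for (int core = 0; core < 3; ++core) {
        inc_fetch_op(started);
    }
    assert(started.load() == 3);
}
```

The retry loop in inc_llsc_style() is what I mean by "require loop": every
failed store-conditional costs another round trip, while the fetch-op
version completes in one step.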


Best regards.
Jae-eon Jo.

2013/2/8 Steve Reinhardt <[email protected]>

> Unfortunately it's a known limitation that x86 locked accesses don't work
> in the gem5 classic memory system.  This fact is documented (somewhat
> obscurely) here: http://gem5.org/Status_Matrix.
>
> If you'd be willing to take a stab at implementing this, that would be
> great.  Your other options are either to switch to an ISA that uses LL/SC
> rather than locked accesses (like ARM or Alpha) or to switch from the
> classic memory system to Ruby (which does support locked accesses).
>
> Steve
>
>
>
> On Thu, Feb 7, 2013 at 9:21 PM, Jae-eon Jo <[email protected]> wrote:
>
> > Hi all,
> >
> > I'm trying to simulate an X86 multicore system.
> > Currently, the simulator can boot with 16 timing cores, take a checkpoint,
> > and restore with 16 detailed cores.
> > To test it, I ran `parsec -a run -p blackscholes -i simsmall -n 4`,
> > but it had not finished after 10 hours. It seems to make no actual
> > progress.
> > So I tried several configurations with different numbers of cores (3, 4,
> > 8, 10, 12, 14, ...). None of them completed the boot process; each got
> > stuck at a different phase of it.
> >
> > Further investigation revealed the problem. (Configuration:
> > kernel=linux-2.6.22.9 num_cores=4)
> >
> > From the instruction trace near the point where it gets stuck, I found
> > that only one core is alive, spinning on this code:
> >
> > // (__smp_call_function:arch/x86_64/kernel/smp.c)
> >
> > // 'data.started' is initialized to 0 before entering the loop, and
> > // 'cpus == 3'.
> >
> > while (atomic_read(&data.started) != cpus)  // wait until data.started == 3;
> >                                             // atomic_read is an ordinary load
> >     cpu_relax();  // pause: nop with spinning hint
> >
> > The trace showed that 'data.started' had increased to 2 but not to 3.
> >
> > I also inserted 'printk' calls in the 'atomic_read' loop and at
> > 'atomic_inc'. Output on m5term:
> >
> > read:0
> > inc:2
> > inc:1
> > inc:2
> > read:2
> > read:2
> > read:2
> > read:2
> > read:2
> > and so on...
> >
> > instruction trace (format: CPUID:0xADDR:DISASSEMBLY):
> >
> > 2:0xffffffff80215de9:   INC_LOCKED_M.mfence
> > 3:0xffffffff80215de2:  MOV_R_P : rdip   t7, %ctrl153,
> > 0:0xffffffff80215ddf:  MFENCE
> > 3:0xffffffff80215de2:  MOV_R_P : ld   rax, DS:[t7 + 0x5fcb2f]
> > 2:0xffffffff80215de9:  INC_LOCKED_M : ldstl   t1d, DS:[rax + 0x10]:N
> > 2:0xffffffff80215de9:  INC_LOCKED_M : addi   t1d, t1d, 0x1
> > 1:0xffffffff802159e2:  CMP_M_R : ld   t1d, DS:[rsp + 0x10]:N
> > 1:0xffffffff802159e2:  CMP_M_R : sub   t0d, t1d, ebx
> > 3:0xffffffff80215de9:   INC_LOCKED_M.mfence
> > 0:0xffffffff80215de2:  MOV_R_P : rdip   t7, %ctrl153,
> > 0:0xffffffff80215de2:  MOV_R_P : ld   rax, DS:[t7 + 0x5fcb2f]
> > 1:0xffffffff802159e6:  JNZ_I : rdip   t1, %ctrl153,
> > 1:0xffffffff802159e6:  JNZ_I : limm   t2, 0xfffffffffffffff8
> > 1:0xffffffff802159e6:  JNZ_I : wrip   , t1, t2
> > 1:0xffffffff802159e0:  NOP
> > 0:0xffffffff80215de9:   INC_LOCKED_M.mfence
> > 2:0xffffffff80215de9:  INC_LOCKED_M : stul   t1d, DS:[rax + 0x10]:N
> > 2:0xffffffff80215de9:   INC_LOCKED_M.mfence
> > 3:0xffffffff80215de9:  INC_LOCKED_M : ldstl   t1d, DS:[rax + 0x10]:N
> > 3:0xffffffff80215de9:  INC_LOCKED_M : addi   t1d, t1d, 0x1
> > 2:0xffffffff80215ded:  CALL_NEAR_I : limm   t1, 0xffffffffffff243e
> > 2:0xffffffff80215ded:  CALL_NEAR_I : rdip   t7, %ctrl153,
> > 2:0xffffffff80215ded:  CALL_NEAR_I : st   t7, SS:[rsp + 0xfffffffffffffff8]
> > 2:0xffffffff80215ded:  CALL_NEAR_I : subi   rsp, rsp, 0x8
> > 2:0xffffffff80215ded:  CALL_NEAR_I : wrip   , t7, t1
> > 0:0xffffffff80215de9:  INC_LOCKED_M : ldstl   t1d, DS:[rax + 0x10]:N
> > 0:0xffffffff80215de9:  INC_LOCKED_M : addi   t1d, t1d, 0x1
> > 1:0xffffffff802159e2:  CMP_M_R : ld   t1d, DS:[rsp + 0x10]:N
> > 1:0xffffffff802159e2:  CMP_M_R : sub   t0d, t1d, ebx
> > 1:0xffffffff802159e6:  JNZ_I : rdip   t1, %ctrl153,
> > 1:0xffffffff802159e6:  JNZ_I : limm   t2, 0xfffffffffffffff8
> > 1:0xffffffff802159e6:  JNZ_I : wrip   , t1, t2
> > 2:0xffffffff80208230:  MOV_R_M : ld   rax, GS:[0]
> > 3:0xffffffff80215de9:  INC_LOCKED_M : stul   t1d, DS:[rax + 0x10]:N
> > 3:0xffffffff80215de9:   INC_LOCKED_M.mfence
> > 1:0xffffffff802159e0:  NOP
> > 0:0xffffffff80215de9:  INC_LOCKED_M : stul   t1d, DS:[rax + 0x10]:N
> > 0:0xffffffff80215de9:   INC_LOCKED_M.mfence
> >
> > disassembly of vmlinux:
> >
> > ffffffff80215de9:   f0 ff 40 10             lock incl 0x10(%rax)
> >
> >
> > As you can see, core2 did ldstl(load0)/addi(set1)/stul(store1) with no
> > interference.
> > However, before core3 did stul(store2), core0 did ldstl(load1), so core0
> > also ended with stul(store2). The INC_LOCKED_M.mfence microops did not
> > provide atomicity for the instruction at all. (In fact,
> > mFence::execute(...) in timing_simple_cpu_exec.cc does nothing.)
> >
> > Is there any problem with my explanation? If not, I'll try to fix it,
> > even though it does not seem easy for me. Any advice is welcome.
> > Thanks,
> > Jae-eon Jo.
> > _______________________________________________
> > gem5-dev mailing list
> > [email protected]
> > http://m5sim.org/mailman/listinfo/gem5-dev
> >