Thanks, Steve. I have now found that the gem5 classic memory system has special treatment for LL, SC, and Xchg, which works nicely for some ISAs but not for others. Thanks for your suggestion, but switching from the classic memory system to Ruby is not an option for me, because I am not familiar with the overall gem5 system. Using LL/SC inside the X86 atomic operations would be the easiest solution, but it requires a retry loop and could result in livelock. Implementing a generic atomic fetch-op is probably the best approach, because I can use the implementation of those 3 RMW instructions as a reference.
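To make the trade-off above concrete, here is a minimal sketch in standard C++. It is not gem5 code: `std::atomic` compare-and-swap is used only as a stand-in for an LL/SC pair, and `fetch_add` as a stand-in for a generic fetch-op. The function names (`inc_with_retry`, `inc_fetch_op`, `replay_lost_update`, `check_all`) are hypothetical and chosen for illustration. The last function replays, with plain variables, the load/store order reported in the instruction trace quoted below.

```cpp
#include <atomic>
#include <cassert>

// Stand-in for an LL/SC increment: load, compute, then store only if the
// location still holds the value that was loaded. compare_exchange_weak may
// fail spuriously, much like a store-conditional, so a retry loop is needed.
int inc_with_retry(std::atomic<int>& counter) {
    int seen = counter.load();
    while (!counter.compare_exchange_weak(seen, seen + 1)) {
        // On failure 'seen' has been refreshed with the current value; retry.
    }
    return seen;  // value before the increment
}

// Stand-in for a generic fetch-op: one indivisible read-modify-write,
// with no retry loop visible to the caller.
int inc_fetch_op(std::atomic<int>& counter) {
    return counter.fetch_add(1);  // returns the value before the increment
}

// Non-atomic load/add/store replayed in the order seen in the trace:
// two cores load the same value before either of them stores.
int replay_lost_update() {
    int started = 1;      // value after core2's increment completed
    int core3 = started;  // core3: ldstl loads 1
    int core0 = started;  // core0: ldstl loads 1 before core3 stores
    started = core3 + 1;  // core3: stul stores 2
    started = core0 + 1;  // core0: stul stores 2 again
    return started;       // 2, although three increments were executed
}

// Small self-check that exercises all three functions.
void check_all() {
    std::atomic<int> counter{0};
    assert(inc_with_retry(counter) == 0);
    assert(inc_fetch_op(counter) == 1);
    assert(counter.load() == 2);
    assert(replay_lost_update() == 2);
}
```

Both atomic variants are safe to call from several threads at once. The retry version can in principle keep looping while other writers keep succeeding, which is the livelock concern mentioned above; the fetch-op version leaves no window between the load and the store.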
Best regards,
Jae-eon Jo

2013/2/8 Steve Reinhardt <[email protected]>

> Unfortunately it's a known limitation that x86 locked accesses don't work
> in the gem5 classic memory system. This fact is documented (somewhat
> obscurely) here: http://gem5.org/Status_Matrix.
>
> If you'd be willing to take a stab at implementing this, that would be
> great. Your other options are either to switch to an ISA that uses LL/SC
> rather than locked accesses (like ARM or Alpha) or to switch from the
> classic memory system to Ruby (which does support locked accesses).
>
> Steve
>
> On Thu, Feb 7, 2013 at 9:21 PM, Jae-eon Jo <[email protected]> wrote:
>
> > Hi all,
> >
> > I'm trying to simulate an X86 multicore system.
> > Currently, the simulator can boot with a 16-core timing configuration,
> > make a checkpoint, and reload with a 16-core detailed configuration.
> > To test it, I executed `parsec -a run -p blackscholes -i simsmall -n 4`,
> > but it had not finished after 10 hours. It seems to make no actual
> > progress.
> > So, I tried several configurations with different numbers of cores
> > (3, 4, 8, 10, 12, 14, ...). None of them completed the boot process;
> > each got stuck at a different phase of the process.
> >
> > Further investigation revealed the problem.
> > (Configuration: kernel=linux-2.6.22.9 num_cores=4)
> >
> > From the instruction trace near the point where it gets stuck, I found
> > that only one core is alive, spinning on this code:
> >
> > // (__smp_call_function:arch/x86_64/kernel/smp.c)
> >
> > // 'data.started' is initialized as 0 before entering the loop, and
> > // 'cpus == 3'.
> >
> > while (atomic_read(&data.started) != cpus) // wait until data.started == cpus (3); no different from an ordinary load.
> >     cpu_relax(); // pause: nop with spinning hint
> >
> > The trace showed that 'data.started' had increased to 2 but not to 3.
> >
> > I also inserted 'printk' calls in the 'atomic_read' loop and at
> > 'atomic_inc'.
> >
> > m5term:
> >
> > read:0
> > inc:2
> > inc:1
> > inc:2
> > read:2
> > read:2
> > read:2
> > read:2
> > read:2
> > and so on...
> >
> > instruction trace (format: CPUID:0xADDR:DISASSEMBLY):
> >
> > 2:0xffffffff80215de9: INC_LOCKED_M.mfence
> > 3:0xffffffff80215de2: MOV_R_P : rdip t7, %ctrl153,
> > 0:0xffffffff80215ddf: MFENCE
> > 3:0xffffffff80215de2: MOV_R_P : ld rax, DS:[t7 + 0x5fcb2f]
> > 2:0xffffffff80215de9: INC_LOCKED_M : ldstl t1d, DS:[rax + 0x10]:N
> > 2:0xffffffff80215de9: INC_LOCKED_M : addi t1d, t1d, 0x1
> > 1:0xffffffff802159e2: CMP_M_R : ld t1d, DS:[rsp + 0x10]:N
> > 1:0xffffffff802159e2: CMP_M_R : sub t0d, t1d, ebx
> > 3:0xffffffff80215de9: INC_LOCKED_M.mfence
> > 0:0xffffffff80215de2: MOV_R_P : rdip t7, %ctrl153,
> > 0:0xffffffff80215de2: MOV_R_P : ld rax, DS:[t7 + 0x5fcb2f]
> > 1:0xffffffff802159e6: JNZ_I : rdip t1, %ctrl153,
> > 1:0xffffffff802159e6: JNZ_I : limm t2, 0xfffffffffffffff8
> > 1:0xffffffff802159e6: JNZ_I : wrip , t1, t2
> > 1:0xffffffff802159e0: NOP
> > 0:0xffffffff80215de9: INC_LOCKED_M.mfence
> > 2:0xffffffff80215de9: INC_LOCKED_M : stul t1d, DS:[rax + 0x10]:N
> > 2:0xffffffff80215de9: INC_LOCKED_M.mfence
> > 3:0xffffffff80215de9: INC_LOCKED_M : ldstl t1d, DS:[rax + 0x10]:N
> > 3:0xffffffff80215de9: INC_LOCKED_M : addi t1d, t1d, 0x1
> > 2:0xffffffff80215ded: CALL_NEAR_I : limm t1, 0xffffffffffff243e
> > 2:0xffffffff80215ded: CALL_NEAR_I : rdip t7, %ctrl153,
> > 2:0xffffffff80215ded: CALL_NEAR_I : st t7, SS:[rsp + 0xfffffffffffffff8]
> > 2:0xffffffff80215ded: CALL_NEAR_I : subi rsp, rsp, 0x8
> > 2:0xffffffff80215ded: CALL_NEAR_I : wrip , t7, t1
> > 0:0xffffffff80215de9: INC_LOCKED_M : ldstl t1d, DS:[rax + 0x10]:N
> > 0:0xffffffff80215de9: INC_LOCKED_M : addi t1d, t1d, 0x1
> > 1:0xffffffff802159e2: CMP_M_R : ld t1d, DS:[rsp + 0x10]:N
> > 1:0xffffffff802159e2: CMP_M_R : sub t0d, t1d, ebx
> > 1:0xffffffff802159e6: JNZ_I : rdip t1, %ctrl153,
> > 1:0xffffffff802159e6: JNZ_I : limm t2, 0xfffffffffffffff8
> > 1:0xffffffff802159e6: JNZ_I : wrip , t1, t2
> > 2:0xffffffff80208230: MOV_R_M : ld rax, GS:[0]
> > 3:0xffffffff80215de9: INC_LOCKED_M : stul t1d, DS:[rax + 0x10]:N
> > 3:0xffffffff80215de9: INC_LOCKED_M.mfence
> > 1:0xffffffff802159e0: NOP
> > 0:0xffffffff80215de9: INC_LOCKED_M : stul t1d, DS:[rax + 0x10]:N
> > 0:0xffffffff80215de9: INC_LOCKED_M.mfence
> >
> > disassembly of vmlinux:
> >
> > ffffffff80215de9: f0 ff 40 10 lock incl 0x10(%rax)
> >
> > As you can see, core2 did ldstl(load0)/addi(set1)/stul(store1) with no
> > interference.
> > However, before core3 did stul(store2), core0 did ldstl(load1), so core0
> > also ended with stul(store2). M.mfence did not provide atomicity for the
> > instruction at all. (Actually, mFence::execute(...) in
> > timing_simple_cpu_exec.cc does nothing.)
> >
> > Is there any problem with my explanation? If not, I'll try to fix it,
> > even though it does not look easy for me. Any advice is welcome.
> >
> > Thanks,
> > Jae-eon Jo.

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
