Hi,

I have a patch that fixes this in both the classic and Ruby memory
systems.  I was waiting for another student (Dibakar, who runs a lot more
parallel code than I do) to test it out before submitting it to the
review board.  I'll bug him and see if he's tested it yet.

On Thu, Oct 11, 2012 at 7:32 PM, Ali Saidi <[email protected]> wrote:

> Hi Mitch,
>
> Did you end up getting it working?
>
> Thanks,
> Ali
>
> On Sep 26, 2012, at 3:39 PM, Steve Reinhardt wrote:
>
> That's a reasonable hardware implementation.  Actually, you need a
> register per hardware thread context, not just one per core.
>
> Our software implementation is intended to model such a hardware
> implementation, but the actual software is different for a couple of
> reasons.  The main one is that we don't want to do two address-based
> lookups on every access; CAMs are much cheaper in HW than in SW.
>  Associating the LL state with each cache block means you can check the
> lock state much more cheaply than iterating over a set of lock registers,
> particularly in the common case where there are no locks.  Also, the cache
> typically doesn't know how many CPUs or SMT thread contexts it's
> supporting, so it's tricky to allocate the right number of registers, and
> the block-based model avoids this problem.
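>
> To make the tradeoff concrete, here is a minimal sketch (hypothetical
> names, not the actual gem5 classes) of the block-based approach: in the
> common case of no outstanding LL, a store pays for a single empty-list
> check instead of a scan over per-context lock registers:
>
>     #include <vector>
>
>     struct LockEntry {
>         int contextId;  // HW thread context that issued the LL
>     };
>
>     struct CacheBlock {
>         std::vector<LockEntry> lockList;  // empty in the common case
>
>         void handleStore() {
>             if (lockList.empty())
>                 return;           // common case: one cheap check
>             lockList.clear();     // conservatively clear reservations
>         }
>     };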
>
> I think you're right: the one thing we're not emulating properly is that
> the recorded lock range should be tight, not implicitly expanded to
> cover the whole block as we've done.  So you've convinced me that it's
> not just the most straightforward fix, but probably the right one.
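>
> As a sketch of that fix (again with hypothetical names), the lock entry
> would record the LL's exact byte range, taken from the request, and a
> store would only clear entries it truly overlaps:
>
>     #include <algorithm>
>     #include <cstdint>
>     #include <vector>
>
>     struct LockEntry {
>         int contextId;
>         uint64_t addr;   // start of the LL access (from the request)
>         unsigned size;   // its size in bytes, not rounded to the block
>
>         bool overlaps(uint64_t stAddr, unsigned stSize) const {
>             return stAddr < addr + size && addr < stAddr + stSize;
>         }
>     };
>
>     // Clear only the reservations the store truly overlaps.
>     void clearMatching(std::vector<LockEntry> &locks,
>                        uint64_t stAddr, unsigned stSize)
>     {
>         locks.erase(std::remove_if(locks.begin(), locks.end(),
>                         [&](const LockEntry &l) {
>                             return l.overlaps(stAddr, stSize);
>                         }),
>                     locks.end());
>     }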
>
> If you get it working, please submit the patch.
>
> Thanks!
>
> Steve
>
> On Wed, Sep 26, 2012 at 1:25 PM, Mitch Hayenga <
> [email protected]> wrote:
>
>> Hmm, I had always thought that LL/SC was handled with special
>> address-range registers at the cache controller.  Since a core should
>> really only have one outstanding LL/SC pair, a register per core would
>> suffice and would exactly encode the range.  That would accomplish the
>> same thing as your finer-grained locks within the cache block.
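>>
>> As a rough sketch of that scheme (hypothetical names, one reservation
>> register per context at the controller):
>>
>>     #include <cstdint>
>>     #include <vector>
>>
>>     struct Reservation {
>>         bool valid = false;
>>         uint64_t addr = 0;  // exact LL range, not the whole block
>>         unsigned size = 0;
>>     };
>>
>>     struct Controller {
>>         // Sized to the number of thread contexts elsewhere.
>>         std::vector<Reservation> resv;
>>
>>         void onLoadLinked(int ctx, uint64_t a, unsigned s) {
>>             resv[ctx] = {true, a, s};
>>         }
>>
>>         // Only a store that truly overlaps a reservation clears it.
>>         void onStore(uint64_t a, unsigned s) {
>>             for (auto &r : resv)
>>                 if (r.valid && a < r.addr + r.size && r.addr < a + s)
>>                     r.valid = false;
>>         }
>>     };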
>>
>>
>>> On Wed, Sep 26, 2012 at 3:08 PM, Steve Reinhardt <[email protected]> wrote:
>>
>>> This is a pretty interesting issue.  I'm not sure how it would be
>>> handled in practice.  Since the loads and stores in question are not to
>>> the same address, in theory at least the store-set predictor should not
>>> be involved.  My guess is that the most straightforward fix would be to
>>> record the actual range of the LL in the request structure and only
>>> clear the lock flag on a store if the store truly overlaps it (not just
>>> if it's to the same block).
>>>
>>> Steve
>>>
>>>
>>> On Wed, Sep 26, 2012 at 12:50 PM, Mitch Hayenga <
>>> [email protected]> wrote:
>>>
>>>> Thanks for the reply.
>>>>
>>>> Thinking about this... I don't know too much about the O3 store-set
>>>> predictor, but it would seem that load-linked instructions should care
>>>> about the entire cache line, not just whether a store happens to
>>>> overlap.  In this case, the pending stores write to the address range
>>>> [0xf9c2c-0xf9c33], but the load-linked is to [0xf9c28-0xf9c2b]
>>>> (non-overlapping, but on the same cache line).  So the load issues
>>>> early, and the stores come in and clear the lock from the cache line.
>>>> Either non-LL/SC stores (from the same core) shouldn't clear the locks
>>>> on a cache line (src/cache/blk.hh:279), or the store-set predictor
>>>> should hold the load-linked until the stores (to the same cache line,
>>>> but not overlapping) have written back.  Dibakar, another grad student
>>>> here, says this impacts Ruby as well.
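>>>>
>>>> For concreteness (assuming 64-byte lines), both ranges fall in the
>>>> line starting at 0xf9c00, but the byte ranges themselves are disjoint:
>>>>
>>>>     #include <cassert>
>>>>     #include <cstdint>
>>>>
>>>>     int main() {
>>>>         const uint64_t lineMask = ~uint64_t(63);  // 64-byte lines
>>>>         uint64_t llAddr = 0xf9c28, llEnd = 0xf9c2b;  // load-linked
>>>>         uint64_t stAddr = 0xf9c2c, stEnd = 0xf9c33;  // pending stores
>>>>
>>>>         assert((llAddr & lineMask) == (stAddr & lineMask)); // same line
>>>>         assert(llEnd < stAddr || stEnd < llAddr);  // but no overlap
>>>>         return 0;
>>>>     }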
>>>>
>>>> On Wed, Sep 26, 2012 at 1:27 PM, Ali Saidi <[email protected]> wrote:
>>>>
>>>>>
>>>>> Hi Mitch,
>>>>>
>>>>>
>>>>> I wonder if this happens in the steady state.  With the current
>>>>> implementation, the store-set predictor should predict that the store
>>>>> is going to conflict with the load and order the two.  Perhaps it
>>>>> isn't getting trained correctly with LL/SC ops.  You really don't want
>>>>> to mark the ops as serializing, as that slows down the CPU quite a bit.
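>>>>>
>>>>> For reference, the store-set idea in a heavily simplified sketch
>>>>> (after Chrysos and Emer; table size and indexing here are made up,
>>>>> not O3's actual implementation):
>>>>>
>>>>>     #include <cstdint>
>>>>>
>>>>>     const int TABLE = 1024;
>>>>>     int ssit[TABLE];  // load/store PC -> store-set id (0 = none)
>>>>>
>>>>>     // On a memory-order violation, train: assign the violating
>>>>>     // load and store to the same store set, so next time the load
>>>>>     // waits for that store instead of issuing past it.
>>>>>     void onViolation(uint64_t loadPC, uint64_t storePC) {
>>>>>         int id = int(storePC % TABLE) + 1;
>>>>>         ssit[loadPC % TABLE] = id;
>>>>>         ssit[storePC % TABLE] = id;
>>>>>     }
>>>>>
>>>>>     // At issue, a load with a nonzero store-set id must wait for
>>>>>     // the youngest in-flight store in that set (tracked elsewhere).
>>>>>     bool mustWait(uint64_t loadPC) {
>>>>>         return ssit[loadPC % TABLE] != 0;
>>>>>     }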
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Ali
>>>>>
>>>>>
>>>>> On 26.09.2012 13:14, Mitch Hayenga wrote:
>>>>>
>>>>> Background:
>>>>> I have a non-O3, out-of-order CPU implemented in gem5.  Since I don't
>>>>> have a checker implemented yet, I tend to diff committed instructions
>>>>> against O3.  Yesterday's patches caused a few of these diffs to change
>>>>> because of load-linked/store-conditional behavior (better prediction
>>>>> on data ops that write the PC leads to denser load/store scheduling).
>>>>> Issue:
>>>>> It seems O3's own loads/stores can cause its
>>>>> load-linked/store-conditional pairs to fail.  Previously, running a
>>>>> single core under SE mode, every load-linked/store-conditional pair
>>>>> would succeed.  Now I'm observing them failing 21% of the time (on
>>>>> single-threaded programs).  Although the programs still work
>>>>> functionally given how LL/SC is currently coded, I think this points
>>>>> to the fact that LL/SC should be handled slightly differently.
>>>>> Example:
>>>>> Here is a trace from "Hello World" on ARM+O3+Single Core+SE+Classic
>>>>> Memory that shows this.  It contains locks because the C++ library
>>>>> is, I assume, thread-safe.
>>>>> http://pastebin.com/sNjTPBWY
>>>>> The O3 CPU is effectively doing a "Test and TestAndSet".  It looks
>>>>> like the load for the Test and the load-linked race for memory.
>>>>> Also, the CPU has a pending writeback to the same line.  So
>>>>> effectively, the TestAndSet fails (I haven't dug into it to determine
>>>>> whether it was the racing load or the writeback that caused the
>>>>> failure).
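>>>>>
>>>>> The pattern in question, roughly (a generic test-and-test-and-set
>>>>> lock, not the library's actual code; on ARM the compare-exchange
>>>>> compiles down to an ldrex/strex pair):
>>>>>
>>>>>     #include <atomic>
>>>>>
>>>>>     std::atomic<int> lock{0};
>>>>>
>>>>>     void acquire() {
>>>>>         for (;;) {
>>>>>             // "Test": a plain load (the racing load above).
>>>>>             while (lock.load(std::memory_order_relaxed) != 0)
>>>>>                 ;
>>>>>             // "TestAndSet": ldrex/strex on ARM.
>>>>>             int expected = 0;
>>>>>             if (lock.compare_exchange_weak(expected, 1,
>>>>>                                            std::memory_order_acquire))
>>>>>                 return;
>>>>>         }
>>>>>     }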
>>>>> Given this, shouldn't load-linked (in this case ldrex) instructions be
>>>>> marked as non-speculative (or one of the other flags) so that they don't
>>>>> contend with earlier operations?
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>
_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
