Any updates, Mitch?
Thanks,
Ali

On 11.10.2012 20:44, Mitch Hayenga wrote:

> Hi,
>
> I have a patch that fixes this in classic and ruby. I was waiting for another student (Dibakar, he runs a lot more parallel code than I do) to test it out before submitting to the reviewboard. I'll bug him and see if he's tested it out yet.
>
> On Thu, Oct 11, 2012 at 7:32 PM, Ali Saidi <[email protected]> wrote:
>
>> Hi Mitch,
>>
>> Did you end up getting it working?
>>
>> Thanks,
>> Ali
>>
>> On Sep 26, 2012, at 3:39 PM, Steve Reinhardt wrote:
>>
>>> That's a reasonable hardware implementation. Actually you need a register per hardware thread context, not just per core.
>>>
>>> Our software implementation is intended to model such a hardware implementation, but the actual software is different for a couple of reasons. The main one is that we don't want to do two address-based lookups on every access; CAMs are much cheaper in HW than in SW. Associating the LL state with each cache block means you can check the lock state much more cheaply than iterating over a set of lock registers, particularly in the common case where there are no locks. Also, the cache typically doesn't know how many CPUs or SMT thread contexts it's supporting, so it's tricky to allocate the right number of registers; the block-based model avoids this problem.
>>>
>>> I think you're right that the one thing we're not emulating properly is that the recorded lock range should be tight and not be implicitly expanded to cover the whole block as we've done. So you've convinced me that that's not just the most straightforward fix, but probably the right one.
>>>
>>> If you get it working, please submit the patch.
>>>
>>> Thanks!
>>>
>>> Steve
>>>
>>> On Wed, Sep 26, 2012 at 1:25 PM, Mitch Hayenga <[email protected]> wrote:
>>>
>>>> Hmm, I had normally thought that LL/SC were handled with special address-range registers at the cache controller.
>>>> Since a core should really only have one outstanding LL/SC pair, a register per core would suffice and exactly encode the range, basically doing the same thing that your finer-grained locks within the cache block would achieve.
>>>>
>>>> On Wed, Sep 26, 2012 at 3:08 PM, Steve Reinhardt <[email protected]> wrote:
>>>>
>>>>> This is a pretty interesting issue. I'm not sure how it would be handled in practice. Since the loads and stores in question are not to the same address, in theory at least the store-set predictor should not be involved. My guess is that the most straightforward fix would be to record the actual range of the LL in the request structure and only clear the lock flag on a store if the store truly overlaps (not just if it's to the same block).
>>>>>
>>>>> Steve
>>>>>
>>>>> On Wed, Sep 26, 2012 at 12:50 PM, Mitch Hayenga <[email protected]> wrote:
>>>>>
>>>>>> Thanks for the reply.
>>>>>>
>>>>>> Thinking about this... I don't know too much about the O3 store-set predictor, but it would seem that load-linked instructions should care about the entire cache line, not just whether a store happens to overlap. It looks like the pending stores write to the address range [0xf9c2c-0xf9c33], but the load-linked is to [0xf9c28-0xf9c2b] (non-overlapping, same cache line). So the load issues early, but the stores come in and clear the lock from the cache line. So either non-LLSC stores (from the same core) shouldn't clear the locks on a cache line (src/cache/blk.hh:279), or the store-set predictor should hold the linked load until the stores (to the same cache line, but not overlapping) have written back. Dibakar, another grad student here, says this impacts Ruby as well.
>>>>>>
>>>>>> On Wed, Sep 26, 2012 at 1:27 PM, Ali Saidi <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Mitch,
>>>>>>>
>>>>>>> I wonder if this happens in the steady state? With the implementation the store-set predictor should predict that the store is going to conflict with the load and order them.
>>>>>>> Perhaps that isn't getting trained correctly with LLSC ops. You really don't want to mark the ops as serializing, as that slows down the CPU quite a bit.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Ali
>>>>>>>
>>>>>>> On 26.09.2012 13:14, Mitch Hayenga wrote:
>>>>>>>
>>>>>>>> Background:
>>>>>>>> I have a non-O3, out-of-order CPU implemented on gem5. Since I don't have a checker implemented yet, I tend to diff committed instructions vs. O3. Yesterday's patches caused a few of these diffs to change because of load-linked/store-conditional behavior (better prediction on data ops that write the PC leads to denser load/store scheduling).
>>>>>>>>
>>>>>>>> Issue:
>>>>>>>> It seems O3's own loads/stores can cause its load-linked/store-conditional pair to fail. Previously, running a single core under SE, every load-linked/store-conditional pair would succeed. Now I'm observing them failing 21% of the time (on single-threaded programs). Although the programs functionally work given how the LL/SC is coded currently, I think this points to the fact that LL/SC should be handled slightly differently.
>>>>>>>>
>>>>>>>> Example:
>>>>>>>> From "Hello World" on ARM+O3+Single Core+SE+Classic Memory that shows this. This contains locks because I assume the C++ library is thread-safe.
>>>>>>>> http://pastebin.com/sNjTPBWY
>>>>>>>>
>>>>>>>> The O3 CPU is effectively doing a "Test and TestAndSet". It looks like the load for the Test and the load-linked race for memory. Also, the CPU has a pending writeback to the same line. So effectively, the TestAndSet fails (I haven't dug into it to determine whether it was the racing load or the writeback that caused the failure).
>>>>>>>>
>>>>>>>> Given this, shouldn't load-linked (in this case ldrex) instructions be marked as non-speculative (or one of the other flags) so that they don't contend with earlier operations?
>>>>>>>>
>>>>>>>> Thanks.
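For reference, the byte-range overlap test Steve suggests can be sketched in a few lines of C++. This is a standalone illustration, not gem5's actual code: the `LockRecord` struct, its member names, and the 64-byte line size are assumptions. The arithmetic shows why the pending stores [0xf9c2c-0xf9c33] from the trace share a cache line with the load-linked range [0xf9c28-0xf9c2b] without sharing a single byte, so a tight-range lock would survive where a block-granularity lock is cleared.

```cpp
#include <cstdint>

// Hypothetical sketch of the proposed fix: record the exact byte range
// of a load-linked and clear the lock only when a later store truly
// overlaps that range, rather than whenever it touches the same block.
struct LockRecord {
    uint64_t addr;   // first byte of the LL access
    unsigned size;   // size of the LL access in bytes
    bool valid;      // lock still held?

    // Half-open interval intersection test against a store's range.
    bool overlaps(uint64_t sAddr, unsigned sSize) const {
        return sAddr < addr + size && addr < sAddr + sSize;
    }

    // Ordinary (non-SC) store seen by the cache: with the tight-range
    // rule, a same-block but non-overlapping store leaves the lock alone.
    void onStore(uint64_t sAddr, unsigned sSize) {
        if (valid && overlaps(sAddr, sSize))
            valid = false;   // a later SC will now fail
    }
};

// Assumed 64-byte cache line, just for the same-line demonstration.
inline uint64_t lineOf(uint64_t addr) { return addr >> 6; }
```

With the addresses above, `lineOf(0xf9c28) == lineOf(0xf9c33)` while `overlaps(0xf9c2c, 8)` is false, which is exactly the case where the current block-granularity check clears the lock but a tight-range check would not.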
_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
