On Thu, Dec 20, 2012 at 11:35 AM, Rong Xu <x...@google.com> wrote:
> we have this patch primarily for getting valid profile counts. we
> observe that for some high-threaded programs, we are getting poor
> counter due to data racing of counter update (like counter value is
> only 15% of what it supposed to be for a 10-thread program).

I have seen much worse on Octeon running with 32-threaded program.  I
think it was only 1% of what it should have been.


>
> In general, enabling atomic updates slows down programs. (for my some
> of my toy programs, it has 3x slow down.) And that the reason I use
> options to control value and edge profile count.

I think on Octeon, the atomic updates would be a speedup because of
the atomic instruction which was added explicitly for incrementing a
statistics counter.  Internally at Cavium, I might just turn this on
by default as it even helps the one thread case :).

Thanks,
Andrew Pinski

>
> -Rong
>
> On Thu, Dec 20, 2012 at 8:57 AM, Andrew Pinski <pins...@gmail.com> wrote:
>> On Thu, Dec 20, 2012 at 8:20 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>>>> On Wed, Dec 19, 2012 at 4:29 PM, Andrew Pinski <pins...@gmail.com> wrote:
>>>> >
>>>> > On Wed, Dec 19, 2012 at 12:08 PM, Rong Xu <x...@google.com> wrote:
>>>> > > Hi,
>>>> > >
>>>> > > This patch adds the supprot of atomic update the profile counters.
>>>> > > Tested with google internal benchmarks and fdo kernel build.
>>>> >
>>>> > I think you should use the __atomic_ functions instead of __sync_
>>>> > functions as they allow better performance for simple counters as you
>>>> > can use __ATOMIC_RELAXED.
>>>>
>>>> You are right. I think __ATOMIC_RELAXED should be OK here.
>>>> Thanks for the suggestion.
>>>>
>>>> >
>>>> > And this would be useful for the trunk also.  I was going to implement
>>>> > this exact thing this week but some other important stuff came up.
>>>>
>>>> I'll post trunk patch later.
>>>
>>> Yes, I like that patch, too. Even if the costs are quite high (and this is 
>>> why
>>> atomic updates was sort of voted down in the past) the alternative of using 
>>> TLS
>>> has problems with too-much per-thread memory.
>>
>> Actually sometimes (on some processors) atomic increments are cheaper
>> than doing a regular incremental.  Mainly because there is an
>> instruction which can handle it in the L2 cache rather than populating
>> the L1.   Octeon is one such processor where this is true.
>>
>> Thanks,
>> Andrew Pinski
>>
>>>
>>> While there are even more alternatives, like recording the changes and
>>> commmiting them in blocks (say at function return), I guess some solution is
>>> better than no solution.
>>>
>>> Thanks,
>>> Honza

Reply via email to