On 10/04/2016 11:45 AM, Richard Biener wrote:
> On Thu, Sep 15, 2016 at 12:00 PM, Martin Liška <mli...@suse.cz> wrote:
>> On 09/07/2016 02:09 PM, Richard Biener wrote:
>>> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška <mli...@suse.cz> wrote:
>>>> On 08/18/2016 06:06 PM, Richard Biener wrote:
>>>>> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek <ja...@redhat.com> wrote:
>>>>>> On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:
>>>>>>>> I'd prefer to make updates atomic in multi-threaded applications.
>>>>>>>> The best proxy we have for that is -pthread.
>>>>>>>> Is it slower, most definitely, but odds are we're giving folks
>>>>>>>> garbage data otherwise, which in many ways is even worse.
>>>>>>> It will likely be catastrophically slower in some cases.
>>>>>>> Catastrophically as in too slow to be usable.
>>>>>>> An atomic instruction is a lot more expensive than a single
>>>>>>> increment.  Also they sometimes are really slow depending on the
>>>>>>> state of the machine.
>>>>>> Can't we just have thread-local copies of all the counters (perhaps
>>>>>> using __thread pointer as base) and just atomically merge at thread
>>>>>> termination?
>>>>> I suggested that as well, but of course it'll have its own class of
>>>>> issues (short-lived threads, so we need to somehow re-use counters from
>>>>> terminated threads; a large number of threads, and thus using too much
>>>>> memory for the counters).
>>>>> Richard.
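
For illustration, here is a minimal sketch of that scheme, assuming
invented identifiers (N_COUNTERS, count_arc and merge_counters are not
libgcov API): instrumented code does plain increments on a __thread
copy, and the copy is merged into the shared counters once, at thread
termination (e.g. from a pthread key destructor), so the expensive
atomics happen once per thread rather than once per counted event.

#include <stdint.h>
#include <stdatomic.h>

#define N_COUNTERS 1024

static _Atomic uint64_t global_counters[N_COUNTERS]; /* shared */
static __thread uint64_t local_counters[N_COUNTERS]; /* per thread */

/* What the instrumented code would do: a plain, cheap increment.  */
static inline void
count_arc (unsigned idx)
{
  local_counters[idx]++;
}

/* Run once per terminating thread, e.g. as a pthread key destructor.  */
static void
merge_counters (void)
{
  for (unsigned i = 0; i < N_COUNTERS; i++)
    if (local_counters[i])
      atomic_fetch_add_explicit (&global_counters[i], local_counters[i],
                                 memory_order_relaxed);
}

This sidesteps per-event atomics, at the cost Richard notes above:
counter memory grows with the number of threads, and counters of
short-lived threads need to be recycled.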
>>>> Hello.
>>>> I've put the approach on my TODO list; let's see whether it's doable
>>>> in a reasonable amount of time.
>>>> I've just finished some measurements to illustrate the slow-down of the
>>>> -fprofile-update=atomic approach.
>>>> All numbers are: no profile, -fprofile-generate, -fprofile-generate
>>>> -fprofile-update=atomic
>>>> c-ray benchmark (utilizing 8 threads, -O3): 1.7s, 15.5s, 38.1s
>>>> unrar (utilizing 8 threads, -O3): 3.6s, 11.6s, 38s
>>>> tramp3d (1 thread, -O3): 18.0s, 46.6s, 168s
>>>> So the slow-down is roughly 300% compared to -fprofile-generate.  I don't
>>>> have much experience with default option selection, but these numbers can
>>>> probably help.
>>>> Thoughts?
>>> Look at the generated code for an instrumented simple loop and see that for
>>> the non-atomic updates we happily apply store-motion to the counter update
>>> and thus we only get one counter update per loop exit rather than one per
>>> loop iteration.  Now see what happens for the atomic case (I suspect you
>>> get one per iteration).
>>> I'll bet this accounts for most of the slowdown.
>>> Back then, ICC, which had atomic counter updates (but using function
>>> calls - ugh!), had a > 1000% overhead with FDO for tramp3d (they also
>>> didn't have early inlining -- removing abstraction helps to reduce the
>>> number of counters significantly).
>>> Richard.
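
To make the store-motion point concrete, here is a hedged sketch of
what the instrumented loop effectively becomes in the two modes
(arc_counter is a made-up name, not actual gcov-generated code):

static long long arc_counter; /* hypothetical arc counter slot */

void
non_atomic_version (int n)
{
  long long tmp = arc_counter;
  for (int i = 0; i < n; i++)
    {
      /* ... loop body ... */
      tmp++;               /* plain increment, kept in a register */
    }
  arc_counter = tmp;       /* store motion: one store at the loop exit */
}

void
atomic_version (int n)
{
  for (int i = 0; i < n; i++)
    {
      /* ... loop body ... */
      /* One atomic RMW per iteration; it cannot be hoisted out.  */
      __atomic_fetch_add (&arc_counter, 1, __ATOMIC_RELAXED);
    }
}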
>> Hi.
>> During Cauldron I discussed with Richi approaches for speeding up arc
>> profile counter updates.  My first attempt is to utilize TLS storage, where
>> every function accumulates its arc counters.  These are eventually added
>> (using atomic operations) to the global counters at the very end of the
>> function.
>> Currently I rely on target support for TLS; it's questionable whether we
>> should impose such a requirement on -fprofile-update=atomic, or add a new
>> option value like -fprofile-update=atomic-tls instead.
>> Running the patch on tramp3d, compared to the previous numbers, it takes
>> 88s to finish; that's about 50% of the time of the current
>> -fprofile-update=atomic implementation (168s).
>> Thoughts?
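
A rough sketch of the idea, with invented identifiers (FN_NUM_ARCS,
tls_arcs and shared_arcs are not taken from the actual patch): the
function body updates a TLS buffer with plain increments, and the
buffer is flushed to the shared counters with one atomic add per arc
on function exit, instead of one atomic add per arc execution.

#include <stdint.h>

#define FN_NUM_ARCS 3 /* hypothetical number of arcs in one function */

static int64_t shared_arcs[FN_NUM_ARCS];        /* global counters */
static __thread int64_t tls_arcs[FN_NUM_ARCS];  /* per-thread buffer */

void
instrumented_fn (int x)
{
  tls_arcs[0]++;                /* plain increments in the body ... */
  if (x)
    tls_arcs[1]++;
  else
    tls_arcs[2]++;

  /* ... and one flush on every exit from the function.  */
  for (unsigned i = 0; i < FN_NUM_ARCS; i++)
    {
      __atomic_fetch_add (&shared_arcs[i], tls_arcs[i], __ATOMIC_RELAXED);
      tls_arcs[i] = 0;
    }
}

The win comes from functions whose bodies execute arcs many times
(loops); straight-line functions still pay one atomic per arc per call.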
> Hmm, I thought I suggested that you can simply use automatic storage
> (which effectively is TLS...) for regions that are not forked or
> abnormally left (which means SESE regions that have no calls that
> eventually terminate or throw externally).
> So why did you end up with TLS?


Using TLS does not make sense here, stupid mistake ;)

By SESE regions, do you mean the infrastructure that is utilized by the
Graphite machinery?
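
For reference, a minimal sketch of the automatic-storage variant as I
read Richard's suggestion (hot_loop and arc_counter are hypothetical):
within an SESE region that contains no calls that could throw or
terminate, the counter can live on the stack and be published with a
single atomic add at the region exit, with no TLS involved at all.

#include <stdint.h>
#include <stdatomic.h>

static _Atomic int64_t arc_counter; /* shared counter slot */

void
hot_loop (int n)
{
  int64_t local = 0;      /* automatic storage: thread-private for free */
  for (int i = 0; i < n; i++)
    {
      /* ... loop body without calls ... */
      local++;
    }
  /* Single atomic update when the SESE region is left.  */
  atomic_fetch_add_explicit (&arc_counter, local, memory_order_relaxed);
}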


> Richard.
>> Martin
>>>> Martin
>>>>>>      Jakub
