On Sun, Sep 17, 2017 at 4:41 PM, Kugan Vivekanandarajah
<kugan.vivekanandara...@linaro.org> wrote:
> Hi Andrew,
>
> On 15 September 2017 at 13:36, Andrew Pinski <pins...@gmail.com> wrote:
>> On Thu, Sep 14, 2017 at 6:33 PM, Kugan Vivekanandarajah
>> <kugan.vivekanandara...@linaro.org> wrote:
>>> This patch adds aarch64_loop_unroll_adjust to limit partial unrolling
>>> in RTL based on the strided loads in the loop.
>>
>> Can you expand on this some more?  Like give an example of where this
>> helps?  I am trying to better understand your counting scheme, since
>> it seems like the count is based on the number of loads and not on
>> cache lines.
>
> This is a simplified model, and I am assuming here that the
> prefetcher will tune itself based on the observed memory accesses. I
> don't have access to the internals of how this is implemented in
> different microarchitectures, but I am assuming (in a simplified
> sense) that hardware logic will detect memory access patterns and use
> them to prefetch the cache lines. If there are memory accesses like
> the ones you have shown that fall within the same cache line, they
> may be combined, but the prefetcher still needs to detect and tune
> for them. Detecting them at compile time is also not always easy, so
> this is a simplified model.
>
>> What do you mean by a strided load?
>> Doesn't this function overcount when you have:
>> for (int i = 1; i < 1024; i++)
>>   {
>>     t += a[i-1] * a[i];
>>   }
>> if it is counting based on cache lines rather than based on load addresses?
> Sorry for my terminology. What I mean by a strided access is any
> memory access of the form memory[iv]. I am counting memory[iv] and
> memory[iv+1] as two different streams; they may or may not fall into
> the same cache line.
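[The per-stream counting Kugan describes can be sketched as follows. This is a hypothetical illustration, not GCC's actual strided_load_p/count_strided_load_rtl code: every distinct (base, offset) access pattern is counted as its own stream, so memory[iv] and memory[iv+1] count as two even when they share a cache line.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of per-address stream counting: each distinct
   (base, constant offset) pair is one stream, regardless of whether
   two offsets land in the same cache line. */

struct access { int base_id; ptrdiff_t offset; };

int count_streams(const struct access *a, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        int seen = 0;
        for (int j = 0; j < i; j++)
            if (a[j].base_id == a[i].base_id && a[j].offset == a[i].offset)
                seen = 1;   /* identical access pattern: same stream */
        if (!seen)
            count++;        /* a[iv] and a[iv+1] differ, so: two streams */
    }
    return count;
}
```

Under this scheme the a[i-1]*a[i] loop above contributes two streams, which is exactly the overcounting Andrew is asking about.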
>
>>
>> It also seems to do some weird counting when you have:
>> for (int i = 1; i < 1024; i++)
>>   {
>>     t += a[(i-1)*N + i] * a[i*N + i];
>>   }
>>
>> That is:
>> (PLUS (REG) (REG))
>>
>> Also seems to overcount when loading from the same pointer twice.
>
> If you prefer to count on a cache-line basis, then yes, I am counting
> it twice; that is intentional.
>
>>
>> In my micro-arch, the number of prefetch slots is based on cache line
>> miss so this would be overcounting by a factor of 2.
>
> I am not entirely sure this will be useful for all cores. It has been
> shown to be beneficial for falkor, based on what is done in LLVM.

Can you share at least one benchmark or microbenchmark which shows the
benefit?  I can't seem to understand how the falkor core's hardware
prefetcher works well enough to see whether this is beneficial even
there.

Thanks,
Andrew

>
> Thanks,
> Kugan
>>
>> Thanks,
>> Andrew
>>
>>>
>>> Thanks,
>>> Kugan
>>>
>>> gcc/ChangeLog:
>>>
>>> 2017-09-12  Kugan Vivekanandarajah  <kug...@linaro.org>
>>>
>>>     * cfgloop.h (iv_analyze_biv): Export.
>>>     * loop-iv.c (iv_analyze_biv): Likewise.
>>>     * config/aarch64/aarch64.c (strided_load_p): New.
>>>     (insn_has_strided_load): New.
>>>     (count_strided_load_rtl): New.
>>>     (aarch64_loop_unroll_adjust): New.
