Ping?

I see that Jim has clarified the comments from Andrew.

Thanks,
Kugan

On 13 October 2017 at 08:48, Jim Wilson <wil...@tuliptree.org> wrote:
> On Fri, 2017-09-22 at 14:11 -0700, Andrew Pinski wrote:
>> On Fri, Sep 22, 2017 at 11:39 AM, Jim Wilson <jim.wil...@linaro.org>
>> wrote:
>> >
>> > On Fri, Sep 22, 2017 at 10:58 AM, Andrew Pinski <pins...@gmail.com>
>> > wrote:
>> > >
>> > > Two overall comments:
>> > > * What about splitting register_offset into two different
>> > > elements,
>> > > one for non 128bit modes and one for 128bit (and more; OI, etc.)
>> > > modes
>> > > so you get better address generation right away for the simd load
>> > > cases rather than having LRA/reload having to reload the address
>> > > into
>> > > a register.
>> > I'm not sure if changing register_offset cost would make a
>> > difference,
>> > since costs are usually used during optimization, not during
>> > address
>> > generation.  This is something that I didn't think to try
>> > though.  I
>> > can try taking a look at this.
>> It is taken into account when fwprop is propagating the addition
>> into
>> the MEM (the tree level is always a_1 = POINTER_PLUS_EXPR;
>> MEM_REF(a_1)).
>> IV-OPTS will produce much better code if the address_cost is correct.
>>
>> It looks like no other pass (combine, etc.) would take that into
>> account except for postreload CSE but maybe they should.
>
> I tried increasing the cost of register_offset.  This got rid of the
> reg+reg addressing mode in the middle of the main loop for lmbench
> stream copy, but did not eliminate it after the main loop.
>
> The tree optimized dump has
>   _52 = a_15 + _51;
>   _53 = c_17 + _51;
>   _54 = *_52;
>   *_53 = _54;
> and the RTL expand dump has
> (insn 64 63 65 10 (set (reg:DF 96 [ _54 ])
>         (mem:DF (plus:DI (reg/v/f:DI 78 [ a ])
>                 (reg:DI 93 [ _51 ])) [3 *_52+0 S8 A64])) "stream.c":223 -1
>      (nil))
> (insn 65 64 66 10 (set (mem:DF (plus:DI (reg/v/f:DI 79 [ c ])
>                 (reg:DI 93 [ _51 ])) [3 *_53+0 S8 A64])
>         (reg:DF 96 [ _54 ])) "stream.c":223 -1
>      (nil))
>
> That may be fixable, but there is a bigger problem here which is that
> increasing the costs of register_offset affects both loads and stores.
> On Falkor, it is only quad-word stores that are inefficient with a
> reg+reg address.  Quad-word loads with a reg+reg address are faster
> than the equivalent add/ldr.  Disabling reg+reg address for quad-word
> loads will hurt performance.
>
> Since the address cost stuff makes no distinction between loads and
> stores, I see no way to get the result I need by using address costs.
> I can only get the result I need by modifying the md file.
>
>> > I did try writing a patch to modify predicates to disallow reg
>> > offset
>> > for 128bit modes, and that got complicated, as I had to split apart
>> > a
>> > number of patterns in the aarch64-simd.md file that accept both VD
>> > and
>> > VQ modes.  I ended up with a patch 3-4 times as big as the one I
>> > submitted, without any additional performance improvement, so it
>> > wasn't worth the trouble.
>> >
>> > >
>> > > * Maybe adding a testcase to the testsuite to show this change.
>> > Yes, I can add a testcase.
>> >
>> > >
>> > > One extra comment:
>> > > * should we change the generic tuning to avoid reg+reg for 128bit
>> > > modes?
>> > Are there other targets with a similar problem?  I only know that
>> > it
>> > is a problem for Falkor.  It might be a loss for some targets as it
>> > is
>> > replacing one instruction with two.
>> Well that is why I was suggesting the address cost model change.
>> Because the cost model change actually might provide better code in
>> the first place and still allow for reasonable generic code to be
>> produced.
>
> The patch I posted only affects Falkor.  It doesn't change generic
> code.  I don't know of any reason why we need to change generic code
> here.
>
> The Falkor core has out-of-order execution and multiple function units,
> so there isn't any noticeable performance gain from trying to fix this
> earlier.  Fixing this with a md file change gives optimal performance
> for the testcases I've looked at.
>
> Since I'm no longer at Linaro, I expect that someone else will take
> over this patch submission.  I will create a bug report to document the
> issue, to make it easier to track it and hand off to someone else.
>
> Jim
>
