Richard Henderson <r...@twiddle.net> writes:
> On 06/01/2018 02:32 AM, Richard Sandiford wrote:
>> Richard Henderson <r...@twiddle.net> writes:
>>> I spoke with Ramana about these at HKG18, and I'm finally getting back to
>>> these.  I have routines for
>>>
>>> -rw-rw-r--. 1 rth rth 2538 May 30 19:12 memchr.S
>>> -rw-rw-r--. 1 rth rth 2405 May 30 20:49 memcmp.S
>>> -rw-rw-r--. 1 rth rth 2385 May 30 19:12 rawmemchr.S
>>> -rw-rw-r--. 1 rth rth 2470 May 30 19:12 strchrnul.S
>>> -rw-rw-r--. 1 rth rth 2588 May 30 19:12 strchr.S
>>> -rw-rw-r--. 1 rth rth 2370 May 30 19:12 strcmp.S
>>> -rw-rw-r--. 1 rth rth 2403 May 30 19:12 strcpy.S
>>> -rw-rw-r--. 1 rth rth 2263 May 30 19:12 strlen.S
>>> -rw-rw-r--. 1 rth rth 2595 May 30 19:12 strncmp.S
>>> -rw-rw-r--. 1 rth rth 2344 May 30 19:12 strnlen.S
>>> -rw-rw-r--. 1 rth rth 3105 May 30 19:12 strrchr.S
>>>
>>> The tests pass when run under Foundation Platform 11.3.  What is the best
>>> way to submit these for review and upstreaming?  There's nothing in the
>>> git README about an upstream mailing list...
>>>
>>> FWIW, my code is at
>>>
>>>   https://github.com/rth7680/cortex-strings/tree/rth/sve
>> 
>> Thanks for doing these.  One general comment is that the routines
>> tend to use the FFR result even in the case where no potential
>> fault is detected.  Although it's not as obvious as it could be
>> from some of the published documentation, the architecturally-
>> preferred approach is instead to have the "normal" case depend only
>> on the flags set by RDFFRS, not on the FFR itself.
>
> Is it possible to elaborate on the reasons for that?
> In some cases, the uses of the FFR are so trivial (e.g. strlen)
> that I can use the unpredicated RDFFR instead of the predicated.

Yeah, RDFFRS is preferred even then.  The reason is that the FFR
accumulates information across multiple loads (all LDFFs and LDNFs
since the last SETFFR), and so an instruction that reads it might not
be able to issue until all those loads have completed.  Structuring the
loop so that the normal path depends only on the RDFFRS flags should
mean that this path can be issued in a similar way to loops that use
ordinary loads.
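
As a rough illustration of that structure, here is a scalar C model of
a strlen-style loop.  It is only a sketch and not real SVE code: VL,
model_ldff1b, first_zero, model_strlen and the "limit" parameter (the
first offset treated as unmapped) are all names invented for this
example, and the model assumes an all-true governing predicate.  The
number of lanes loaded stands in for the FFR, and "all VL lanes loaded"
stands in for the last-element flag that RDFFRS sets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VL 16 /* assumed vector length in bytes, for this model only */

/* Model of a first-faulting byte load: copy up to VL bytes starting at
   base[off], stopping at `limit`, the first offset treated as unmapped.
   The return value (lanes loaded) stands in for the FFR contents. */
static size_t model_ldff1b(const char *base, size_t off, size_t limit,
                           unsigned char *dst)
{
    size_t n = 0;
    while (n < VL && off + n < limit) {
        dst[n] = (unsigned char)base[off + n];
        n++;
    }
    return n;
}

/* Index of the first zero byte among the first n lanes, or n if none. */
static size_t first_zero(const unsigned char *v, size_t n)
{
    size_t i = 0;
    while (i < n && v[i] != 0)
        i++;
    return i;
}

/* strlen model.  The caller must guarantee a NUL before `limit`.
   *slow_iters counts trips through the path that reads the FFR value. */
static size_t model_strlen(const char *s, size_t limit, unsigned *slow_iters)
{
    unsigned char v[VL];
    size_t len = 0;

    for (;;) {
        size_t valid = model_ldff1b(s, len, limit, v);
        assert(valid > 0);              /* first lane would fault for real */
        bool last_set = (valid == VL);  /* the flag RDFFRS would provide */

        if (last_set) {
            /* Normal path: uses only the flag, never the FFR value. */
            size_t i = first_zero(v, VL);
            if (i < VL)
                return len + i;
            len += VL;                  /* constant step, INCB-style */
        } else {
            /* Rare path: restrict the search to the lanes the FFR marks
               valid, then step by their count (INCP-style).  Real code
               would also re-initialise the FFR with SETFFR here. */
            ++*slow_iters;
            size_t i = first_zero(v, valid);
            if (i < valid)
                return len + i;
            len += valid;
        }
    }
}
```

For a 40-character string, calling model_strlen with limit = 64 stays on
the normal path throughout, while limit = 41 takes the rare path once,
on the final partial vector.  Both calls return 40.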

>> Also, using INCB, INCH, INCW and INCD is architecturally preferred over
>> INCP in cases where either could be used.  So if the above loop has a
>> pointer or byte index Xm, and if Pg is all-true, it would be better to do:
>
> This is much easier to understand -- that adding a near constant is preferred
> to popcount over up to 256 bits.  Avoiding INCP for strlen rearranges the loop
> in just the sort of ways you say are preferred for RDFFR.

Yeah.

>> The idea is that the B.NLAST should be highly predictable,
>> so it's usually not necessary to wait for the FFR value to become
>> available.  And in practice, getting a precise FFR predicate is likely
>> to be a slow operation (to the extent that this is an ISA-level
>> principle rather than a uarch optimisation).
>
> Interesting.  From the language that I read, which was essentially
> "first-faulting loads can fail for any reason", I assumed that it
> happened more often than what you're implying here -- that it should
> be on the rare side.

That's more a trade-off between giving software a guarantee about the
specific situations in which a potential fault is and isn't flagged
vs. giving implementations the leeway to do something efficient.
It's not really saying how frequent the "potential fault" case is.

Thanks,
Richard
