Re: k-byte memset/memcpy/strlen builtins

2017-01-12 Thread Martin Sebor

On 01/11/2017 09:16 AM, Robin Dapp wrote:

> Hi,
>
> When examining the performance of some test cases on s390 I realized
> that we could do better for constructs like 2-byte memcpys or
> 2-byte/4-byte memsets. Due to some s390-specific architectural
> properties, we could be faster by e.g. avoiding excessive unrolling and
> using dedicated memory instructions (or similar).


There are at least two enhancement requests in Bugzilla to improve
memcmp when one or more of the arguments are constant: bugs 12086
and 78257 (the former for constant small lengths and the latter
for constant byte arrays).  It seems that one or both of these might
also benefit from some of your ideas and/or vice versa.
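For illustration only, the kind of call those requests are about might
look like the sketch below. The function names are made up for this
example and are not taken from either bug report.

```c
#include <stdbool.h>
#include <string.h>

/* memcmp with a small constant length and a constant second argument.
   Hypothetical helper name, for illustration only.  */
bool
starts_with_ab (const char *p)
{
  return memcmp (p, "ab", 2) == 0;
}

/* What one would hope the call above can be reduced to: two byte
   compares (or a single 2-byte compare) and no library call.  */
bool
starts_with_ab_by_hand (const char *p)
{
  return p[0] == 'a' && p[1] == 'b';
}
```

Both functions should agree whenever at least two bytes are readable
at p.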

Martin


> For 1-byte memset/memcpy the builtin functions provide a straightforward
> way to achieve this. At first sight it seemed possible to extend
> tree-loop-distribution.c to include the additional variants we need.
> However, multibyte memsets/memcpys are not covered by the C standard and
> I'm therefore unsure if such an approach is preferable or if there are
> more idiomatic ways or places to add the functionality.
>
> The same question goes for 2-byte strlen. I didn't see a recognition
> pattern for strlen (apart from optimizations due to known string length
> in tree-ssa-strlen.c). Would it make sense to include strlen recognition
> and subsequent handling for 2-byte strlen? The situation might of
> course be more complicated than memset because of encodings etc. My
> snippet in question used a fixed-length encoding of 2 bytes, however.
>
> Another simple idea to tackle this would be a peephole optimization but
> I'm not sure if this is really feasible for something like memset.
> Wouldn't the peephole have to be recursive then?
>
> Regards
>  Robin





Re: k-byte memset/memcpy/strlen builtins

2017-01-12 Thread Richard Biener
On Thu, Jan 12, 2017 at 9:26 AM, Robin Dapp wrote:
>> Yes, for memset with larger element we could add an optab plus
>> internal function combination and use that when the target wants.  Or
>> always use such IFN and fall back to loopy expansion.
>
> So, adding additional patterns in tree-loop-distribution.c (and mapping
> them to dedicated optabs) is fine? Or does the yes refer to the
> "else"/"or" part of my question (how would the backend recognize the
> patterns then)?

Yes, enhancing tree-loop-distribution.c with extra patterns is fine (hey, there
were supposed to be patterns for all of lapack & friends ... ;))

The only question is whether loop distribution should always create the IFN
or do so only when the backend provides an optab, so that expansion is
trivial (rather than needing to rebuild a loop that performs the operation).

Richard.

>> I'd say a multibyte memchr might make sense, but strlen specifically?
>> Not sure.
>
> ok, memchr would also work for the snippet I have in mind.
>
> Regards
>  Robin
>


Re: k-byte memset/memcpy/strlen builtins

2017-01-12 Thread Robin Dapp
> Yes, for memset with larger element we could add an optab plus
> internal function combination and use that when the target wants.  Or
> always use such IFN and fall back to loopy expansion.

So, adding additional patterns in tree-loop-distribution.c (and mapping
them to dedicated optabs) is fine? Or does the yes refer to the
"else"/"or" part of my question (how would the backend recognize the
patterns then)?

> I'd say a multibyte memchr might make sense, but strlen specifically?
> Not sure.

ok, memchr would also work for the snippet I have in mind.

Regards
 Robin



Re: k-byte memset/memcpy/strlen builtins

2017-01-11 Thread Aaron Sawdey
On Wed, 2017-01-11 at 17:16 +0100, Robin Dapp wrote:
> Hi,

Hi Robin,
  I thought I'd share some of what I've run into while doing similar
things for the rs6000 target.

First off, be aware that in some cases glibc uses macro expansion to try
to handle 1/2/3-byte string operations.

Secondly, the way I approached this was to use the patterns 
defined in optabs.def for these things:

OPTAB_D (cmpmem_optab, "cmpmem$a")
OPTAB_D (cmpstr_optab, "cmpstr$a")
OPTAB_D (cmpstrn_optab, "cmpstrn$a")
OPTAB_D (movmem_optab, "movmem$a")
OPTAB_D (setmem_optab, "setmem$a")
OPTAB_D (strlen_optab, "strlen$a")

If you define movmemsi, that should get used by expand_builtin_memcpy
for any memcpy call that it sees.
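
As a hedged illustration of the kind of call this concerns: a small
constant-size memcpy like the one below is what reaches the builtin
expander. The helper name is invented for this example, and whether a
target's movmem pattern or some other inline expansion ends up handling
the copy depends on the backend.

```c
#include <stdint.h>
#include <string.h>

/* Load a 16-bit value from a possibly unaligned address using a
   2-byte memcpy.  Hypothetical helper, for illustration only.  */
uint16_t
load_u16 (const void *src)
{
  uint16_t v;
  memcpy (&v, src, sizeof v);
  return v;
}
```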

The constraints I was able to find when implementing cmpmemsi for
memcmp were:
 * don't compare past the given length (obviously)
 * don't read past the given length
 * except it's ok to do so if you can prove via alignment or
   runtime check that you are not going to cause a pagefault.
   Not crossing a 4k boundary seems to be generally viewed as
   acceptable.
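
The 4k-boundary condition above can be sketched as a small runtime
check. This is only an illustration under the assumption of 4096-byte
pages; the macro and function names are made up here and do not come
from the rs6000 code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096u  /* assumption: 4k pages, as in the text */

/* Return true if reading WIDTH bytes starting at address ADDR stays
   within a single 4k page.  WIDTH must be between 1 and 4096.  */
bool
read_stays_in_page (uintptr_t addr, size_t width)
{
  uintptr_t offset = addr & (ASSUMED_PAGE_SIZE - 1);
  return offset + width <= ASSUMED_PAGE_SIZE;
}
```

If the check succeeds, a wide load that reads a few bytes past the
given length cannot touch a different page than the bytes that are
known to be valid.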

I would recommend looking at the preprocessed code to make sure no funny
business is happening, and then looking at your .md files. It looks to me
like s390 already has both movmem and strlen patterns there.

If I understand correctly, you want to handle multi-byte characters.
It seems to me you need to follow the path Richard Biener suggests and
make optab expansions that handle wider chars, and then perhaps map
wcslen et al. to them?

   Aaron
> 
> For 1-byte memset/memcpy the builtin functions provide a straightforward
> way to achieve this. At first sight it seemed possible to extend
> tree-loop-distribution.c to include the additional variants we need.
> However, multibyte memsets/memcpys are not covered by the C standard and
> I'm therefore unsure if such an approach is preferable or if there are
> more idiomatic ways or places to add the functionality.
>
> The same question goes for 2-byte strlen. I didn't see a recognition
> pattern for strlen (apart from optimizations due to known string length
> in tree-ssa-strlen.c). Would it make sense to include strlen recognition
> and subsequent handling for 2-byte strlen? The situation might of
> course be more complicated than memset because of encodings etc. My
> snippet in question used a fixed-length encoding of 2 bytes, however.
>
> Another simple idea to tackle this would be a peephole optimization but
> I'm not sure if this is really feasible for something like memset.
> Wouldn't the peephole have to be recursive then?
>
> Regards
>  Robin
> 
-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: k-byte memset/memcpy/strlen builtins

2017-01-11 Thread Richard Biener
On January 11, 2017 5:16:43 PM GMT+01:00, Robin Dapp wrote:
>Hi,
>
>When examining the performance of some test cases on s390 I realized
>that we could do better for constructs like 2-byte memcpys or
>2-byte/4-byte memsets. Due to some s390-specific architectural
>properties, we could be faster by e.g. avoiding excessive unrolling and
>using dedicated memory instructions (or similar).

Not sure why you mention memcpy, how does that depend on 'element size'?

>For 1-byte memset/memcpy the builtin functions provide a straightforward
>way to achieve this. At first sight it seemed possible to extend
>tree-loop-distribution.c to include the additional variants we need.
>However, multibyte memsets/memcpys are not covered by the C standard and
>I'm therefore unsure if such an approach is preferable or if there are
>more idiomatic ways or places to add the functionality.

Yes, for memset with larger element we could add an optab plus internal 
function combination and use that when the target wants.  Or always use such 
IFN and fall back to loopy expansion.
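
To make the semantics concrete, here is a minimal sketch of what such a
"memset with larger element" would compute, written as the plain loopy
form. The function name and signature are hypothetical; this is not an
existing builtin or internal function.

```c
#include <stddef.h>
#include <string.h>

/* Store COUNT copies of the ELEM_SIZE-byte object at ELEM into DST.
   Hypothetical name; this is the straightforward loop that an
   expansion without target support would amount to.  */
void
fill_elements (void *dst, const void *elem, size_t elem_size, size_t count)
{
  unsigned char *p = dst;
  for (size_t i = 0; i < count; i++)
    {
      memcpy (p, elem, elem_size);
      p += elem_size;
    }
}
```

With elem_size == 1 this degenerates to an ordinary memset.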

>The same question goes for 2-byte strlen. I didn't see a recognition
>pattern for strlen (apart from optimizations due to known string length
>in tree-ssa-strlen.c). Would it make sense to include strlen recognition
>and subsequent handling for 2-byte strlen? The situation might of

I'd say a multibyte memchr might make sense, but strlen specifically?  Not sure.

Likewise multibyte memcmp.
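
A minimal sketch of what a 2-byte memchr would mean, assuming 16-bit
elements and an element count rather than a byte count. The name
memchr16 is made up for this example; no such standard function exists.

```c
#include <stddef.h>
#include <stdint.h>

/* Return a pointer to the first of the N 16-bit elements at S that
   equals C, or NULL if there is none.  Hypothetical name.  */
const uint16_t *
memchr16 (const uint16_t *s, uint16_t c, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (s[i] == c)
      return &s[i];
  return NULL;
}
```

Given an upper bound n on the buffer size, the length of a
zero-terminated 2-byte string would then be memchr16 (s, 0, n) - s
whenever the result is not NULL.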

Richard.

>course be more complicated than memset because of encodings etc. My snippet
>in question used a fixed-length encoding of 2 bytes, however.
>
>Another simple idea to tackle this would be a peephole optimization but
>I'm not sure if this is really feasible for something like memset.
>Wouldn't the peephole have to be recursive then?
>
>Regards
> Robin



k-byte memset/memcpy/strlen builtins

2017-01-11 Thread Robin Dapp
Hi,

When examining the performance of some test cases on s390 I realized
that we could do better for constructs like 2-byte memcpys or
2-byte/4-byte memsets. Due to some s390-specific architectural
properties, we could be faster by e.g. avoiding excessive unrolling and
using dedicated memory instructions (or similar).

For 1-byte memset/memcpy the builtin functions provide a straightforward
way to achieve this. At first sight it seemed possible to extend
tree-loop-distribution.c to include the additional variants we need.
However, multibyte memsets/memcpys are not covered by the C standard and
I'm therefore unsure if such an approach is preferable or if there are
more idiomatic ways or places to add the functionality.
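
To illustrate the difference, here is a minimal sketch with made-up
function names. The first loop has the shape of a 1-byte memset, which
has a standard C counterpart; the second is the 2-byte variant, for
which no standard function exists when the two bytes of the value
differ.

```c
#include <stddef.h>
#include <stdint.h>

/* 1-byte fill: equivalent to memset (dst, value, n).  */
void
fill_u8 (uint8_t *dst, uint8_t value, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = value;
}

/* 2-byte fill: same loop shape, but there is no standard library
   function to map it to in general.  */
void
fill_u16 (uint16_t *dst, uint16_t value, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = value;
}
```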

The same question goes for 2-byte strlen. I didn't see a recognition
pattern for strlen (apart from optimizations due to known string length
in tree-ssa-strlen.c). Would it make sense to include strlen recognition
and subsequent handling for 2-byte strlen? The situation might of
course be more complicated than memset because of encodings etc. My snippet
in question used a fixed-length encoding of 2 bytes, however.
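
For reference, a sketch of the kind of loop meant by "2-byte strlen",
assuming a fixed-length 2-byte encoding terminated by a 16-bit zero.
The name strlen16 is invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the 16-bit elements before the terminating zero element.
   Hypothetical name; assumes a fixed 2-byte encoding.  */
size_t
strlen16 (const uint16_t *s)
{
  size_t n = 0;
  while (s[n] != 0)
    n++;
  return n;
}
```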

Another simple idea to tackle this would be a peephole optimization but
I'm not sure if this is really feasible for something like memset.
Wouldn't the peephole have to be recursive then?

Regards
 Robin