Re: [PATCH] builtin expansion of memcmp for powerpc

2016-09-24 Thread Aaron Sawdey
On Sat, 2016-09-24 at 09:32 -0400, David Edelsohn wrote:
> This patch broke bootstrap on AIX.
> 
> The ChangeLog entry was not committed to the gcc/ChangeLog file.

Changelog entry added. Forgot to write that emacs buffer before commit.

> TARGET_LITTLE_ENDIAN only is defined for PPC64 Linux.  This patch
> should be using !BYTES_BIG_ENDIAN.

Change made, I will commit as obvious once I bootstrap to double check
my work on ppc64le.

Sorry for the mess ...

  Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: [PATCH] builtin expansion of memcmp for powerpc

2016-09-24 Thread David Edelsohn
This patch broke bootstrap on AIX.

The ChangeLog entry was not committed to the gcc/ChangeLog file.

TARGET_LITTLE_ENDIAN only is defined for PPC64 Linux.  This patch
should be using !BYTES_BIG_ENDIAN.

Thanks, David


Re: [PATCH] builtin expansion of memcmp for powerpc

2016-09-22 Thread Segher Boessenkool
Hi Aaron,

On Thu, Sep 22, 2016 at 03:10:24PM -0500, Aaron Sawdey wrote:
> The powerpc target had a movmemsi pattern which supports memcpy() but
> did not have anything for memcmp(). This adds support for builtin
> expansion of memcmp() into inline code for modest constant lengths.
> Performance on power8 is in the range of 3-7x faster than calling
> memcmp() for lengths under 40 bytes.
> 
> Bootstrap on powerpc64le, regtest in progress, OK for trunk if no new
> regressions?

[ You are testing on BE and 32-bit as well, thanks. ]

Just a bit more formatting, and one nit:

> 2016-09-22  Aaron Sawdey  
> 
PR target/77685   <<<--- please add this here
>   * config/rs6000/rs6000.md (cmpmemsi): New define_expand.
>   * config/rs6000/rs6000.c (expand_block_compare): New function used by
>   cmpmemsi pattern to do builtin expansion of memcmp().

Space before (, here and below.

>   (compute_current_alignment): Add helper function for
>   expand_block_compare used to compute alignment as the compare proceeds.
>   (select_block_compare_mode): Used by expand_block_compare to select
>   the mode used for reading the next chunk of bytes in the compare.
>   (do_load_for_compare): Used by expand_block_compare to emit the load
>   insns for the compare.
>   (rs6000_emit_dot_insn): Moved this function to avoid a forward
>   reference from expand_block_compare().
>   * config/rs6000/rs6000-protos.h (expand_block_compare): Add a
>   prototype for this function.
>   * config/rs6000/rs6000.opt (mblock-compare-inline-limit): Add a new
>   target option for controlling how much code inline expansion of
>   memcmp() will be allowed to generate.

You can just say "New function." and the like for most things here, fwiw.

> +/* Figure out the correct instructions to generate to load data for
> +   block compare.  MODE is used for the read from memory, and
> +   data is zero extended if REG is wider than MODE.  If LE code
> +   is being generated, bswap loads are used.

"Emit a load for a block compare", maybe?  I.e. say it emits code, and
it does only one load.

> +  /* final fallback is do one byte */

Capital, dot space space.

> +  /* If this is not a fixed size compare, just call memcmp */

Dot space space.

> +  /* This must be a fixed size alignment */

Dot space space.

> +  /* SLOW_UNALIGNED_ACCESS -- don't do unaligned stuff */

Full sentence.

> +  /* Anything to move? */

Question mark space space.

> +  int offset = 0;
> +  machine_mode load_mode =
> +select_block_compare_mode (offset, bytes, base_align, word_mode_ok);
> +  int load_mode_size = GET_MODE_SIZE (load_mode);

HOST_WIDE_INT instead of the ints?  It isn't slower and I find it hard to
convince myself int always works.

> +   int extra_bytes = load_mode_size - bytes;

Here too.

> +  int remain = bytes - cmp_bytes;

One more.

Very nice improvement, thank you!  This is okay for trunk.


Segher


[PATCH] builtin expansion of memcmp for powerpc

2016-09-22 Thread Aaron Sawdey
The powerpc target had a movmemsi pattern which supports memcpy() but
did not have anything for memcmp(). This adds support for builtin
expansion of memcmp() into inline code for modest constant lengths.
Performance on power8 is in the range of 3-7x faster than calling
memcmp() for lengths under 40 bytes.

Bootstrap on powerpc64le, regtest in progress, OK for trunk if no new
regressions?

2016-09-22  Aaron Sawdey  

* config/rs6000/rs6000.md (cmpmemsi): New define_expand.
* config/rs6000/rs6000.c (expand_block_compare): New function used by
cmpmemsi pattern to do builtin expansion of memcmp().
(compute_current_alignment): Add helper function for
expand_block_compare used to compute alignment as the compare proceeds.
(select_block_compare_mode): Used by expand_block_compare to select
the mode used for reading the next chunk of bytes in the compare.
(do_load_for_compare): Used by expand_block_compare to emit the load
insns for the compare.
(rs6000_emit_dot_insn): Moved this function to avoid a forward
reference from expand_block_compare().
* config/rs6000/rs6000-protos.h (expand_block_compare): Add a
prototype for this function.
* config/rs6000/rs6000.opt (mblock-compare-inline-limit): Add a new
target option for controlling how much code inline expansion of
memcmp() will be allowed to generate.

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC ToolchainIndex: gcc/config/rs6000/rs6000-protos.h
===
--- gcc/config/rs6000/rs6000-protos.h	(revision 240286)
+++ gcc/config/rs6000/rs6000-protos.h	(working copy)
@@ -77,6 +77,7 @@
 extern void rs6000_scale_v2df (rtx, rtx, int);
 extern int expand_block_clear (rtx[]);
 extern int expand_block_move (rtx[]);
+extern bool expand_block_compare (rtx[]);
 extern const char * rs6000_output_load_multiple (rtx[]);
 extern bool rs6000_is_valid_mask (rtx, int *, int *, machine_mode);
 extern bool rs6000_is_valid_and_mask (rtx, machine_mode);
Index: gcc/config/rs6000/rs6000.c
===
--- gcc/config/rs6000/rs6000.c	(revision 240286)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -18423,7 +18423,462 @@
   return 1;
 }
 
+/* Emit a potentially record-form instruction, setting DST from SRC.
+   If DOT is 0, that is all; otherwise, set CCREG to the result of the
+   signed comparison of DST with zero.  If DOT is 1, the generated RTL
+   doesn't care about the DST result; if DOT is 2, it does.  If CCREG
+   is CR0 do a single dot insn (as a PARALLEL); otherwise, do a SET and
+   a separate COMPARE.  */
+
+static void
+rs6000_emit_dot_insn (rtx dst, rtx src, int dot, rtx ccreg)
+{
+  if (dot == 0)
+{
+  emit_move_insn (dst, src);
+  return;
+}
+
+  if (cc_reg_not_cr0_operand (ccreg, CCmode))
+{
+  emit_move_insn (dst, src);
+  emit_move_insn (ccreg, gen_rtx_COMPARE (CCmode, dst, const0_rtx));
+  return;
+}
+
+  rtx ccset = gen_rtx_SET (ccreg, gen_rtx_COMPARE (CCmode, src, const0_rtx));
+  if (dot == 1)
+{
+  rtx clobber = gen_rtx_CLOBBER (VOIDmode, dst);
+  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, ccset, clobber)));
+}
+  else
+{
+  rtx set = gen_rtx_SET (dst, src);
+  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, ccset, set)));
+}
+}
+
 
+/* Figure out the correct instructions to generate to load data for
+   block compare.  MODE is used for the read from memory, and
+   data is zero extended if REG is wider than MODE.  If LE code
+   is being generated, bswap loads are used.
+
+   REG is the destination register to move the data into.
+   MEM is the memory block being read.
+   MODE is the mode of memory to use for the read.  */
+static void
+do_load_for_compare (rtx reg, rtx mem, machine_mode mode)
+{
+  switch (GET_MODE (reg))
+{
+case DImode:
+  switch (mode)
+	{
+	case QImode:
+	  emit_insn (gen_zero_extendqidi2 (reg, mem));
+	  break;
+	case HImode:
+	  {
+	rtx src = mem;
+	if (TARGET_LITTLE_ENDIAN)
+	  {
+		src = gen_reg_rtx (HImode);
+		emit_insn (gen_bswaphi2 (src, mem));
+	  }
+	emit_insn (gen_zero_extendhidi2 (reg, src));
+	break;
+	  }
+	case SImode:
+	  {
+	rtx src = mem;
+	if (TARGET_LITTLE_ENDIAN)
+	  {
+		src = gen_reg_rtx (SImode);
+		emit_insn (gen_bswapsi2 (src, mem));
+	  }
+	emit_insn (gen_zero_extendsidi2 (reg, src));
+	  }
+	  break;
+	case DImode:
+	  if (TARGET_LITTLE_ENDIAN)
+	emit_insn (gen_bswapdi2 (reg, mem));
+	  else
+	emit_insn (gen_movdi (reg, mem));
+	  break;
+	default:
+	  gcc_unreachable ();
+	}
+  break;
+
+case SImode:
+  switch (mode)
+	{
+	case QImode:
+	  emit_insn (gen_zero_extendqisi2 (reg, mem));
+	  break;
+	case HImode:
+	  {
+	rtx src = mem;