Re: DWZ 0.14 released

2021-03-09 Thread Allan Sandfeld Jensen
Btw, question for gcc/binutils

Any reason the work done by tools like dwz couldn't be done in the compiler or 
linker? Seems a bit odd to have a post-linker that optimizes the generated 
code, when optimizations should already be enabled.

Best regards
Allan

On Montag, 8. März 2021 13:43:11 CET Tom de Vries wrote:
> Hi,
> 
> DWZ 0.14 has been released.
> 
> You can download dwz from the sourceware FTP server here:
> 
> https://sourceware.org/ftp/dwz/releases/
> ftp://sourceware.org/pub/dwz/releases/
> 
> The vital stats:
> 
>   Size    md5sum                            Name
>   184KiB  cf60e4a65d9cc38c7cdb366e9a29ca8e  dwz-0.14.tar.gz
>   144KiB  1f1225898bd40d63041d54454fcda5b6  dwz-0.14.tar.xz
> 
> There is a web page for DWZ at:
> 
> https://sourceware.org/dwz/
> 
> DWZ 0.14 includes the following changes and enhancements:
> 
> * DWARF 5 support. The tool now handles most of DWARF version 5
>   (at least everything emitted by GCC when using -gdwarf-5).
> 
>   Not yet supported are DW_UT_type units (DWARF 4 .debug_types
>   are supported), .debug_names (.gdb_index is supported) and some
>   forms and sections that are only emitted by GCC when
>   generating Split DWARF (DW_FORM_strx and .debug_str_offsets,
>   DW_FORM_addrx and .debug_addr, DW_FORM_rnglistx and
>   DW_FORM_loclistx). https://sourceware.org/PR24726
> 
> * .debug_sup support. DWARF Supplementary Object Files
>   (DWARF 5, section 7.3.6) can now be generated when using
>   the --dwarf-5 option. To keep compatibility with existing DWARF
>   consumers this isn't the default yet.
> 
>   Without the --dwarf-5 option, dwz will generate a .gnu_debugaltlink
>   section instead of a .debug_sup section, and will use
>   DW_FORM_GNU_strp_alt and DW_FORM_GNU_ref_alt instead of
>   DW_FORM_strp_sup and DW_FORM_ref_sup.
> 
> * An experimental optimization has been added that exploits the
>   One-Definition-Rule of C++.  It's enabled using the --odr option, and
>   off by default.  This optimization causes struct/union/class DIEs with
>   the same name to be considered equal.  The optimization can be set to
>   a lower aggressiveness level using --odr-mode=basic, to possibly be
>   able to work around problems without having to switch off the
>   optimization altogether.
> 
> * The clean-up of temporary files in hardlink mode has been fixed.
> 
> * The DIE limits --low-mem-die-limit  / -l  and
>   --max-die-limit  / -L  can now be disabled using respectively
>   -l none and -L none.  Note that -l none disables the limit, whereas
>   -l 0 sets the limit to zero.
> 
> * The usage message has been:
>   - updated to show that -r and -M are exclusive.
>   - updated to show that -v and -? cannot be combined with other options.
>   - extended to list all options in detail.
>   - restyled to wrap at 80 chars.
> 
> * An option --no-import-optimize was added that switches off an
>   optimization that attempts to reduce the number of
>   DW_TAG_imported_unit DIEs.  This can be used, for instance, in case
>   the optimization takes too long.
> 
> * A heuristic has been added that claims more memory earlier (without
>   increasing the peak memory usage) to improve compression time.
> 
> * A heuristic has been added that estimates whether one of the two DIE
>   limits will be hit.  If so, it will do an exact DIE count to verify
>   this.  If the exact DIE count finds that the low-mem DIE limit is
>   indeed hit, processing is done in low-mem mode from the start, rather
>   than processing in regular mode first.  If the exact DIE count finds
>   that the max DIE limit is indeed hit, processing is skipped
>   altogether.
> 
> * Various other performance improvements.
> 
> * A case where previously we would either hit the assertion
>   "dwz: dwz.c:9461: write_die: Assertion `refd != NULL' failed" (in
>   regular mode) or a segmentation fault (in low-mem mode), now is
>   handled by "dwz: Couldn't find DIE at DW_FORM_ref_addr offset 0x".
> 
> * A case where a reference from a partial unit to a compile unit was
>   generated has been fixed.  This could happen if a DIE was referenced
>   using a CU-relative DWARF operator.
> 
> * A case in low-mem mode has been fixed where, instead of issuing
>   "dwz: Couldn't find DIE referenced by DW_OP_GNU_implicit_pointer",
>   dwz would run into a segfault.
> 
> * A multi-file case where we run into ".debug_line reference above end
>   of section" has been fixed.
> 
> * The following assertion failures were fixed:
>   - dwz: dwz.c:9310: write_die: Assertion
>     `value && refdcu->cu_kind != CU_ALT' failed.
>   - dwz: dwz.c:9920: recompute_abbrevs: Assertion
>     `off == cu_size' failed.
> 
> * The assert condition of this assertion has been fixed:
>   - write_types: Assertion `ref && ref->die_dup == NULL'.






Re: [RFC] Increase libstdc++ line length to 100(?) columns

2020-11-30 Thread Allan Sandfeld Jensen
On Montag, 30. November 2020 16:47:08 CET Michael Matz wrote:
> Hello,
> 
> On Sun, 29 Nov 2020, Allan Sandfeld Jensen wrote:
> > On Sonntag, 29. November 2020 18:38:15 CET Florian Weimer wrote:
> > > * Allan Sandfeld Jensen:
> > > > If you _do_ change it, I would suggest changing it to 120, which is
> > > > the next common step for a lot of C++ projects.
> > > 
> > > 120 can be problematic for a full HD screen in portrait mode.  Nine
> > > pixels per character is not a lot (it's what VGA used), and you can't
> > > have any window decoration.  With a good font and screen, it's doable.
> > > But if the screen isn't quite sharp, then I think you wouldn't be able
> > > to use portrait mode anymore.
> > 
> > Using a standard condensed monospace font of 9px, it has a width of 7px,
> > 120
> A char width of 7px implies a cell width of at least 8px (so 960px for 120
> chars), more often of 9px.  With your cell width of 7px your characters
> will be max 6px, symmetric characters will be 5px, which is really small.
> 
I was talking about the full cell width. I tested it before commenting, 
measuring the width in pixels of a line of text.

'Allan




Re: [RFC] Increase libstdc++ line length to 100(?) columns

2020-11-29 Thread Allan Sandfeld Jensen
On Sonntag, 29. November 2020 18:38:15 CET Florian Weimer wrote:
> * Allan Sandfeld Jensen:
> > If you _do_ change it, I would suggest changing it to 120, which is the
> > next common step for a lot of C++ projects.
> 
> 120 can be problematic for a full HD screen in portrait mode.  Nine
> pixels per character is not a lot (it's what VGA used), and you can't
> have any window decoration.  With a good font and screen, it's doable.
> But if the screen isn't quite sharp, then I think you wouldn't be able
> to use portrait mode anymore.

Using a standard condensed monospace font of 9px, it has a cell width of 7px; 
120 chars would take up 840px, fitting two windows in landscape mode and one 
in portrait. 9px isn't fuzzy, and 8px variants are even narrower.

Sure, using square monospace fonts might not fit, but that is an unusual 
configuration and easily worked around by living with a non-square monospace 
font, or by accepting occasional line overflow. Remember, nobody is suggesting 
every line should be that long, just allowing it, to enable better structural 
indentation.

'Allan




Re: [RFC] Increase libstdc++ line length to 100(?) columns

2020-11-27 Thread Allan Sandfeld Jensen
On Freitag, 27. November 2020 00:50:57 CET Jonathan Wakely via Gcc wrote:
> I've touched on the subject a few times, e.g.
> https://gcc.gnu.org/pipermail/gcc/2019-December/230993.html
> and https://gcc.gnu.org/pipermail/gcc/2019-December/231013.html
> 
> Libstdc++ code is indented by 2 columns for the enclosing namespace,
> usually another two for being in a template, and is full of __
> prefixes for reserved names. On top of that, modern C++ declarations
> are *noisy* (template head, requires-clause, noexcept-specifier, often
> 'constexpr' or 'inline' and 'explicit', and maybe some attributes).
> 
> All that gets hard to fit in 80 columns without compromising
> readability with line breaks in unnatural places.
> 
> Does anybody object to raising the line length for libstdc++ code
> (not the rest of GCC) to 100 columns?
> 
If you _do_ change it, I would suggest changing it to 120, which is the next 
common step for a lot of C++ projects.

Often also with an allowance for overruns if that makes the code cleaner.

'Allan





Re: RFC: -fno-share-inlines

2020-08-24 Thread Allan Sandfeld Jensen
On Montag, 24. August 2020 08:52:04 CEST Richard Biener wrote:
> On Mon, Aug 10, 2020 at 9:36 AM Allan Sandfeld Jensen
> 
>  wrote:
> > Following the previous discussion, this is a proposal for a patch that
> > adds
> > the flag -fno-share-inlines that can be used when compiling singular
> > source
> > files with a different set of flags than the rest of the project.
> > 
> > It basically turns off comdat for inline functions, as if you compiled
> > without support for 'weak' symbols. Turning them all into "static"
> > functions, even if that wouldn't normally be possible for that type of
> > function. Not sure if it breaks anything, which is why I am not sending
> > it to the patch list.
> > 
> > I also considered alternatively to turn the comdat generation off later
> > during assembler production to ensure all processing and optimization of
> > comdat functions would occur as normal.
> 
> We already have -fvisibility-inlines-hidden so maybe call it
> -fvisibility-inlines-static?
That could make sense.

> Does this option also imply 'static' vtables?
> 
I don't think so. It affects function declarations. It probably also affects 
virtual functions, but as far as I know the place I changed only affects the 
functions, not the virtual table. So there could still be a duplicated virtual 
table with the wrong methods; whether it is used globally then depends on 
linking. I am not sure I dare change that: having different virtual tables 
for the same class depending on where it is allocated sounds like a way to 
break things.

I wouldn't use virtual tables anyway when dealing with mixed architectures.

Best regards
Allan




Re: Peephole optimisation: isWhitespace()

2020-08-17 Thread Allan Sandfeld Jensen
On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:
> Hi @ll,
> 
> in his ACM queue article ,
> Matt Godbolt used the function
> 
> | bool isWhitespace(char c)
> | {
> | 
> | return c == ' '
> | 
> |   || c == '\r'
> |   || c == '\n'
> |   || c == '\t';
> | 
> | }
> 
> as an example, for which GCC 9.1 emits the following assembly for AMD64
> 
> processors (see ):
> |xor    eax, eax          ; result = false
> |cmp    dil, 32           ; is c > 32?
> |ja     .L4               ; if so, exit with false
> |movabs rax, 4294977024   ; rax = 0x100002600
> |shrx   rax, rax, rdi     ; rax >>= c
> |and    eax, 1            ; result = rax & 1
> |
> |.L4:
> |ret
> 
No it doesn't. As your example shows if you take the time to read it, that is 
what gcc emits when generating code for a _haswell_ architecture. If you 
remove -march=haswell from the command line you get:

xor eax, eax
cmp dil, 32
ja  .L1
movabs  rax, 4294977024
mov ecx, edi
shr rax, cl
and eax, 1

It uses one mov more, but no shrx. 

'Allan




RFC: -fno-share-inlines

2020-08-10 Thread Allan Sandfeld Jensen
Following the previous discussion, this is a proposal for a patch that adds 
the flag -fno-share-inlines, which can be used when compiling individual 
source files with a different set of flags than the rest of the project.

It basically turns off comdat for inline functions, as if you compiled without 
support for 'weak' symbols, turning them all into "static" functions, even if 
that wouldn't normally be possible for that type of function. I am not sure if 
it breaks anything, which is why I am not sending it to the patch list.

I also considered, as an alternative, turning the comdat generation off later, 
during assembler production, to ensure all processing and optimization of 
comdat functions would occur as normal.

Best regards
Allan

diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index 2b1aca16eb4..78e1f592126 100644
--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -1803,6 +1803,10 @@ frtti
 C++ ObjC++ Optimization Var(flag_rtti) Init(1)
 Generate run time type descriptor information.

+fshare-inlines
+C C++ ObjC ObjC++ Var(flag_share_inlines) Init(1)
+Emit non-inlined copies of inline-declared functions to be shared between object files.
+
 fshort-enums
 C ObjC C++ ObjC++ LTO Optimization Var(flag_short_enums)
 Use the narrowest integer type possible for enumeration types.
diff --git a/gcc/cp/decl2.c b/gcc/cp/decl2.c
index 33c83773d33..8de796d16fc 100644
--- a/gcc/cp/decl2.c
+++ b/gcc/cp/decl2.c
@@ -1957,10 +1957,9 @@ adjust_var_decl_tls_model (tree decl)
 void
 comdat_linkage (tree decl)
 {
-  if (flag_weak)
-make_decl_one_only (decl, cxx_comdat_group (decl));
-  else if (TREE_CODE (decl) == FUNCTION_DECL
-	   || (VAR_P (decl) && DECL_ARTIFICIAL (decl)))
+  if ((!flag_share_inlines || !flag_weak)
+  && (TREE_CODE (decl) == FUNCTION_DECL
+	  || (VAR_P (decl) && DECL_ARTIFICIAL (decl))))
 /* We can just emit function and compiler-generated variables
statically; having multiple copies is (for the most part) only
a waste of space.
@@ -1978,6 +1977,8 @@ comdat_linkage (tree decl)
should perform a string comparison, rather than an address
comparison.  */
 TREE_PUBLIC (decl) = 0;
+  else if (flag_weak)
+make_decl_one_only (decl, cxx_comdat_group (decl));
   else
 {
   /* Static data member template instantiations, however, cannot


Re: Non-inlined functions and mixed architectures

2020-08-04 Thread Allan Sandfeld Jensen
On Dienstag, 4. August 2020 19:44:57 CEST Florian Weimer wrote:
> * Allan Sandfeld Jensen:
> > On Montag, 27. Juli 2020 10:54:02 CEST Florian Weimer wrote:
> >> * Allan Sandfeld Jensen:
> >> > On Montag, 27. Juli 2020 10:33:35 CEST Florian Weimer wrote:
> >> >> * Allan Sandfeld Jensen:
> >> >> > A problem that I keep running into is functions defined headers, but
> >> >> > used
> >> >> > in sources files that are compiled with different CPU feature flags
> >> >> > (for
> >> >> > runtime CPU feature selection).
> >> >> > 
> >> >> > We know to make sure the functions are inlinable and their address
> >> >> > never
> >> >> > taken, but of course in debug builds they are still not inlined.
> >> >> > Every
> >> >> > so
> >> >> > often the functions get compiled using some of the optional CPU
> >> >> > instructions, and if the linker selects the optimized versions those
> >> >> > instructions can then leak through to instances compiled with
> >> >> > different
> >> >> > CPU flags where the instructions aren't supposed to be used. This
> >> >> > happens
> >> >> > even in unoptimized debug builds as the extended instruction
> >> >> > selections
> >> >> > doesn't count as an optimization.
> >> >> 
> >> >> You need to provide source code examples.  This isn't supposed to
> >> >> happen
> >> >> if you declare the functions as static inline.  If a function is
> >> >> emitted
> >> >> for any reason, it will be local to this particular object file.
> >> >> 
> >> >> Plain inline (for C++) works differently and will attempt to share
> >> >> implementations.
> >> > 
> >> > static inline? Hadn't thought of that in a shared header file.
> >> > 
> >> > Is harder to do with inline methods in C++ classes though.
> >> 
> >> Ahh, and anonymous namespaces (the equivalent for that for member
> >> functions) do not work in such cases because the representation of the
> >> class still needs to be shared across API boundaries.  With an anonymous
> >> namespace, that would be undefined.
> > 
> > So, would it be possible to have a gcc extension or future C++ attribute
> > that worked like static on global functions but could be used on member
> > functions (both static and otherwise)?
> > 
> > Perhaps make it universal so it did the same no matter where it was used
> > instead of being contextual like 'static'.
> 
> One caveat is that things get somewhat interesting if such a function
> returns an object of static or thread storage duration.  In your
> application, you probably want to globalize such objects because they
> are all equivalent.  But there might be other cases where this is
> different.
> 
> vtables are tricky as well, but you probably avoid them in your
> scenario.
> 
Right, vtables would be a completely different story. I guess it would only 
make sense for non-virtual inline-declared methods, which means a universal 
attribute doesn't make sense.

Application controlled runtime CPU switching with C++ interfaces will remain 
an unreliable hack.

Thanks
Allan







Re: Non-inlined functions and mixed architectures

2020-08-04 Thread Allan Sandfeld Jensen
On Montag, 27. Juli 2020 10:54:02 CEST Florian Weimer wrote:
> * Allan Sandfeld Jensen:
> > On Montag, 27. Juli 2020 10:33:35 CEST Florian Weimer wrote:
> >> * Allan Sandfeld Jensen:
> >> > A problem that I keep running into is functions defined headers, but
> >> > used
> >> > in sources files that are compiled with different CPU feature flags
> >> > (for
> >> > runtime CPU feature selection).
> >> > 
> >> > We know to make sure the functions are inlinable and their address
> >> > never
> >> > taken, but of course in debug builds they are still not inlined. Every
> >> > so
> >> > often the functions get compiled using some of the optional CPU
> >> > instructions, and if the linker selects the optimized versions those
> >> > instructions can then leak through to instances compiled with different
> >> > CPU flags where the instructions aren't supposed to be used. This
> >> > happens
> >> > even in unoptimized debug builds as the extended instruction selections
> >> > doesn't count as an optimization.
> >> 
> >> You need to provide source code examples.  This isn't supposed to happen
> >> if you declare the functions as static inline.  If a function is emitted
> >> for any reason, it will be local to this particular object file.
> >> 
> >> Plain inline (for C++) works differently and will attempt to share
> >> implementations.
> > 
> > static inline? Hadn't thought of that in a shared header file.
> > 
> > Is harder to do with inline methods in C++ classes though.
> 
> Ahh, and anonymous namespaces (the equivalent for that for member
> functions) do not work in such cases because the representation of the
> class still needs to be shared across API boundaries.  With an anonymous
> namespace, that would be undefined.
> 
So, would it be possible to have a gcc extension or future C++ attribute that 
worked like static on global functions but could be used on member functions 
(both static and otherwise)?

Perhaps make it universal so it did the same no matter where it was used 
instead of being contextual like 'static'.

Best regards
'Allan




Re: Non-inlined functions and mixed architectures

2020-07-27 Thread Allan Sandfeld Jensen
On Montag, 27. Juli 2020 10:33:35 CEST Florian Weimer wrote:
> * Allan Sandfeld Jensen:
> > A problem that I keep running into is functions defined headers, but used
> > in sources files that are compiled with different CPU feature flags (for
> > runtime CPU feature selection).
> > 
> > We know to make sure the functions are inlinable and their address never
> > taken, but of course in debug builds they are still not inlined. Every so
> > often the functions get compiled using some of the optional CPU
> > instructions, and if the linker selects the optimized versions those
> > instructions can then leak through to instances compiled with different
> > CPU flags where the instructions aren't supposed to be used. This happens
> > even in unoptimized debug builds as the extended instruction selections
> > doesn't count as an optimization.
> 
> You need to provide source code examples.  This isn't supposed to happen
> if you declare the functions as static inline.  If a function is emitted
> for any reason, it will be local to this particular object file.
> 
> Plain inline (for C++) works differently and will attempt to share
> implementations.
> 
static inline? Hadn't thought of that in a shared header file.

It is harder to do with inline methods in C++ classes though.

A recent example I ran into was methods using a qfloat16 class that 
specializes for F16C when available, see https://codereview.qt-project.org/c/
qt/qtbase/+/307772, which I guess ought to be split into different classes 
with different constructors, so they don't violate ODR rules and are really 
safe across compilers.

But I guess a case like https://codereview.qt-project.org/c/qt/qtbase/+/308163 
could be solved with static inline instead.

Best regards
Allan




Non-inlined functions and mixed architectures

2020-07-23 Thread Allan Sandfeld Jensen
A problem that I keep running into is functions defined in headers, but used 
in source files that are compiled with different CPU feature flags (for 
runtime CPU feature selection).

We know to make sure the functions are inlinable and their address is never 
taken, but of course in debug builds they are still not inlined. Every so 
often the functions get compiled using some of the optional CPU instructions, 
and if the linker selects the optimized versions, those instructions can then 
leak through to instances compiled with different CPU flags where the 
instructions aren't supposed to be used. This happens even in unoptimized 
debug builds, as the extended instruction selection doesn't count as an 
optimization.

So far the main workaround for gcc has been to mark the functions as 
always_inline.

I have been wondering if you couldn't use the same technique you used to fix 
similar problems with mixed archs in LTO builds, and tag shared functions with 
their arch so they don't get merged by the linker?

I know the whole thing could technically be seen as an ODR violation, but it 
would still be great if it was something GCC could just handle out of the 
box.

Alternatively, a compile-time option to mark non-inlined inline functions as 
weak, or to not generate them at all, when compiling certain files would also 
work.

Best regards
'Allan




Re: New x86-64 micro-architecture levels

2020-07-11 Thread Allan Sandfeld Jensen
On Freitag, 10. Juli 2020 19:30:09 CEST Florian Weimer via Gcc wrote:
> glibc (or an alternative loader implementation) would search for
> libraries starting at level D, going back to level A, and finally the
> baseline implementation in the default library location.
> 
> I expect that some distributions will also use these levels to set a
> baseline for the entire distribution (i.e., everything would be built to
> level A or maybe even level C), and these libraries would then be
> installed in the default location.
> 
> I'll be glad if I can get any feedback on this proposal.  I plan to turn
> it into a merge request for the x86-64 psABI document eventually.
> 
Sounds good, though if I could dream I would also love a partial replacement 
option, so that you could have a generic x86-64 binary that only had some 
AVX2-optimized replacement functions in a supplementary library.

Perhaps implemented by marking the library as a partial replacement, so the 
dynamic linker would also load the base or lower libraries, except for 
functions already resolved.

You could also add a level E for the AVX512 instructions in Ice Lake and 
above. The VBMI1/2 instructions would likely be useful for autovectorization 
in GCC.

'Allan




Re: issue with behavior change of gcc -r between gcc-8 and gcc-9

2020-04-02 Thread Allan Sandfeld Jensen
On Wednesday, 1 April 2020 19:48:11 CEST Olivier Hainque wrote:
> 
> -r 's business was to arrange for the linker not to
> complain because the closure is incomplete, leaving us
> with complete control of the closure.
> 
> It doesn't seem to me there was a really strong motivation
> to suddenly have -r influence the closure the way it now does.
> 
> Would it be possible to revert to the previous behavior
> and document it ?
> 
> Or maybe allow it to be controllable by the target ports ?
> 
> Or provide something to bring back the flexibility we had
> if we really believe the default should change ? (I'm not
> convinced)

-r is used for relinking. The idea behind the change was to make it directly 
suitable for that: it takes object files and relinks them into a new object 
file, giving the caller complete control.

It sounds like you are missing some way to add startfiles? A reverse of 
-nostartfiles?

But hopefully you can just use the linker directly? Unless you have 
LTO-enabled object files, you don't need the compiler to link.

'Allan




Re: Warning on move and dereference of unique_ptr in the same expression

2020-02-03 Thread Allan Sandfeld Jensen
On Montag, 3. Februar 2020 21:47:13 CET Marek Polacek wrote:
> On Mon, Feb 03, 2020 at 09:26:40PM +0100, Allan Sandfeld Jensen wrote:
> > Hello gcc
> > 
> > I have now twice hit obscure bugs in Chromium that crashed on some
> > compilers but not on others, and didn't produce any warnings on any
> > compiler. I would like to know if this code is as undefined as I think it
> > is, and if it would make sense to have gcc warn about it.
> > 
> > Both cases basically has this form:
> > 
> > std::unique_ptr a;
> > 
> > a->b->callMethod(something, bind(callback, std::move(a)));
> > 
> > This crashed with MSVC and gcc 5, but not with newer gcc or with clang.
> 
> You mean the application itself, not the compiler, presumably.
Of course.

> 
> > When it crashes it is because the arguments and the move therein have been
> > evaluated before a->b is resolved.
> > 
> > I assume this is undefined behavior? So why isn't the warning for using
> > and
> > modifying in the same expression triggered?
> 
> This should be defined in C++17, with P0145 in particular:
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0145r3.pdf
> which says that the expression that names the function is sequenced before
> every argument expression and every default argument.
> 
Right, thanks, that would explain why it works consistently with the latest 
gcc versions. I guess it is more of a corner case to ask for it to be warned 
about in --std=c++14 mode?

'Allan




Warning on move and dereference of unique_ptr in the same expression

2020-02-03 Thread Allan Sandfeld Jensen
Hello gcc

I have now twice hit obscure bugs in Chromium that crashed on some compilers 
but not on others, and didn't produce any warnings on any compiler. I would 
like to know if this code is as undefined as I think it is, and if it would 
make sense to have gcc warn about it.

Both cases basically have this form:

std::unique_ptr a;

a->b->callMethod(something, bind(callback, std::move(a)));

This crashed with MSVC and gcc 5, but not with newer gcc or with clang.

When it crashes it is because the arguments and the move therein have been 
evaluated before a->b is resolved.

I assume this is undefined behavior? So why isn't the warning for using and 
modifying in the same expression triggered?

Best regards
'Allan




Re: GCC Multi-Threading Ideas

2020-01-24 Thread Allan Sandfeld Jensen
On Freitag, 24. Januar 2020 17:29:06 CET Nicholas Krause wrote:
> On 1/24/20 3:18 AM, Allan Sandfeld Jensen wrote:
> > On Freitag, 24. Januar 2020 04:38:48 CET Nicholas Krause wrote:
> >> On 1/23/20 12:19 PM, Nicholas Krause wrote:
> >>> On 1/23/20 3:39 AM, Allan Sandfeld Jensen wrote:
> >>>> On Montag, 20. Januar 2020 20:26:46 CET Nicholas Krause wrote:
> >>>>> Greetings All,
> >>>>> 
> >>>>> Unfortunately due to me being rather busy with school and other
> >>>>> things I
> >>>>> will not be able to post my article to the wiki for awhile. However
> >>>>> there is a  rough draft here:
> >>>>> https://docs.google.com/document/d/1po_RRgSCtRyYgMHjV0itW8iOzJXpTdHYIp
> >>>>> C9
> >>>>> gUMj
> >>>>> 
> >>>>> Oxk/edit that may change a little for people to read in the meantime.
> >>>> 
> >>>> This comment might not be suited for your project, but now that I
> >>>> think about
> >>>> it: If we want to improve gcc toolchain buildspeed with better
> >>>> multithreading.
> >>>> I think the most sensible would be fixing up gold multithreading and
> >>>> enabling
> >>>> it by default. We already get most of the benefits of multicore
> >>>> architectures
> >>>> by running multiple compile jobs in parallel (yes, I know you are
> >>>> focusing on
> >>>> cases where that for some reason doesn't work, but it is still the
> >>>> case in
> >>>> most situations). The main bottleneck is linking. The code is even
> >>>> already
> >>>> there in gold and have been for years, it just haven't been deemed
> >>>> ready for
> >>>> being enabled by default.
> >>>> 
> >>>> Is anyone even working on that?
> >>>> 
> >>>> Best regards
> >>>> Allan
> >>> 
> >>> Allan,
> >>> You would need both depending on the project, some are more compiler
> >>> bottle necked and others linker. I mentioned that issue at Cauldron as
> >>> the other side would be the linker.
> >>> 
> >>> Nick
> >> 
> >> Sorry for the second message Allan but make -j does not scale well
> >> beyond 4 or
> >> 8 threads and that's considering a 4 core or 8 machine.
> > 
> > It doesn't? I generally build with -j100, but then also use icecream to
> > distribute builds to multiple machines in the office. That probably also
> > makes linking times more significant to my case.
> > 
> > 'Allan
> 
> Allan,
> 
> I ran a gcc build on a machine with make -j32 and -j64 that had 64 cores.
> There was literally only a 4 minute improvement in build time. Good question
> though.
> 
Right. I guess it entirely depends on what you are building. If you are 
building gcc, it is probably bound by multiple configure runs, and separate 
iterations. What I usually build is Qt and Chromium, where thousands of files 
can be compiled from a single configure run (more than 2 in the case of 
Chromium), plus those configure runs are much faster. For Chromium there is 
almost a linear speed up with the number of parallel jobs you run up to around 
100. With -j100 I can build Chromium in 10 minutes, with 2 minutes being 
linking time (5 minutes linking if using bfd linker). With -j8 it takes 2 
hours.

But I guess that means multithreading the compiler can make sense in your 
case, even if it doesn't in mine.

Regards
'Allan




Re: GCC Multi-Threading Ideas

2020-01-24 Thread Allan Sandfeld Jensen
On Freitag, 24. Januar 2020 04:38:48 CET Nicholas Krause wrote:
> On 1/23/20 12:19 PM, Nicholas Krause wrote:
> > On 1/23/20 3:39 AM, Allan Sandfeld Jensen wrote:
> >> On Montag, 20. Januar 2020 20:26:46 CET Nicholas Krause wrote:
> >>> Greetings All,
> >>> 
> >>> Unfortunately due to me being rather busy with school and other
> >>> things I
> >>> will not be able to post my article to the wiki for awhile. However
> >>> there is a  rough draft here:
> >>> https://docs.google.com/document/d/1po_RRgSCtRyYgMHjV0itW8iOzJXpTdHYIpC9
> >>> gUMj
> >>> 
> >>> Oxk/edit that may change a little for people to read in the meantime.
> >> 
> >> This comment might not be suited for your project, but now that I
> >> think about
> >> it: If we want to improve gcc toolchain buildspeed with better
> >> multithreading.
> >> I think the most sensible would be fixing up gold multithreading and
> >> enabling
> >> it by default. We already get most of the benefits of multicore
> >> architectures
> >> by running multiple compile jobs in parallel (yes, I know you are
> >> focusing on
> >> cases where that for some reason doesn't work, but it is still the
> >> case in
> >> most situations). The main bottleneck is linking. The code is even
> >> already
> >> there in gold and have been for years, it just haven't been deemed
> >> ready for
> >> being enabled by default.
> >> 
> >> Is anyone even working on that?
> >> 
> >> Best regards
> >> Allan
> > 
> > Allan,
> > You would need both depending on the project, some are more compiler
> > bottle necked and others linker. I mentioned that issue at Cauldron as
> > the other side would be the linker.
> > 
> > Nick
> 
> Sorry for the second message Allan, but make -j does not scale well beyond
> 4 or 8 threads, and that's considering a 4- or 8-core machine.

It doesn't? I generally build with -j100, but then I also use icecream to 
distribute builds to multiple machines in the office. That probably also 
makes linking times more significant in my case.

'Allan






Re: GCC Multi-Threading Ideas

2020-01-23 Thread Allan Sandfeld Jensen
On Montag, 20. Januar 2020 20:26:46 CET Nicholas Krause wrote:
> Greetings All,
> 
> Unfortunately due to me being rather busy with school and other things I
> will not be able to post my article to the wiki for awhile. However
> there is a  rough draft here:
> https://docs.google.com/document/d/1po_RRgSCtRyYgMHjV0itW8iOzJXpTdHYIpC9gUMj
> Oxk/edit that may change a little for people to read in the meantime.
> 
This comment might not be suited for your project, but now that I think about 
it: if we want to improve gcc toolchain build speed with better multithreading, 
I think the most sensible thing would be fixing up gold's multithreading and 
enabling it by default. We already get most of the benefits of multicore 
architectures by running multiple compile jobs in parallel (yes, I know you are 
focusing on cases where that for some reason doesn't work, but it is still the 
case in most situations). The main bottleneck is linking. The code is even 
already there in gold and has been for years; it just hasn't been deemed ready 
to be enabled by default.

Is anyone even working on that?

Best regards
Allan




Re: C2X Proposal, merge '.' and '->' C operators

2019-12-21 Thread Allan Sandfeld Jensen
On Monday, 16 December 2019 14:51:38 CET J Decker wrote:
> Here's the gist of what I would propose...
> https://gist.github.com/d3x0r/f496d0032476ed8b6f980f7ed31280da
> 
> In C, there are two operators . and -> used to access members of struct and
> union types. These operators are specified such that they are always paired
> in usage; for example, if the left hand expression is a pointer to a struct
> or union, then the operator -> MUST be used. There is no occasion where .
> and -> may be interchanged, given the existing specification.
> 
> It should be very evident to the compiler whether the token before '.' or
> '->' is a pointer to a struct/union or a struct/union, and just build the
> appropriate output.
> 
> The source modification for the compiler is very slight, even depending on
> flag_c2x(that's not it's name).  It ends up changing a lot of existing
> lines, just to change their indentation; but that shouldn't really count
> against 'changed lines'.
> 
> I'm sure, after 4 score and some years ('78-19) that it must surely have
> come up before?  Anyone able to point me to those existing proposals?
> 
What if you operate on a pointer to a pointer to a struct? Should the same 
operator just magically dereference everything until it is a struct?

I disagree with this proposal because separating a thing from a pointer to a 
thing is fundamental to C/C++, and providing shortcuts that confuse the two 
does a disservice to anyone who needs to learn the language.

Besides, isn't this the wrong mailing list for this?

'Allan




Re: [PATCH] Come up with -flto=auto option.

2019-07-24 Thread Allan Sandfeld Jensen
On Mittwoch, 24. Juli 2019 08:45:21 CEST Martin Liška wrote:
> On 7/24/19 12:11 AM, Allan Sandfeld Jensen wrote:
> > On Dienstag, 23. Juli 2019 10:30:07 CEST Martin Liška wrote:
> >> Hi.
> >> 
> >> As we as openSUSE started using -flto, I see it very handy to have
> >> an option value that will automatically detect number of cores
> >> that can be used for parallel LTRANS phase.
> >> 
> >> Thoughts?
> > 
> > That's really nice.
> > 
> > How much extra work would it be to make it support a posix make jobserver?
> 
> We do support it via -flto=jobserver:
> 
Good to know :)

> 
> Problem is that nowadays you how much more common make systems like ninja,
> meson and others that probably do not support that.
> 
There are patches to enable it in ninja, and I know some Linux distros apply 
the patches by default. Though that is only the listening side, so it probably 
requires launching ninja from make if you want to be able to pass the 
jobserver on to gcc.

'Allan




Re: [PATCH] Come up with -flto=auto option.

2019-07-23 Thread Allan Sandfeld Jensen
On Dienstag, 23. Juli 2019 10:30:07 CEST Martin Liška wrote:
> Hi.
> 
> As we at openSUSE started using -flto, I see it as very handy to have
> an option value that will automatically detect number of cores
> that can be used for parallel LTRANS phase.
> 
> Thoughts?
> 
That's really nice. 

How much extra work would it be to make it support a posix make jobserver? 

As far as I understand, you would need to guess a partition size first (as 
your patch here does), but then only start each job when given a token from 
the jobserver FD.

With that the integration to existing build infrastructure would be optimal.

Cheers
'Allan




Re: On -march=x86-64

2019-07-14 Thread Allan Sandfeld Jensen
On Donnerstag, 11. Juli 2019 20:58:04 CEST Allan Sandfeld Jensen wrote:
> Years ago I discovered Chrome was optimizing with  -march=x86-64, and
> knowing was an undocumented arch that would optimize for K8 I laughed at it
> and just removed that piece of idiocy from our fork of Chromium, so it
> would be faster than upstream. Recently though I noticed phoronix is also
> using now sometimes optimize with -march=x86-64 instead of using, well..
> nothing. as they should. And checking recent GCC documentation I noticed
> that in gcc 8 and gcc 9 documentation you are now documenting -march=x86-64
> and calling it a generic 64-bit processor. So now it is no longer a
> laughing matter that people mistake it for that.
> 
> I would suggest instead of fixing the documentation to say what
> -march=x86-64 actually does, that we should perhaps change it to do what
> people expect and make it an alias for generic?
> 
Never mind, it was fixed back in 2007 when the generic architecture was 
introduced. The arch table is just misleading here.

Best regards
'Allan




[Patch] Make x86-64 a generic architecture as documented

2019-07-13 Thread Allan Sandfeld Jensen
Hello

Changing -march=x86-64 behaviour to match documentation and common usage.

Question: Is this also good for -m32 -march=x86-64 or will generic then mean 
something else?

Thanks,
Allan

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 364466b6b6f..90f9cbc3c35 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2019-07-13  Allan Sandfeld Jensen 
+
+   * gcc/common/config/i386/i386-common.c (processor_alias_table): Change
+   x86-64 architecture to use generic tuning and scheduling instead of 
K8.
+
 2019-07-11  Jakub Jelinek  

PR target/91124
diff --git a/gcc/common/config/i386/i386-common.c b/gcc/common/config/i386/
i386-common.c
index a394f874fe4..19ace226190 100644
--- a/gcc/common/config/i386/i386-common.c
+++ b/gcc/common/config/i386/i386-common.c
@@ -1673,7 +1673,7 @@ const pta processor_alias_table[] =
 PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE | PTA_FXSR},
   {"athlon-mp", PROCESSOR_ATHLON, CPU_ATHLON,
 PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE | PTA_FXSR},
-  {"x86-64", PROCESSOR_K8, CPU_K8,
+  {"x86-64", PROCESSOR_GENERIC, CPU_GENERIC,
 PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_NO_SAHF | PTA_FXSR},
   {"eden-x2", PROCESSOR_K8, CPU_K8,
 PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_FXSR},





On -march=x86-64

2019-07-11 Thread Allan Sandfeld Jensen
Years ago I discovered Chrome was optimizing with -march=x86-64, and knowing 
it was an undocumented arch that would optimize for K8, I laughed at it and 
just removed that piece of idiocy from our fork of Chromium, so it would be 
faster than upstream. Recently, though, I noticed Phoronix now also sometimes 
optimizes with -march=x86-64 instead of using, well, nothing, as they should. 
And checking recent GCC documentation, I noticed that the gcc 8 and gcc 9 
documentation now describes -march=x86-64 and calls it a generic 64-bit 
processor. So it is no longer a laughing matter that people mistake it for 
that.

I would suggest that instead of fixing the documentation to say what 
-march=x86-64 actually does, we should perhaps change it to do what people 
expect and make it an alias for generic?

Best regards
'Allan





Re: GCC turns &~ into | due to undefined bit-shift without warning

2019-03-22 Thread Allan Sandfeld Jensen
On Freitag, 22. März 2019 14:38:10 CET Andrew Haley wrote:
> On 3/22/19 10:20 AM, Allan Sandfeld Jensen wrote:
> > On Freitag, 22. März 2019 11:02:39 CET Andrew Haley wrote:
> >> On 3/21/19 10:19 PM, Allan Sandfeld Jensen wrote:
> >>> From having fixed UBSAN warnings, I have seen many cases where undefined
> >>> behavior was performed, but where the code was aware of it and the final
> >>> result of the expression was well defined nonetheless.
> >> 
> >> Is this belief about undefined behaviour commonplace among C programmers?
> >> There's nothing in the standard to justify it: any expression which
> >> contains UB is undefined.
> > 
> > Yes, even GCC uses undefined behavior when it is considered defined for
> > specific architecture,
> 
> If it's defined for a specific architecture it's not undefined. Any compiler
> is entitled to do anything with UB, and "anything" includes extending the
> language to make it well defined.

True, but in the context of "things UBSAN warns about", that includes 
architecture-specific details.

And isn't unaligned access real undefined behavior that just happens to work 
on x86 (and newer ARM)?

There is also stuff like type-punning unions, which is not 
architecture-specific and is technically undefined, but which GCC explicitly 
tolerates (and needs to, since some NEON intrinsics use it).

'Allan






Re: GCC turns &~ into | due to undefined bit-shift without warning

2019-03-22 Thread Allan Sandfeld Jensen
On Freitag, 22. März 2019 11:02:39 CET Andrew Haley wrote:
> On 3/21/19 10:19 PM, Allan Sandfeld Jensen wrote:
> > From having fixed UBSAN warnings, I have seen many cases where undefined
> > behavior was performed, but where the code was aware of it and the final
> > result of the expression was well defined nonetheless.
> 
> Is this belief about undefined behaviour commonplace among C programmers?
> There's nothing in the standard to justify it: any expression which contains
> UB is undefined.

Yes, even GCC uses undefined behavior when it is considered defined for a 
specific architecture, whether it be the result of unaligned access, negative 
shifts, etc. A lot of the things UBSAN warns about can be found in GCC 
itself, the Linux kernel, and many other places.

'Allan





Re: GCC turns &~ into | due to undefined bit-shift without warning

2019-03-22 Thread Allan Sandfeld Jensen
On Donnerstag, 21. März 2019 23:31:48 CET Jakub Jelinek wrote:
> On Thu, Mar 21, 2019 at 11:19:54PM +0100, Allan Sandfeld Jensen wrote:
> > Hmm, I am curious. How strongly would gcc assume x is 0?
> 
> If x is not 0, then it is undefined behavior and anything can happen,
> so yes, it can assume x is 0, sometimes gcc does that, sometimes not,
> it is not required to do that.
> 
> > From having fixed UBSAN warnings, I have seen many cases where undefined
> > behavior was performed, but where the code was aware of it and the final
> 
> Any program where it printed something (talking about -fsanitize=undefined,
> not the few sanitizers that go beyond what is required by the language)
> is undefined, period.  It can happen to "work" as some users expect, it can
> crash, it can format your disk or anything else.  There is no well defined
> after a process runs into UB.
> 
That's nonsense and you know it. There are plenty of things that are 
undefined by the C standard that we rely on anyway.

But getting back to the question: will GCC carry such information further, 
and thus break code that otherwise behaves correctly on all known 
architectures, just because the C standard hasn't decided on one of two 
possible results?

'Allan





Re: GCC turns &~ into | due to undefined bit-shift without warning

2019-03-21 Thread Allan Sandfeld Jensen
On Montag, 11. März 2019 10:14:49 CET Jakub Jelinek wrote:
> On Mon, Mar 11, 2019 at 08:49:30AM +, Moritz Strübe wrote:
> > Considering that C11 6.5.7#3 ("If  the  value  of  the  right operand 
> > is  negative  or  is greater than or equal to the width of the promoted
> > left operand, the behavior is undefined.") is not very widely known, as
> > it "normally" just works, inverting the intent is quite unexpected.
> > 
> > Is there any option that would have helped me with this?
> 
> You could build with -fsanitize=undefined, that would tell you at runtime
> you have undefined behavior in your code (if the SingleDiff has bit ever
> 0x20 set).
> 
> The fact that negative or >= bit precision shifts are UB is widely known,
> and even if it wouldn't, for the compiler all the UBs are just UBs, the
> compiler optimizes on the assumption that UB does not happen, so when it
> sees 32-bit int << (x & 32), it can assume x must be 0 at that point,
> anything else is UB.
> 
Hmm, I am curious: how strongly would gcc assume x is 0?

What if you have some expression that is undefined if x is not zero, but x 
really isn't zero, so the result is temporarily undefined, and then another 
statement or part of the expression fixes the final result to something 
defined regardless of the intermediate? Would the compiler assume that the 
intermediate value is never undefined, and possibly carry that analysis over 
into other expressions?

From having fixed UBSAN warnings, I have seen many cases where undefined 
behavior was performed, but where the code was aware of it and the final 
result of the expression was well defined nonetheless.

'Allan




Re: Enabling LTO for target libraries (e.g., libgo, libstdc++)

2019-01-25 Thread Allan Sandfeld Jensen
On Freitag, 25. Januar 2019 07:22:36 CET Nikhil Benesch wrote:
> Does anyone have advice to offer? Has anyone tried convincing the build
> system to compile some of the other target libraries (like libstdc++ or
> libgfortran) with -flto?
> 
Make sure the static versions of the libraries are partially linked before 
being archived, so the final libraries do not contain LTO code. Roughly 
(with your object files in place of <objects>):

gcc -r -o libname.o <objects>
ar rcs libname.a libname.o

Never done it with gcc libraries though.

'Allan




Re: [Patch][GCC] Document and fix -r (partial linking)

2018-09-01 Thread Allan Sandfeld Jensen
On Montag, 27. August 2018 15:37:15 CEST Joseph Myers wrote:
> On Sun, 26 Aug 2018, Allan Sandfeld Jensen wrote:
> > Patch updated. I specifically edited a number of the existing tests that
> > used both -r and -nostdlib and removed -nostdlib so the patch is
> > exercised by existing tests. The patch bootstrapped, I didn't notice any
> > relevant failures when running the test suite (though I could have missed
> > something, I am never comfortable reading that output).
> 
> Note that Iain's comments also included that the patch is incomplete
> because of more specs in gcc.c (VTABLE_VERIFICATION_SPEC,
> SANITIZER_EARLY_SPEC, SANITIZER_SPEC) that needs corresponding updates to
> handle -r like -nostdlib.

Updated (but tests not rerun).

From 1d164bced7979c94767c260174e3c486d4fc8c5d Mon Sep 17 00:00:00 2001
From: Allan Sandfeld Jensen 
Date: Sat, 1 Sep 2018 12:59:14 +0200
Subject: [PATCH] Fix and document -r option

The option has existed and been working for years,
make sure it implies the right extra options, and list
it in the documentation.

2018-09-01 Allan Sandfeld Jensen 

gcc/doc/
* invoke.texi: Document -r.

gcc/
* gcc.c (LINK_COMMAND_SPEC): Handle -r like -nostdlib.
(VTABLE_VERIFICATION_SPEC): Ditto
(SANITIZER_EARLY_SPEC): Ditto
(SANITIZER_SPEC): Ditto
* config/darwin.h (LINK_COMMAND_SPEC): Ditto
* cp/g++spec.c (lang_specific_driver): Ditto
* fortran/gfortranspec.c (lang_specific_driver): Ditto
* go/gospec.c (lang_specific_driver): Ditto

gcc/testsuite/
* g++.dg/ipa/pr64059.C: Removed now redundant -nostdlib.
* g++.dg/lto/20081109-1_0.C: Ditto
* g++.dg/lto/20090302_0.C: Ditto
* g++.dg/lto/pr45621_0.C: Ditto
* g++.dg/lto/pr60567_0.C: Ditto
* g++.dg/lto/pr62026.C: Ditto
* gcc.dg/lto/pr45736_0.c: Ditto
* gcc.dg/lto/pr52634_0.c: Ditto
* gfortran.dg/lto/20091016-1_0.f90: Ditto
* gfortran.dg/lto/pr79108_0.f90: Ditto
---
 gcc/config/darwin.h|  8 
 gcc/cp/g++spec.c   |  1 +
 gcc/doc/invoke.texi|  7 ++-
 gcc/fortran/gfortranspec.c |  1 +
 gcc/gcc.c  | 18 +-
 gcc/go/gospec.c|  1 +
 gcc/testsuite/g++.dg/ipa/pr64059.C |  2 +-
 gcc/testsuite/g++.dg/lto/20081109-1_0.C|  2 +-
 gcc/testsuite/g++.dg/lto/20090302_0.C  |  2 +-
 gcc/testsuite/g++.dg/lto/pr45621_0.C   |  2 +-
 gcc/testsuite/g++.dg/lto/pr60567_0.C   |  2 +-
 gcc/testsuite/g++.dg/lto/pr62026.C |  2 +-
 gcc/testsuite/gcc.dg/lto/pr45736_0.c   |  2 +-
 gcc/testsuite/gcc.dg/lto/pr52634_0.c   |  2 +-
 gcc/testsuite/gfortran.dg/lto/20091016-1_0.f90 |  2 +-
 gcc/testsuite/gfortran.dg/lto/pr79108_0.f90|  2 +-
 16 files changed, 32 insertions(+), 24 deletions(-)

diff --git a/gcc/config/darwin.h b/gcc/config/darwin.h
index cd6d6521658..87f610259c0 100644
--- a/gcc/config/darwin.h
+++ b/gcc/config/darwin.h
@@ -180,20 +180,20 @@ extern GTY(()) int darwin_ms_struct;
"%X %{s} %{t} %{Z} %{u*} \
 %{e*} %{r} \
 %{o*}%{!o:-o a.out} \
-%{!nostdlib:%{!nostartfiles:%S}} \
+%{!nostdlib:%{!r:%{!nostartfiles:%S}}} \
 %{L*} %(link_libgcc) %o %{fprofile-arcs|fprofile-generate*|coverage:-lgcov} \
 %{fopenacc|fopenmp|%:gt(%{ftree-parallelize-loops=*:%*} 1): \
   %{static|static-libgcc|static-libstdc++|static-libgfortran: libgomp.a%s; : -lgomp } } \
 %{fgnu-tm: \
   %{static|static-libgcc|static-libstdc++|static-libgfortran: libitm.a%s; : -litm } } \
-%{!nostdlib:%{!nodefaultlibs:\
+%{!nostdlib:%{!r:%{!nodefaultlibs:\
   %{%:sanitize(address): -lasan } \
   %{%:sanitize(undefined): -lubsan } \
   %(link_ssp) \
   " DARWIN_EXPORT_DYNAMIC " %

Re: [Patch][GCC] Document and fix -r (partial linking)

2018-09-01 Thread Allan Sandfeld Jensen
On Montag, 27. August 2018 15:37:15 CEST Joseph Myers wrote:
> On Sun, 26 Aug 2018, Allan Sandfeld Jensen wrote:
> > Patch updated. I specifically edited a number of the existing tests that
> > used both -r and -nostdlib and removed -nostdlib so the patch is
> > exercised by existing tests. The patch bootstrapped, I didn't notice any
> > relevant failures when running the test suite (though I could have missed
> > something, I am never comfortable reading that output).
> 
> Note that Iain's comments also included that the patch is incomplete
> because of more specs in gcc.c (VTABLE_VERIFICATION_SPEC,
> SANITIZER_EARLY_SPEC, SANITIZER_SPEC) that needs corresponding updates to
> handle -r like -nostdlib.

Okay, I can add that, or whoever commits the patch can add it. We can also 
improve the feature if we discover more places that need updating. Do you 
want me to post a version updated with these two places?

'Allan




Re: [Patch][GCC] Document and fix -r (partial linking)

2018-08-26 Thread Allan Sandfeld Jensen
On Donnerstag, 23. August 2018 23:24:02 CEST Joseph Myers wrote:
> On Thu, 23 Aug 2018, Iain Sandoe wrote:
> > Joseph: As a side-comment, is there a reason that we don’t exclude
> > gomp/itm/fortran/gcov from the link for -nostdlib / -nodefaultlib?
> > 
> > If we are relying on the lib self-specs for this, then we’re not
> > succeeding since the one we build at the moment don’t include those
> > clauses.
> 
> Well, fortran/gfortranspec.c for example has
> 
> case OPT_nostdlib:
> case OPT_nodefaultlibs:
> case OPT_c:
> case OPT_S:
> case OPT_fsyntax_only:
> case OPT_E:
>   /* These options disable linking entirely or linking of the
>  standard libraries.  */
>   library = 0;
>   break;
> 
> and only uses libgfortran.spec if (library).  So it's certainly meant to
> avoid linking with libgfortran or its dependencies if -nostdlib.

Patch updated. I specifically edited a number of the existing tests that used 
both -r and -nostdlib and removed -nostdlib, so the patch is exercised by 
existing tests. The patch bootstrapped; I didn't notice any relevant failures 
when running the test suite (though I could have missed something, I am never 
comfortable reading that output).

'Allan

From 07ed41a9afd107c5d45feb1ead7a74ca735a1bb2 Mon Sep 17 00:00:00 2001
From: Allan Sandfeld Jensen 
Date: Sun, 26 Aug 2018 20:02:54 +0200
Subject: [PATCH] Fix and document -r option

The option has existed and been working for years,
make sure it implies the right extra options, and list
it in the documentation.

2018-08-26 Allan Sandfeld Jensen 

gcc/doc/
* invoke.texi: Document -r.

gcc/
* gcc.c (LINK_COMMAND_SPEC): Handle -r like -nostdlib.
* config/darwin.h (LINK_COMMAND_SPEC): Handle -r like -nostdlib.
* cp/g++spec.c (lang_specific_driver): Handle -r like -nostdlib.
* fortran/gfortranspec.c (lang_specific_driver): Handle -r like -nostdlib.
* go/gospec.c (lang_specific_driver): Handle -r like -nostdlib.

gcc/testsuite/
* g++.dg/ipa/pr64059.C: Removed now redundant -nostdlib.
* g++.dg/lto/20081109-1_0.C: Removed now redundant -nostdlib.
* g++.dg/lto/20090302_0.C: Removed now redundant -nostdlib.
* g++.dg/lto/pr45621_0.C: Removed now redundant -nostdlib.
* g++.dg/lto/pr60567_0.C: Removed now redundant -nostdlib.
* g++.dg/lto/pr62026.C: Removed now redundant -nostdlib.
* gcc.dg/lto/pr45736_0.c: Removed now redundant -nostdlib.
* gcc.dg/lto/pr52634_0.c: Removed now redundant -nostdlib.
* gfortran.dg/lto/20091016-1_0.f90: Removed now redundant -nostdlib.
* gfortran.dg/lto/pr79108_0.f90: Removed now redundant -nostdlib.

---
 gcc/config/darwin.h| 8 
 gcc/cp/g++spec.c   | 1 +
 gcc/doc/invoke.texi| 7 ++-
 gcc/fortran/gfortranspec.c | 1 +
 gcc/gcc.c  | 6 +++---
 gcc/go/gospec.c| 1 +
 gcc/testsuite/g++.dg/ipa/pr64059.C | 2 +-
 gcc/testsuite/g++.dg/lto/20081109-1_0.C| 2 +-
 gcc/testsuite/g++.dg/lto/20090302_0.C  | 2 +-
 gcc/testsuite/g++.dg/lto/pr45621_0.C   | 2 +-
 gcc/testsuite/g++.dg/lto/pr60567_0.C   | 2 +-
 gcc/testsuite/g++.dg/lto/pr62026.C | 2 +-
 gcc/testsuite/gcc.dg/lto/pr45736_0.c   | 2 +-
 gcc/testsuite/gcc.dg/lto/pr52634_0.c   | 2 +-
 gcc/testsuite/gfortran.dg/lto/20091016-1_0.f90 | 2 +-
 gcc/testsuite/gfortran.dg/lto/pr79108_0.f90| 2 +-
 16 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/gcc/config/darwin.h b/gcc/config/darwin.h
index cd6d6521658..87f610259c0 100644
--- a/gcc/config/darwin.h
+++ b/gcc/config/darwin.h
@@ -180,20 +180,20 @@ extern GTY(()) int darwin_ms_struct;
"%X %{s} %{t} %{Z} %{u*} \
 %{e*} %{r} \
 %{o*}%{!o:-o a.out} \
-%{!nostdlib:%{!nostartfiles:%S}} \
+%{!nostdlib:%{!r:%{!nostartfiles:%S}}} \
 %{L*} %(link_libgcc) %o %{fprofile-arcs|fprofile-generate*|coverage:-lgcov} \
 %{fopenacc|fopenmp|%:gt(%{ftree-parallelize-loops=*:%*} 1): \
   %{static|static-libgcc|static-libstdc++|static-libgfortran: libgomp.a%s; : -lgomp } } \
 %{fgnu-tm: \
   %{static|static-libgcc|static-libstdc++|static-libgfortran: libitm.a%s; : -litm } } \
-%{!nostdlib:%{!nodefaultlibs:\
+%{!nostdlib:%{!r:%{!nodefaultlibs:\
   %{%:sanitize(address): -lasan } \
   %{%:sanitize(undefined): -lubsan } \
   %(link_ssp) \
   " DARWIN_EXPORT_DYNAMIC " %

Re: [Patch][GCC] Document and fix -r (partial linking)

2018-08-26 Thread Allan Sandfeld Jensen
On Dienstag, 21. August 2018 00:38:58 CEST Joseph Myers wrote:
> On Fri, 3 Aug 2018, Allan Sandfeld Jensen wrote:
> > > I think you're changing the wrong place for this.  If you want -r to be
> > > usable with GCC without using -nostdlib (which is an interesting
> > > question), you actually need to change LINK_COMMAND_SPEC (also sometimes
> > > overridden for targets) to handle -r more like -nostdlib -nostartfiles.
> > 
> > Okay, so like this?
> 
> Could you confirm if this has passed a bootstrap and testsuite run, with
> no testsuite regressions compared to GCC without the patch applied?  I
> think it looks mostly OK (modulo ChangeLog rewrites and a missing second
> space after '.' in the manual change) but I'd like to make sure it's
> passed the usual testing before preparing it for commit.

I didn't think of running the tests since it only affects command line 
options, but it did bootstrap and behaved as expected for my specific use case.

I will update and run the tests when I have time.




Re: [PATCH][x86] Match movss and movsd "blend" instructions

2018-08-11 Thread Allan Sandfeld Jensen
On Samstag, 11. August 2018 11:18:39 CEST Jakub Jelinek wrote:
> On Sat, Aug 11, 2018 at 10:59:26AM +0200, Allan Sandfeld Jensen wrote:
> > +/* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to implement D
> > +   using movss or movsd.  */
> > +static bool
> > +expand_vec_perm_movs (struct expand_vec_perm_d *d)
> > +{
> > +  machine_mode vmode = d->vmode;
> > +  unsigned i, nelt = d->nelt;
> > +  rtx x;
> > +
> > +  if (d->one_operand_p)
> > +return false;
> > +
> > +  if (TARGET_SSE2 && (vmode == V2DFmode || vmode == V4SFmode))
> > +;
> > +  else
> > +return false;
> > +
> > +  /* Only the first element is changed. */
> 
> Two spaces after .
> 
> > +  if (d->perm[0] != nelt && d->perm[0] != 0)
> > +return false;
> > +  for (i = 1; i < nelt; ++i) {
> > +{
> > +  if (d->perm[i] != i + nelt - d->perm[0])
> > +return false;
> > +}
> > +  }
> 
> Extraneous {}s (both pairs, the outer ones even badly indented).
> 
> Otherwise LGTM.
> 
Updated:

Note that as an infrequent contributor I don't have commit access, so I need 
someone reviewing this to also commit it.

'Allan
From e33241e5ddc7fa57c4ba7893669af7f7e636125e Mon Sep 17 00:00:00 2001
From: Allan Sandfeld Jensen 
Date: Sat, 11 Aug 2018 11:52:21 +0200
Subject: [PATCH] Match movss and movsd "blend" instructions

Adds the ability to match movss and movsd as blend patterns,
implemented in a new method to be able to match these before shuffles,
while keeping other blends after.

2018-08-11 Allan Sandfeld Jensen 

gcc/config/i386

* i386.cc (expand_vec_perm_movs): New method matching movs
patterns.
* i386.cc (expand_vec_perm_1): Try the new method.

gcc/testsuite

* gcc.target/i386/sse2-movs.c: New test.
---
 gcc/config/i386/emmintrin.h |  2 +-
 gcc/config/i386/i386.c  | 41 +
 gcc/config/i386/xmmintrin.h |  5 -
 3 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/emmintrin.h b/gcc/config/i386/emmintrin.h
index b940a39d27b..6501638f619 100644
--- a/gcc/config/i386/emmintrin.h
+++ b/gcc/config/i386/emmintrin.h
@@ -113,7 +113,7 @@ _mm_setzero_pd (void)
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_move_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d) __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B);
+  return __extension__ (__m128d) __builtin_shuffle((__v2df)__A, (__v2df)__B, (__v2di){2, 1});
 }
 
 /* Load two DPFP values from P.  The address must be 16-byte aligned.  */
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7554fd1f659..15a3caa94c3 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -46145,6 +46145,43 @@ expand_vselect_vconcat (rtx target, rtx op0, rtx op1,
   return ok;
 }
 
+/* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to implement D
+   using movss or movsd.  */
+static bool
+expand_vec_perm_movs (struct expand_vec_perm_d *d)
+{
+  machine_mode vmode = d->vmode;
+  unsigned i, nelt = d->nelt;
+  rtx x;
+
+  if (d->one_operand_p)
+return false;
+
+  if (TARGET_SSE2 && (vmode == V2DFmode || vmode == V4SFmode))
+;
+  else
+return false;
+
+  /* Only the first element is changed.  */
+  if (d->perm[0] != nelt && d->perm[0] != 0)
+return false;
+  for (i = 1; i < nelt; ++i)
+if (d->perm[i] != i + nelt - d->perm[0])
+  return false;
+
+  if (d->testing_p)
+return true;
+
+  if (d->perm[0] == nelt)
+x = gen_rtx_VEC_MERGE (vmode, d->op1, d->op0, GEN_INT (1));
+  else
+x = gen_rtx_VEC_MERGE (vmode, d->op0, d->op1, GEN_INT (1));
+
+  emit_insn (gen_rtx_SET (d->target, x));
+
+  return true;
+}
+
 /* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to implement D
in terms of blendp[sd] / pblendw / pblendvb / vpblendd.  */
 
@@ -46887,6 +46924,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
 	}
 }
 
+  /* Try movss/movsd instructions.  */
+  if (expand_vec_perm_movs (d))
+return true;
+
   /* Finally, try the fully general two operand permute.  */
   if (expand_vselect_vconcat (d->target, d->op0, d->op1, d->perm, nelt,
 			  d->testing_p))
diff --git a/gcc/config/i386/xmmintrin.h b/gcc/config/i386/xmmintrin.h
index f64f3f74a0b..f770570295c 100644
--- a/gcc/config/i386/xmmintrin.h
+++ b/gcc/config/i386/xmmintrin.h
@@ -1011,7 +1011,10 @@ _mm_storer_ps (float *__P, __m128 __A)
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_move_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_movss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_shuffle ((__v4sf)__A, (__v4sf)__B,
+ __extension__
+ (__attribute__((__vector_size__ (16))) int)
+ {4,1,2,3});
 }
 
 /* Extracts one of the four words of A.  The selector N must be immediate.  */
-- 
2.17.1



Re: [PATCH][x86] Match movss and movsd "blend" instructions

2018-08-11 Thread Allan Sandfeld Jensen
Updated:

Match movss and movsd "blend" instructions

Adds the ability to match movss and movsd as blend patterns,
implemented in a new method to be able to match these before shuffles,
while keeping other blends after.

2018-08-11 Allan Sandfeld Jensen 

gcc/config/i386

* i386.cc (expand_vec_perm_movs): New method matching movs
patterns.
* i386.cc (expand_vec_perm_1): Try the new method.

gcc/testsuite

* gcc.target/i386/sse2-movs.c: New test.
diff --git a/gcc/config/i386/emmintrin.h b/gcc/config/i386/emmintrin.h
index b940a39d27b..6501638f619 100644
--- a/gcc/config/i386/emmintrin.h
+++ b/gcc/config/i386/emmintrin.h
@@ -113,7 +113,7 @@ _mm_setzero_pd (void)
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_move_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d) __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B);
+  return __extension__ (__m128d) __builtin_shuffle((__v2df)__A, (__v2df)__B, (__v2di){2, 1});
 }
 
 /* Load two DPFP values from P.  The address must be 16-byte aligned.  */
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7554fd1f659..485850096e9 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -46145,6 +46145,46 @@ expand_vselect_vconcat (rtx target, rtx op0, rtx op1,
   return ok;
 }
 
+/* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to implement D
+   using movss or movsd.  */
+static bool
+expand_vec_perm_movs (struct expand_vec_perm_d *d)
+{
+  machine_mode vmode = d->vmode;
+  unsigned i, nelt = d->nelt;
+  rtx x;
+
+  if (d->one_operand_p)
+return false;
+
+  if (TARGET_SSE2 && (vmode == V2DFmode || vmode == V4SFmode))
+;
+  else
+return false;
+
+  /* Only the first element is changed. */
+  if (d->perm[0] != nelt && d->perm[0] != 0)
+return false;
+  for (i = 1; i < nelt; ++i) {
+{
+  if (d->perm[i] != i + nelt - d->perm[0])
+return false;
+}
+  }
+
+  if (d->testing_p)
+return true;
+
+  if (d->perm[0] == nelt)
+x = gen_rtx_VEC_MERGE (vmode, d->op1, d->op0, GEN_INT (1));
+  else
+x = gen_rtx_VEC_MERGE (vmode, d->op0, d->op1, GEN_INT (1));
+
+  emit_insn (gen_rtx_SET (d->target, x));
+
+  return true;
+}
+
 /* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to implement D
in terms of blendp[sd] / pblendw / pblendvb / vpblendd.  */
 
@@ -46887,6 +46927,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
 	}
 }
 
+  /* Try movss/movsd instructions.  */
+  if (expand_vec_perm_movs (d))
+return true;
+
   /* Finally, try the fully general two operand permute.  */
   if (expand_vselect_vconcat (d->target, d->op0, d->op1, d->perm, nelt,
 			  d->testing_p))
diff --git a/gcc/config/i386/xmmintrin.h b/gcc/config/i386/xmmintrin.h
index f64f3f74a0b..f770570295c 100644
--- a/gcc/config/i386/xmmintrin.h
+++ b/gcc/config/i386/xmmintrin.h
@@ -1011,7 +1011,10 @@ _mm_storer_ps (float *__P, __m128 __A)
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_move_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_movss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_shuffle ((__v4sf)__A, (__v4sf)__B,
+ __extension__
+ (__attribute__((__vector_size__ (16))) int)
+ {4,1,2,3});
 }
 
 /* Extracts one of the four words of A.  The selector N must be immediate.  */


Re: [Patch][GCC] Document and fix -r (partial linking)

2018-08-11 Thread Allan Sandfeld Jensen
On Freitag, 3. August 2018 13:56:12 CEST Allan Sandfeld Jensen wrote:
> On Mittwoch, 1. August 2018 18:32:30 CEST Joseph Myers wrote:
> > On Wed, 1 Aug 2018, Allan Sandfeld Jensen wrote:
> > > gcc/
> > > 
> > > * gcc.c: Correct default specs for -r
> > 
> > I don't follow why your changes (which would need describing for each
> > individual spec changed) are corrections.
> > 
> > >  /* config.h can define LIB_SPEC to override the default libraries.  */
> > >  #ifndef LIB_SPEC
> > > 
> > > -#define LIB_SPEC "%{!shared:%{g*:-lg}
> > > %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}" +#define LIB_SPEC
> > > "%{!shared|!r:%{g*:-lg} %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}">
> > > 
> > >  #endif
> > 
> > '!' binds more closely than '|' in specs.  That is, !shared|!r means the
> > following specs are used unless both -shared and -r are specified, which
> > seems nonsensical to me.  I'd expect something more like "shared|r:;" to
> > expand to nothing if either -shared or -r is passed and to what follows if
> > neither is passed.
> > 
> > And that ignores that this LIB_SPEC value in gcc.c is largely irrelevant,
> > as it's generally overridden by targets - and normally for targets using
> > ELF shared libraries, for example, -lc *does* have to be used when linking
> > with -shared.
> > 
> > I think you're changing the wrong place for this.  If you want -r to be
> > usable with GCC without using -nostdlib (which is an interesting
> > question), you actually need to change LINK_COMMAND_SPEC (also sometimes
> > overridden for targets) to handle -r more like -nostdlib -nostartfiles.
> 
> Okay, so like this?


Any further comments or corrections to the updated patch in the parent message?

'Allan




Re: [Patch][GCC] Document and fix -r (partial linking)

2018-08-03 Thread Allan Sandfeld Jensen
On Mittwoch, 1. August 2018 18:32:30 CEST Joseph Myers wrote:
> On Wed, 1 Aug 2018, Allan Sandfeld Jensen wrote:
> > gcc/
> > 
> > * gcc.c: Correct default specs for -r
> 
> I don't follow why your changes (which would need describing for each
> individual spec changed) are corrections.
> 
> >  /* config.h can define LIB_SPEC to override the default libraries.  */
> >  #ifndef LIB_SPEC
> > 
> > -#define LIB_SPEC "%{!shared:%{g*:-lg}
> > %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}" +#define LIB_SPEC
> > "%{!shared|!r:%{g*:-lg} %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}"> 
> >  #endif
> 
> '!' binds more closely than '|' in specs.  That is, !shared|!r means the
> following specs are used unless both -shared and -r are specified, which
> seems nonsensical to me.  I'd expect something more like "shared|r:;" to
> expand to nothing if either -shared or -r is passed and to what follows if
> neither is passed.
> 
> And that ignores that this LIB_SPEC value in gcc.c is largely irrelevant,
> as it's generally overridden by targets - and normally for targets using
> ELF shared libraries, for example, -lc *does* have to be used when linking
> with -shared.
> 
> I think you're changing the wrong place for this.  If you want -r to be
> usable with GCC without using -nostdlib (which is an interesting
> question), you actually need to change LINK_COMMAND_SPEC (also sometimes
> overridden for targets) to handle -r more like -nostdlib -nostartfiles.
> 
Okay, so like this?
From 9de68f2ef0b77a0c0bcf0d83232e7fc34b006406 Mon Sep 17 00:00:00 2001
From: Allan Sandfeld Jensen 
Date: Wed, 1 Aug 2018 18:07:05 +0200
Subject: [PATCH] Fix and document -r option

The option has existed and been working for years,
make sure it implies the right extra options, and list
it in the documentation.

2018-07-29 Allan Sandfeld Jensen 

gcc/doc

* invoke.texi: Document -r

gcc/
* gcc.c (LINK_COMMAND_SPEC): Handle -r like -nostdlib
* config/darwin.h (LINK_COMMAND_SPEC): Handle -r like -nostdlib
---
 gcc/config/darwin.h | 8 
 gcc/doc/invoke.texi | 7 ++-
 gcc/gcc.c   | 6 +++---
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/gcc/config/darwin.h b/gcc/config/darwin.h
index 980ad9b4057..6ad657a4958 100644
--- a/gcc/config/darwin.h
+++ b/gcc/config/darwin.h
@@ -178,20 +178,20 @@ extern GTY(()) int darwin_ms_struct;
"%X %{s} %{t} %{Z} %{u*} \
 %{e*} %{r} \
 %{o*}%{!o:-o a.out} \
-%{!nostdlib:%{!nostartfiles:%S}} \
+%{!nostdlib:%{!r:%{!nostartfiles:%S}}} \
 %{L*} %(link_libgcc) %o %{fprofile-arcs|fprofile-generate*|coverage:-lgcov} \
 %{fopenacc|fopenmp|%:gt(%{ftree-parallelize-loops=*:%*} 1): \
   %{static|static-libgcc|static-libstdc++|static-libgfortran: libgomp.a%s; : -lgomp } } \
 %{fgnu-tm: \
   %{static|static-libgcc|static-libstdc++|static-libgfortran: libitm.a%s; : -litm } } \
-%{!nostdlib:%{!nodefaultlibs:\
+%{!nostdlib:%{!r:%{!nodefaultlibs:\
   %{%:sanitize(address): -lasan } \
   %{%:sanitize(undefined): -lubsan } \
   %(link_ssp) \
   " DARWIN_EXPORT_DYNAMIC " %

Re: [PATCH][x86] Match movss and movsd "blend" instructions

2018-08-02 Thread Allan Sandfeld Jensen
On Donnerstag, 2. August 2018 23:15:28 CEST Marc Glisse wrote:
> On Thu, 2 Aug 2018, Allan Sandfeld Jensen wrote:
> > I forgot. One of the things that makes using __builtin_shuffle ugly is
> > that __v4si, which is needed as the shuffle argument in _mm_move_ss, is
> > declared in emmintrin.h, while _mm_move_ss itself is in xmmintrin.h.
> 
> __v4si is some internal detail, I don't see much issue with moving it to
> xmmintrin.h if you want to use it there.
> 
> > In general the gcc __builtin_shuffle syntax, with the permutation
> > argument being a vector, is kind of awkward. At least for declaring
> > intrinsics, the clang style, where the permutation indices are extra
> > arguments, is easier to deal with:
> > __builtin_shuffle(a, b, (__v4si){4, 0, 1, 2})
> > vs
> > __builtin_shuffle(a, b, 4, 0, 1, 2)
> 
> __builtin_shufflevector IIRC
> 
> >> The question is what users expect and get when they use -O0 with
> >> intrinsics?> 
> > Here is the version with __builtin_shuffle. It might be more expectable
> > -O0, but it is also uglier.
> 
> I am not convinced -O0 is very important.
> 
Me neither, and in any case I would argue that the logic recognizing the 
vector construction patterns is not an optimization but instruction matching.

> If you start extending your approach to _mm_add_sd and others, while one
> instruction is easy enough to recognize, if we put several in a row, they
> will be partially simplified and may become harder to recognize.
> { x*(y+v[0]-z), v[1] } requires that you notice that the upper part of
> this vector is v[1], i.e. the upper part of a vector whose lower part
> appears somewhere in the arbitrarily complex expression for the lower
> part of the result. And you then have to propagate the fact that you are
> doing vector operations all the way back to v[0].
> 
> I don't have a strong opinion on what the best approach is.

Yes, I am not sure all of those could be done exhaustively with the existing 
logic, and it might also be of dubious value, as in almost all cases the ps 
instructions have the same latency and bandwidth as the ss instructions, so 
developers should probably use the _ps versions, which are scheduled better by 
the compiler (or at least by gcc).
It was just an idea, and I haven't tried it at this point.

'Allan





Re: [PATCH][x86] Match movss and movsd "blend" instructions

2018-08-02 Thread Allan Sandfeld Jensen
On Donnerstag, 2. August 2018 23:46:37 CEST Jakub Jelinek wrote:
> On Thu, Aug 02, 2018 at 10:50:58PM +0200, Allan Sandfeld Jensen wrote:
> > Here is the version with __builtin_shuffle. It might be more expectable
> > -O0, but it is also uglier.
> 
> I don't find anything ugly on it, except the formatting glitches (missing
> space before (, overlong line, and useless __extension__).
> Improving code generated for __builtin_shuffle is desirable too.
> 

__extension__ is needed when using the {...} initialization, otherwise 
-std=c89 will produce standards warnings.  The line is a bit long, but I 
thought it looked better like this rather than adding any emergency line 
breaks. Is there a hard limit?

> > --- a/gcc/config/i386/xmmintrin.h
> > +++ b/gcc/config/i386/xmmintrin.h
> > @@ -1011,7 +1011,8 @@ _mm_storer_ps (float *__P, __m128 __A)
> > 
> >  extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__,
> >  __artificial__)) _mm_move_ss (__m128 __A, __m128 __B)
> >  {
> > 
> > -  return (__m128) __builtin_ia32_movss ((__v4sf)__A, (__v4sf)__B);
> > +  return __extension__ (__m128) __builtin_shuffle((__v4sf)__A,
> > (__v4sf)__B, + 
> > (__attribute__((__vector_size__ (16))) int){4, 1, 2, 3});
> And obviously use __v4si here instead of __attribute__((__vector_size__
> (16))) int.
> 
__v4si is declared in emmintrin.h, so I couldn't use it here unless I moved 
the definition. I tried changing as little as possible to not trigger bike 
shedding.

'Allan





Re: [PATCH][x86] Match movss and movsd "blend" instructions

2018-08-02 Thread Allan Sandfeld Jensen
On Donnerstag, 2. August 2018 11:18:41 CEST Richard Biener wrote:
> On Thu, Aug 2, 2018 at 11:12 AM Allan Sandfeld Jensen
> 
>  wrote:
> > On Mittwoch, 1. August 2018 18:51:41 CEST Marc Glisse wrote:
> > > On Wed, 1 Aug 2018, Allan Sandfeld Jensen wrote:
> > > >  extern __inline __m128d __attribute__((__gnu_inline__,
> > > >  __always_inline__,
> > > > 
> > > > __artificial__))
> > > > 
> > > >  _mm_move_sd (__m128d __A, __m128d __B)
> > > >  {
> > > > 
> > > > -  return (__m128d) __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B);
> > > > +  return __extension__ (__m128d)(__v2df){__B[0],__A[1]};
> > > > 
> > > >  }
> > > 
> > > If the goal is to have it represented as a VEC_PERM_EXPR internally, I
> > > wonder if we should be explicit and use __builtin_shuffle instead of
> > > relying on some forwprop pass to transform it. Maybe not, just asking.
> > > And
> > > the answer need not even be the same for _mm_move_sd and _mm_move_ss.
> > 
> > I wrote it this way because this pattern could later also be used for the
> > other _ss intrinsics, such as _mm_add_ss, where a __builtin_shuffle could
> > not. To match the other intrinsics, the logic that tries to match vector
> > construction just needs to be extended to try merging patterns even if
> > one of the subexpressions is not simple.
> 
> The question is what users expect and get when they use -O0 with intrinsics?
> 
> Richard.
> 
Here is the version with __builtin_shuffle. It might be more expectable -O0, 
but it is also uglier.

diff --git a/gcc/config/i386/emmintrin.h b/gcc/config/i386/emmintrin.h
index b940a39d27b..6501638f619 100644
--- a/gcc/config/i386/emmintrin.h
+++ b/gcc/config/i386/emmintrin.h
@@ -113,7 +113,7 @@ _mm_setzero_pd (void)
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_move_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d) __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B);
+  return __extension__ (__m128d) __builtin_shuffle((__v2df)__A, (__v2df)__B, (__v2di){2, 1});
 }
 
 /* Load two DPFP values from P.  The address must be 16-byte aligned.  */
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index ee409cfe7e4..2337ef5ea08 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -46143,6 +46143,46 @@ expand_vselect_vconcat (rtx target, rtx op0, rtx op1,
   return ok;
 }
 
+/* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to implement D
+   using movss or movsd.  */
+static bool
+expand_vec_perm_movs (struct expand_vec_perm_d *d)
+{
+  machine_mode vmode = d->vmode;
+  unsigned i, nelt = d->nelt;
+  rtx x;
+
+  if (d->one_operand_p)
+    return false;
+
+  if (TARGET_SSE2 && (vmode == V2DFmode || vmode == V4SFmode))
+    ;
+  else
+    return false;
+
+  /* Only the first element is changed.  */
+  if (d->perm[0] != nelt && d->perm[0] != 0)
+    return false;
+  for (i = 1; i < nelt; ++i)
+    {
+      if (d->perm[i] != i + nelt - d->perm[0])
+        return false;
+    }
+
+  if (d->testing_p)
+    return true;
+
+  if (d->perm[0] == nelt)
+    x = gen_rtx_VEC_MERGE (vmode, d->op1, d->op0, GEN_INT (1));
+  else
+    x = gen_rtx_VEC_MERGE (vmode, d->op0, d->op1, GEN_INT (1));
+
+  emit_insn (gen_rtx_SET (d->target, x));
+
+  return true;
+}
+
 /* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to implement D
in terms of blendp[sd] / pblendw / pblendvb / vpblendd.  */
 
@@ -46885,6 +46925,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
 	}
 }
 
+  /* Try movss/movsd instructions.  */
+  if (expand_vec_perm_movs (d))
+    return true;
+
   /* Finally, try the fully general two operand permute.  */
   if (expand_vselect_vconcat (d->target, d->op0, d->op1, d->perm, nelt,
 			  d->testing_p))
diff --git a/gcc/config/i386/xmmintrin.h b/gcc/config/i386/xmmintrin.h
index f64f3f74a0b..45b99ff87d5 100644
--- a/gcc/config/i386/xmmintrin.h
+++ b/gcc/config/i386/xmmintrin.h
@@ -1011,7 +1011,8 @@ _mm_storer_ps (float *__P, __m128 __A)
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_move_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_movss ((__v4sf)__A, (__v4sf)__B);
+  return __extension__ (__m128) __builtin_shuffle((__v4sf)__A, (__v4sf)__B,
+  (__attribute__((__vector_size__ (16))) int){4, 1, 2, 3});
 }
 
 /* Extracts one of the four words of A.  The selector N must be immediate.  */


Re: [PATCH][x86] Match movss and movsd "blend" instructions

2018-08-02 Thread Allan Sandfeld Jensen
On Mittwoch, 1. August 2018 18:51:41 CEST Marc Glisse wrote:
> On Wed, 1 Aug 2018, Allan Sandfeld Jensen wrote:
> >  extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__,
> > 
> > __artificial__))
> > 
> >  _mm_move_sd (__m128d __A, __m128d __B)
> >  {
> > 
> > -  return (__m128d) __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B);
> > +  return __extension__ (__m128d)(__v2df){__B[0],__A[1]};
> > 
> >  }
> 
> If the goal is to have it represented as a VEC_PERM_EXPR internally, I
> wonder if we should be explicit and use __builtin_shuffle instead of
> relying on some forwprop pass to transform it. Maybe not, just asking. And
> the answer need not even be the same for _mm_move_sd and _mm_move_ss.

I forgot. One of the things that makes using __builtin_shuffle ugly is that 
__v4si, which is needed as the shuffle argument in _mm_move_ss, is declared 
in emmintrin.h, while _mm_move_ss itself is in xmmintrin.h.

In general the gcc __builtin_shuffle syntax, with the permutation argument 
being a vector, is kind of awkward. At least for declaring intrinsics, the 
clang style, where the permutation indices are extra arguments, is easier to 
deal with:
__builtin_shuffle(a, b, (__v4si){4, 0, 1, 2})
 vs
 __builtin_shuffle(a, b, 4, 0, 1, 2)
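For concreteness, a minimal sketch of the GCC form using generic vector extensions (the helper name here is illustrative, not actual header code):

```c
typedef int v4si __attribute__((vector_size(16)));
typedef float v4sf __attribute__((vector_size(16)));

/* GCC style: the permutation mask is itself an integer vector with the
   same number of lanes.  Indices 0..3 select from a, 4..7 select from b,
   so {4, 1, 2, 3} is exactly the movss pattern: b's low element followed
   by a's upper three.  */
static v4sf move_ss_like(v4sf a, v4sf b)
{
    return __builtin_shuffle(a, b, (v4si){4, 1, 2, 3});
}
```

The clang equivalent would be `__builtin_shufflevector (a, b, 4, 1, 2, 3)`, with the indices passed as plain integer arguments rather than a vector.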







Re: [PATCH][x86] Match movss and movsd "blend" instructions

2018-08-02 Thread Allan Sandfeld Jensen
On Mittwoch, 1. August 2018 18:51:41 CEST Marc Glisse wrote:
> On Wed, 1 Aug 2018, Allan Sandfeld Jensen wrote:
> >  extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__,
> > 
> > __artificial__))
> > 
> >  _mm_move_sd (__m128d __A, __m128d __B)
> >  {
> > 
> > -  return (__m128d) __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B);
> > +  return __extension__ (__m128d)(__v2df){__B[0],__A[1]};
> > 
> >  }
> 
> If the goal is to have it represented as a VEC_PERM_EXPR internally, I
> wonder if we should be explicit and use __builtin_shuffle instead of
> relying on some forwprop pass to transform it. Maybe not, just asking. And
> the answer need not even be the same for _mm_move_sd and _mm_move_ss.

I wrote it this way because this pattern could later also be used for the 
other _ss intrinsics, such as _mm_add_ss, where a __builtin_shuffle could not. 
To match the other intrinsics, the logic that tries to match vector 
construction just needs to be extended to try merging patterns even if one of 
the subexpressions is not simple.
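A hypothetical sketch of what such an _ss intrinsic written as a vector construction could look like (illustrative only, not the proposed header code): only lane 0 carries a computed subexpression, while lanes 1..3 pass through from A, which is exactly the shape a vector-construction matcher would have to recognize.

```c
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical _mm_add_ss-style helper: lane 0 is a + b in the low
   element, the remaining lanes are copied unchanged from a.  A matcher
   would need to merge the non-trivial lane-0 expression with the
   pass-through lanes to emit a single addss.  */
static v4sf add_ss_like(v4sf a, v4sf b)
{
    return (v4sf){a[0] + b[0], a[1], a[2], a[3]};
}
```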

'Allan




Re: [Patch][GCC] Document and fix -r (partial linking)

2018-08-01 Thread Allan Sandfeld Jensen
On Mittwoch, 1. August 2018 18:32:30 CEST Joseph Myers wrote:
> On Wed, 1 Aug 2018, Allan Sandfeld Jensen wrote:
> > gcc/
> > 
> > * gcc.c: Correct default specs for -r
> 
> I don't follow why your changes (which would need describing for each
> individual spec changed) are corrections.
> 
> >  /* config.h can define LIB_SPEC to override the default libraries.  */
> >  #ifndef LIB_SPEC
> > 
> > -#define LIB_SPEC "%{!shared:%{g*:-lg}
> > %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}" +#define LIB_SPEC
> > "%{!shared|!r:%{g*:-lg} %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}"> 
> >  #endif
> 
> '!' binds more closely than '|' in specs.  That is, !shared|!r means the
> following specs are used unless both -shared and -r are specified, which
> seems nonsensical to me.  I'd expect something more like "shared|r:;" to
> expand to nothing if either -shared or -r is passed and to what follows if
> neither is passed.
> 
> And that ignores that this LIB_SPEC value in gcc.c is largely irrelevant,
> as it's generally overridden by targets - and normally for targets using
> ELF shared libraries, for example, -lc *does* have to be used when linking
> with -shared.
> 
> I think you're changing the wrong place for this.  If you want -r to be
> usable with GCC without using -nostdlib (which is an interesting
> question), you actually need to change LINK_COMMAND_SPEC (also sometimes
> overridden for targets) to handle -r more like -nostdlib -nostartfiles.
> 
Ok, thanks for the information, I will investigate that. 
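To make Joseph's binding rules concrete, two hypothetical spec fragments (illustrative only, not actual GCC specs):

```
%{!shared|!r:X}    expands X unless BOTH -shared and -r are given,
                   because '!' binds tighter than '|': (!shared)|(!r)
%{shared|r:;X}     expands to nothing if EITHER -shared or -r is given,
                   and to X only if neither is passed
```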

> > -#define LINK_PIE_SPEC "%{static|shared|r:;" PIE_SPEC ":" LD_PIE_SPEC "} "
> > +#define LINK_PIE_SPEC "%{static|shared|r|ar:;" PIE_SPEC ":" LD_PIE_SPEC
> > "} "
> What's this "-ar" option you're handling here?

Dead code from a previous more ambitious version of the patch. I will remove.

`Allan





[Patch][GCC] Document and fix -r (partial linking)

2018-08-01 Thread Allan Sandfeld Jensen
The option has existed and been working for years,
make sure it implies the right extra options, and list
it in the documentation.

2018-08-01 Allan Sandfeld Jensen 

gcc/doc

* invoke.texi: Document -r

gcc/
* gcc.c: Correct default specs for -r
---
 gcc/doc/invoke.texi | 7 ++-
 gcc/gcc.c   | 6 +++---
 2 files changed, 9 insertions(+), 4 deletions(-)

From 638966e6c7e072ca46c6af0664fbd57bedbfff80 Mon Sep 17 00:00:00 2001
From: Allan Sandfeld Jensen 
Date: Wed, 1 Aug 2018 18:07:05 +0200
Subject: [PATCH] Fix and document -r option

The option has existed and been working for years,
make sure it implies the right extra options, and list
it in the documentation.

2018-07-29 Allan Sandfeld Jensen 

gcc/doc

* invoke.texi: Document -r

gcc/
* gcc.c: Correct default specs for -r
---
 gcc/doc/invoke.texi | 7 ++-
 gcc/gcc.c   | 6 +++---
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 6047d82065a..7da30bd9d99 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -518,7 +518,7 @@ Objective-C and Objective-C++ Dialects}.
 @xref{Link Options,,Options for Linking}.
 @gccoptlist{@var{object-file-name}  -fuse-ld=@var{linker}  -l@var{library} @gol
 -nostartfiles  -nodefaultlibs  -nolibc  -nostdlib @gol
--pie  -pthread  -rdynamic @gol
+-pie  -pthread  -r  -rdynamic @gol
 -s  -static -static-pie -static-libgcc  -static-libstdc++ @gol
 -static-libasan  -static-libtsan  -static-liblsan  -static-libubsan @gol
 -shared  -shared-libgcc  -symbolic @gol
@@ -12444,6 +12444,11 @@ x86 Cygwin and MinGW targets.  On some targets this option also sets
 flags for the preprocessor, so it should be used consistently for both
 compilation and linking.
 
+@item -r
+@opindex r
+Produce a relocatable object as output. This is also known as partial
+linking.
+
 @item -rdynamic
 @opindex rdynamic
 Pass the flag @option{-export-dynamic} to the ELF linker, on targets
diff --git a/gcc/gcc.c b/gcc/gcc.c
index 780d4859ef3..858a5600c14 100644
--- a/gcc/gcc.c
+++ b/gcc/gcc.c
@@ -675,7 +675,7 @@ proper position among the other output files.  */
 
 /* config.h can define LIB_SPEC to override the default libraries.  */
 #ifndef LIB_SPEC
-#define LIB_SPEC "%{!shared:%{g*:-lg} %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}"
+#define LIB_SPEC "%{!shared|!r:%{g*:-lg} %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}"
 #endif
 
 /* When using -fsplit-stack we need to wrap pthread_create, in order
@@ -797,7 +797,7 @@ proper position among the other output files.  */
 /* config.h can define STARTFILE_SPEC to override the default crt0 files.  */
 #ifndef STARTFILE_SPEC
 #define STARTFILE_SPEC  \
-  "%{!shared:%{pg:gcrt0%O%s}%{!pg:%{p:mcrt0%O%s}%{!p:crt0%O%s}}}"
+  "%{!shared|!r:%{pg:gcrt0%O%s}%{!pg:%{p:mcrt0%O%s}%{!p:crt0%O%s}}}"
 #endif
 
 /* config.h can define ENDFILE_SPEC to override the default crtn files.  */
@@ -936,7 +936,7 @@ proper position among the other output files.  */
 #else
 #define LD_PIE_SPEC ""
 #endif
-#define LINK_PIE_SPEC "%{static|shared|r:;" PIE_SPEC ":" LD_PIE_SPEC "} "
+#define LINK_PIE_SPEC "%{static|shared|r|ar:;" PIE_SPEC ":" LD_PIE_SPEC "} "
 #endif
 
 #ifndef LINK_BUILDID_SPEC
-- 
2.17.0



[PATCH][x86] Match movss and movsd "blend" instructions

2018-08-01 Thread Allan Sandfeld Jensen
Adds the ability to match movss and movsd as blend patterns,
implemented in a new method to be able to match these before shuffles,
while keeping other blends after.

2018-07-29 Allan Sandfeld Jensen 

gcc/config/i386

* i386.cc (expand_vec_perm_movs): New method matching movs
patterns.
* i386.cc (expand_vec_perm_1): Try the new method.

gcc/testsuite

* gcc.target/i386/sse2-movs.c: New test.
---
 gcc/config/i386/emmintrin.h   |  2 +-
 gcc/config/i386/i386.c| 44 +++
 gcc/config/i386/xmmintrin.h   |  2 +-
 gcc/testsuite/gcc.target/i386/sse2-movs.c | 21 +++
 4 files changed, 67 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/sse2-movs.c
From e96b3aa9017ad0d19238c923146196405cc4e5af Mon Sep 17 00:00:00 2001
From: Allan Sandfeld Jensen 
Date: Wed, 9 May 2018 12:35:14 +0200
Subject: [PATCH] Match movss and movsd blends

Adds the ability to match movss and movsd as blend patterns,
implemented in a new method to be able to match these before shuffles,
while keeping other blends after.

2018-07-29 Allan Sandfeld Jensen 

gcc/config/i386

* i386.cc (expand_vec_perm_movs): New method matching movs
patterns.
* i386.cc (expand_vec_perm_1): Try the new method.

gcc/testsuite

* gcc.target/i386/sse2-movs.c: New test.
---
 gcc/config/i386/emmintrin.h   |  2 +-
 gcc/config/i386/i386.c| 44 +++
 gcc/config/i386/xmmintrin.h   |  2 +-
 gcc/testsuite/gcc.target/i386/sse2-movs.c | 21 +++
 4 files changed, 67 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/sse2-movs.c

diff --git a/gcc/config/i386/emmintrin.h b/gcc/config/i386/emmintrin.h
index b940a39d27b..1efd943bac4 100644
--- a/gcc/config/i386/emmintrin.h
+++ b/gcc/config/i386/emmintrin.h
@@ -113,7 +113,7 @@ _mm_setzero_pd (void)
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_move_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d) __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B);
+  return __extension__ (__m128d)(__v2df){__B[0],__A[1]};
 }
 
 /* Load two DPFP values from P.  The address must be 16-byte aligned.  */
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index ee409cfe7e4..2337ef5ea08 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -46143,6 +46143,46 @@ expand_vselect_vconcat (rtx target, rtx op0, rtx op1,
   return ok;
 }
 
+/* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to implement D
+   using movss or movsd.  */
+static bool
+expand_vec_perm_movs (struct expand_vec_perm_d *d)
+{
+  machine_mode vmode = d->vmode;
+  unsigned i, nelt = d->nelt;
+  rtx x;
+
+  if (d->one_operand_p)
+    return false;
+
+  if (TARGET_SSE2 && (vmode == V2DFmode || vmode == V4SFmode))
+    ;
+  else
+    return false;
+
+  /* Only the first element is changed.  */
+  if (d->perm[0] != nelt && d->perm[0] != 0)
+    return false;
+  for (i = 1; i < nelt; ++i)
+    {
+      if (d->perm[i] != i + nelt - d->perm[0])
+        return false;
+    }
+
+  if (d->testing_p)
+    return true;
+
+  if (d->perm[0] == nelt)
+    x = gen_rtx_VEC_MERGE (vmode, d->op1, d->op0, GEN_INT (1));
+  else
+    x = gen_rtx_VEC_MERGE (vmode, d->op0, d->op1, GEN_INT (1));
+
+  emit_insn (gen_rtx_SET (d->target, x));
+
+  return true;
+}
+
 /* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to implement D
in terms of blendp[sd] / pblendw / pblendvb / vpblendd.  */
 
@@ -46885,6 +46925,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
 	}
 }
 
+  /* Try movss/movsd instructions.  */
+  if (expand_vec_perm_movs (d))
+    return true;
+
   /* Finally, try the fully general two operand permute.  */
   if (expand_vselect_vconcat (d->target, d->op0, d->op1, d->perm, nelt,
 			  d->testing_p))
diff --git a/gcc/config/i386/xmmintrin.h b/gcc/config/i386/xmmintrin.h
index f64f3f74a0b..699f681e054 100644
--- a/gcc/config/i386/xmmintrin.h
+++ b/gcc/config/i386/xmmintrin.h
@@ -1011,7 +1011,7 @@ _mm_storer_ps (float *__P, __m128 __A)
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_move_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_movss ((__v4sf)__A, (__v4sf)__B);
+  return __extension__ (__m128)(__v4sf){__B[0],__A[1],__A[2],__A[3]};
 }
 
 /* Extracts one of the four words of A.  The selector N must be immediate.  */
diff --git a/gcc/testsuite/gcc.target/i386/sse2-movs.c b/gcc/testsuite/gcc.target/i386/sse2-movs.c
new file mode 100644
index 000..79f486cfa82
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse2-movs.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2" } */
+/* { dg-require-effective-target sse2 } */
+/* { dg-final { scan-assembler "movss" } } */
+/* { dg-final {

Re: O2 Agressive Optimisation by GCC

2018-07-22 Thread Allan Sandfeld Jensen
On Sonntag, 22. Juli 2018 17:01:29 CEST Umesh Kalappa wrote:
> Allan ,
> 
> >>he might as well go traditional
> 
> you mean using the locks ?
> 

No, I meant relying on undefined behavior. In your case I would recommend 
using modern atomics, which are defined behavior, and fast. I was just 
reminded of all the nasty and theoretically wrong ways we used to do stuff 
like that 20 years ago to implement fallback locks. For instance using -O0, 
asm declarations, relying on non-inlined function calls as memory barriers, 
etc. All stuff that "worked", but relied on various degrees of undefined 
behavior.

Still if you are curious, it might be fun playing with stuff like that, and 
try to figure for yourself why it works, just remember it is undefined 
behavior and therefore not recommended.
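As a sketch of the modern, defined-behavior alternative to those old tricks, here is a minimal C11 fallback lock (illustrative, assuming C11 <stdatomic.h> is available):

```c
#include <stdatomic.h>

/* A minimal spinlock built on C11 atomic_flag: the defined-behavior
   replacement for the old volatile / -O0 / asm-barrier hacks.  */
static atomic_flag lk = ATOMIC_FLAG_INIT;

static void lk_acquire(void)
{
    /* test_and_set returns the previous value; spin until it was clear.
       memory_order_acquire orders later reads after lock acquisition.  */
    while (atomic_flag_test_and_set_explicit(&lk, memory_order_acquire))
        ;  /* spin */
}

static void lk_release(void)
{
    /* memory_order_release publishes writes made inside the critical
       section before the flag is cleared.  */
    atomic_flag_clear_explicit(&lk, memory_order_release);
}
```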

'Allan




Re: O2 Agressive Optimisation by GCC

2018-07-20 Thread Allan Sandfeld Jensen
On Samstag, 21. Juli 2018 00:21:48 CEST Jonathan Wakely wrote:
> On Fri, 20 Jul 2018 at 23:06, Allan Sandfeld Jensen wrote:
> > On Freitag, 20. Juli 2018 14:19:12 CEST Umesh Kalappa wrote:
> > > Hi All ,
> > > 
> > > We are looking at the C sample i.e
> > > 
> > > extern int i,j;
> > > 
> > > int test()
> > > {
> > > while(1)
> > > {   i++;
> > > 
> > > j=20;
> > > 
> > > }
> > > return 0;
> > > }
> > > 
> > > command used :(gcc 8.1.0)
> > > gcc -S test.c -O2
> > > 
> > > the generated asm for x86
> > > 
> > > .L2:
> > > jmp .L2
> > > 
> > > we understand that the infinite loop is not deterministic, the compiler
> > > is free to treat that as UB and do aggressive optimization, but we
> > > need to keep the side effects like j=20 untouched by optimization.
> > > 
> > > Please note that using the volatile qualifier for i and j or an empty
> > > asm("") in the while loop will stop the optimizer, but we don't want
> > > to do that.
> > 
> > But you need to do that! If you want changes to a variable to be
> > observable in another thread, you need to use either volatile,
> 
> No, volatile doesn't work for that.
> 
It does, but you shouldn't use it for that due to many other reasons (though 
the Linux kernel still does). But if the guy wants to code primitives without 
using system calls or atomics, he might as well go traditional.

'Allan




Re: O2 Agressive Optimisation by GCC

2018-07-20 Thread Allan Sandfeld Jensen
On Freitag, 20. Juli 2018 14:19:12 CEST Umesh Kalappa wrote:
> Hi All ,
> 
> We are looking at the C sample i.e
> 
> extern int i,j;
> 
> int test()
> {
> while(1)
> {   i++;
> j=20;
> }
> return 0;
> }
> 
> command used :(gcc 8.1.0)
> gcc -S test.c -O2
> 
> the generated asm for x86
> 
> .L2:
> jmp .L2
> 
> we understand that the infinite loop is not deterministic, the compiler
> is free to treat that as UB and do aggressive optimization, but we
> need to keep the side effects like j=20 untouched by optimization.
> 
> Please note that using the volatile qualifier for i and j or an empty
> asm("") in the while loop will stop the optimizer, but we don't want
> to do that.
> 
But you need to do that! If you want changes to a variable to be observable in 
another thread, you need to use either volatile, atomic, or some kind of 
memory barrier, implicit or explicit. It would be the same if the loop weren't 
infinite: the compiler would keep the value in a register during the loop and 
only write it to memory on exiting the test() function.
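A minimal sketch of the atomic variant (with a bounded loop so the example terminates; the function name is illustrative): with atomic_int, the stores in the loop body become observable side effects the optimizer must preserve rather than hoist or delete.

```c
#include <stdatomic.h>

/* Atomic counterparts of the extern i and j from the example: each
   atomic store is a side effect visible to other threads, so the loop
   body cannot be optimized down to an empty "jmp .L2".  */
atomic_int i;
atomic_int j;

void test_n(int n)
{
    for (int k = 0; k < n; ++k)
      {
        atomic_fetch_add_explicit(&i, 1, memory_order_relaxed);
        atomic_store_explicit(&j, 20, memory_order_relaxed);
      }
}
```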

'Allan




Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-31 Thread Allan Sandfeld Jensen
Okay, I think I can withdraw the suggestion. It apparently does not provide a 
stable performance improvement.

I would like to end with sharing the measurements I made that motivated me to 
suggest the change. Hopefully it can be useful if tree-slp-vectorize gets 
improved and the suggestion comes up again.

As I said previously, the benchmarks I ran were not affected, probably because 
most things we benchmark in Qt are hand-optimized already, but the binary size 
with tree-slp-vectorize was on average one or two percent smaller, though this 
was not universal, and many smaller libraries were unaffected.



gcc-8 version 8.1.0 (Debian 8.1.0-4)
gcc-7 version 7.3.0 (Debian 7.3.0-19)

Qt version 5.11.0 (edited to override selective use of -O3)

  size  library

g++-7 -march=corei7 -O2
 8015632 libQt5Widgets.so.5.11.0
 6194288 libQt5Gui.so.5.11.0
  760016 libQt5DBus.so.5.11.0
 5603160 libQt5Core.so.5.11.0

g++-7 -march=corei7 -O2 -ftree-slp-vectorize
 8007440 libQt5Widgets.so.5.11.0
 6182000 libQt5Gui.so.5.11.0
  760016 libQt5DBus.so.5.11.0
 5603224 libQt5Core.so.5.11.0

g++-8 -O2
 8062520 libQt5Widgets.so.5.11.0
 6232160 libQt5Gui.so.5.11.0
  765584 libQt5DBus.so.5.11.0
 5848528 libQt5Core.so.5.11.0

g++-8 -O2 -ftree-slp-vectorize
 8058424 libQt5Widgets.so.5.11.0
 6219872 libQt5Gui.so.5.11.0
  769680 libQt5DBus.so.5.11.0
 5844560 libQt5Core.so.5.11.0

g++-8 -march=corei7 -O2
 8062520 libQt5Widgets.so.5.11.0
 6215584 libQt5Gui.so.5.11.0
  765584 libQt5DBus.so.5.11.0
 580 libQt5Core.so.5.11.0

g++-8 -march=corei7 -O2 -ftree-slp-vectorize
 8046136 libQt5Widgets.so.5.11.0
 6191008 libQt5Gui.so.5.11.0
  765584 libQt5DBus.so.5.11.0
 5840472 libQt5Core.so.5.11.0

g++-8 -march=haswell -O2
 8046136 libQt5Widgets.so.5.11.0
 6170408 libQt5Gui.so.5.11.0
  765584 libQt5DBus.so.5.11.0
 5852448 libQt5Core.so.5.11.0

g++-8 -march=haswell -O2 -ftree-slp-vectorize
 8046136 libQt5Widgets.so.5.11.0
 6158120 libQt5Gui.so.5.11.0
  765584 libQt5DBus.so.5.11.0
 5848480 libQt5Core.so.5.11.0

g++-8 -march=haswell -Os
 6990368 libQt5Widgets.so.5.11.0
 5030616 libQt5Gui.so.5.11.0
  624160 libQt5DBus.so.5.11.0
 4847056 libQt5Core.so.5.11.0

g++-8 -march=haswell -Os -ftree-slp-vectorize
 6986272 libQt5Widgets.so.5.11.0
 5018328 libQt5Gui.so.5.11.0
  624160 libQt5DBus.so.5.11.0
 4847120 libQt5Core.so.5.11.0

g++-8 -march=haswell -Os -flto
 6785760 libQt5Widgets.so.5.11.0
 4844464 libQt5Gui.so.5.11.0
  593488 libQt5DBus.so.5.11.0
 4688432 libQt5Core.so.5.11.0

g++-8 -march=haswell -Os -flto -ftree-slp-vectorize
 6777568 libQt5Widgets.so.5.11.0
 4836272 libQt5Gui.so.5.11.0
  593488 libQt5DBus.so.5.11.0
 4688472 libQt5Core.so.5.11.0





Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-29 Thread Allan Sandfeld Jensen
On Dienstag, 29. Mai 2018 16:57:56 CEST Richard Biener wrote:
>
> so the situation improves but isn't fully fixed (STLF issues maybe?)
> 

That raises the question of whether it helps in these cases even at -O3.

Anyway, it doesn't look good for the change. Did the binary size at least 
improve with -mprefer-avx128, or was that also worse or insignificant?


'Allan




Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-28 Thread Allan Sandfeld Jensen
On Montag, 28. Mai 2018 12:58:20 CEST Richard Biener wrote:
> compile-time effects of the patch on that. Embedded folks may want to run
> their favorite benchmark and report results as well.
> 
> So I did a -O2 -march=haswell [-ftree-slp-vectorize] SPEC CPU 2006 compile
> and run and the compile-time
> effect where measurable (SPEC records on a second granularity) is within
> one second per benchmark
> apart from 410.bwaves (from 3s to 5s)  and 481.wrf (76s to 78s).
> Performance-wise I notice significant
> slowdowns for SPEC FP and some for SPEC INT (I only did a train run
> so far).  I'll re-run with ref input now
> and will post those numbers.
> 
If you continue to see slowdowns, could you check with either no AVX, or with 
-mprefer-avx128? The occasional AVX256 instructions might be downclocking the 
CPU. But yes, that would be a problem for this change on its own.

'Allan




Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-27 Thread Allan Sandfeld Jensen
On Sonntag, 27. Mai 2018 03:23:36 CEST Segher Boessenkool wrote:
> On Sun, May 27, 2018 at 01:25:25AM +0200, Allan Sandfeld Jensen wrote:
> > On Sonntag, 27. Mai 2018 00:05:32 CEST Segher Boessenkool wrote:
> > > On Sat, May 26, 2018 at 11:32:29AM +0200, Allan Sandfeld Jensen wrote:
> > > > I brought this subject up earlier, and was told to suggest it again
> > > > for
> > > > gcc 9, so I have attached the preliminary changes.
> > > > 
> > > > My studies have shown that with generic x86-64 optimization it reduces
> > > > binary size by around 0.5%, and when optimizing for x64 targets with
> > > > SSE4 or better, it reduces binary size by 2-3% on average. The
> > > > performance changes are negligible however*, and I haven't been able
> > > > to
> > > > detect changes in compile time big enough to penetrate general noise
> > > > on
> > > > my platform, but perhaps someone has a better setup for that?
> > > > 
> > > > * I believe that is because it currently works best on non-optimized
> > > > code,
> > > > it is better at big basic blocks doing all kinds of things than
> > > > tightly
> > > > written inner loops.
> > > > 
> > > > Anything else I should test or report?
> > > 
> > > What does it do on other architectures?
> > 
> > I believe NEON would do the same as SSE4, but I can do a check. For
> > architectures without SIMD it essentially does nothing.
> 
> Sorry, I wasn't clear.  What does it do to performance on other
> architectures?  Is it (almost) always a win (or neutral)?  If not, it
> doesn't belong in -O2, not for the generic options at least.
> 
It shouldn't have any way of making code slower, so it is neutral or a win in 
performance, and similarly in code size: merged instructions mean fewer 
instructions.

I never found a benchmark where it really made a measurable difference in 
performance, but I found many large binaries such as Qt or Chromium, where it 
made the binaries a few percent smaller.

Allan




Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-26 Thread Allan Sandfeld Jensen
On Sonntag, 27. Mai 2018 00:05:32 CEST Segher Boessenkool wrote:
> On Sat, May 26, 2018 at 11:32:29AM +0200, Allan Sandfeld Jensen wrote:
> > I brought this subject up earlier, and was told to suggest it again for
> > gcc 9, so I have attached the preliminary changes.
> > 
> > My studies have shown that with generic x86-64 optimization it reduces
> > binary size by around 0.5%, and when optimizing for x64 targets with
> > SSE4 or better, it reduces binary size by 2-3% on average. The
> > performance changes are negligible however*, and I haven't been able to
> > detect changes in compile time big enough to penetrate general noise on
> > my platform, but perhaps someone has a better setup for that?
> > 
> > * I believe that is because it currently works best on non-optimized code,
> > it is better at big basic blocks doing all kinds of things than tightly
> > written inner loops.
> > 
> > Anything else I should test or report?
> 
> What does it do on other architectures?
> 
> 
I believe NEON would do the same as SSE4, but I can do a check. For 
architectures without SIMD it essentially does nothing.

'Allan




Enabling -ftree-slp-vectorize on -O2/Os

2018-05-26 Thread Allan Sandfeld Jensen
I brought this subject up earlier, and was told to suggest it again for gcc 9, 
so I have attached the preliminary changes.

My studies have shown that with generic x86-64 optimization it reduces binary 
size by around 0.5%, and when optimizing for x64 targets with SSE4 or 
better, it reduces binary size by 2-3% on average. The performance changes are 
negligible however*, and I haven't been able to detect changes in compile time 
big enough to penetrate general noise on my platform, but perhaps someone has 
a better setup for that?

* I believe that is because it currently works best on non-optimized code: it 
is better at big basic blocks doing all kinds of things than at tightly written 
inner loops.

Anything else I should test or report?

Best regards
'Allan


diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index beba295bef5..05851229354 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -7612,6 +7612,7 @@ also turns on the following optimization flags:
 -fstore-merging @gol
 -fstrict-aliasing @gol
 -ftree-builtin-call-dce @gol
+-ftree-slp-vectorize @gol
 -ftree-switch-conversion -ftree-tail-merge @gol
 -fcode-hoisting @gol
 -ftree-pre @gol
@@ -7635,7 +7636,6 @@ by @option{-O2} and also turns on the following 
optimization flags:
 -floop-interchange @gol
 -floop-unroll-and-jam @gol
 -fsplit-paths @gol
--ftree-slp-vectorize @gol
 -fvect-cost-model @gol
 -ftree-partial-pre @gol
 -fpeel-loops @gol
@@ -8932,7 +8932,7 @@ Perform loop vectorization on trees. This flag is 
enabled by default at
 @item -ftree-slp-vectorize
 @opindex ftree-slp-vectorize
 Perform basic block vectorization on trees. This flag is enabled by default 
at
-@option{-O3} and when @option{-ftree-vectorize} is enabled.
+@option{-O2} or higher, and when @option{-ftree-vectorize} is enabled.
 
 @item -fvect-cost-model=@var{model}
 @opindex fvect-cost-model
diff --git a/gcc/opts.c b/gcc/opts.c
index 33efcc0d6e7..11027b847e8 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -523,6 +523,7 @@ static const struct default_options 
default_options_table[] =
 { OPT_LEVELS_2_PLUS, OPT_fipa_ra, NULL, 1 },
 { OPT_LEVELS_2_PLUS, OPT_flra_remat, NULL, 1 },
 { OPT_LEVELS_2_PLUS, OPT_fstore_merging, NULL, 1 },
+{ OPT_LEVELS_2_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },
 
 /* -O3 optimizations.  */
 { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
@@ -539,7 +540,6 @@ static const struct default_options 
default_options_table[] =
 { OPT_LEVELS_3_PLUS, OPT_floop_unroll_and_jam, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_fgcse_after_reload, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_ftree_loop_vectorize, NULL, 1 },
-{ OPT_LEVELS_3_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_fvect_cost_model_, NULL, VECT_COST_MODEL_DYNAMIC 
},
 { OPT_LEVELS_3_PLUS, OPT_fipa_cp_clone, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_ftree_partial_pre, NULL, 1 },





Re: [Patch] Request for comments: short int builtins

2018-05-20 Thread Allan Sandfeld Jensen
On Sonntag, 20. Mai 2018 15:07:59 CEST Richard Biener wrote:
> On May 20, 2018 11:01:54 AM GMT+02:00, Allan Sandfeld Jensen 
<li...@carewolf.com> wrote:
> >A little over a year back we had a regression in a point release of gcc
> >
> >because the builtin __builtin_clzs got removed from i386, in part
> >because it
> >is was wrongly named for a target specific builtin, but we were using
> >it in Qt
> >since it existed in multiple compilers. I got the patch removing it
> >partially
> >reverted and the problem solved, but in the meantime I had worked on a
> >patch
> >to make it a generic builtin instead. I have rebased it and added it
> >below,
>should I clean it up further, finish the other builtins, add tests, and
> >propose
> >it, or is this not something we want?
> 
> Can't users simply do clz((unsigned short) s) - 16? GCC should be able to
> handle this for instruction selection with the addition of some folding
> patterns using the corresponding internal function.
> 
Of course, but we already have the builtin for i386, and a version of the 
builtin for all integer types except short on all platforms. Note the patch 
also generally adds short versions of all the general integer builtins, not 
just clzs, and they are not all that trivial to synthesize (without knowing the 
trick, which gcc does).


'Allan





[Patch] Request for comments: short int builtins

2018-05-20 Thread Allan Sandfeld Jensen
A little over a year back we had a regression in a point release of gcc 
because the builtin __builtin_clzs got removed from i386, in part because it 
was wrongly named for a target-specific builtin, but we were using it in Qt 
since it existed in multiple compilers. I got the patch removing it partially 
reverted and the problem solved, but in the meantime I had worked on a patch 
to make it a generic builtin instead. I have rebased it and added it below, 
should I clean it up further, finish the other builtins, add tests, and propose 
it, or is this not something we want?

diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def
index 5365befd351..74a84e653e4 100644
--- a/gcc/builtin-types.def
+++ b/gcc/builtin-types.def
@@ -232,6 +232,8 @@ DEF_FUNCTION_TYPE_1 (BT_FN_INT_LONG, BT_INT, BT_LONG)
 DEF_FUNCTION_TYPE_1 (BT_FN_INT_ULONG, BT_INT, BT_ULONG)
 DEF_FUNCTION_TYPE_1 (BT_FN_INT_LONGLONG, BT_INT, BT_LONGLONG)
 DEF_FUNCTION_TYPE_1 (BT_FN_INT_ULONGLONG, BT_INT, BT_ULONGLONG)
+DEF_FUNCTION_TYPE_1 (BT_FN_INT_INT16, BT_INT, BT_INT16)
+DEF_FUNCTION_TYPE_1 (BT_FN_INT_UINT16, BT_INT, BT_UINT16)
 DEF_FUNCTION_TYPE_1 (BT_FN_INT_INTMAX, BT_INT, BT_INTMAX)
 DEF_FUNCTION_TYPE_1 (BT_FN_INT_UINTMAX, BT_INT, BT_UINTMAX)
 DEF_FUNCTION_TYPE_1 (BT_FN_INT_PTR, BT_INT, BT_PTR)
diff --git a/gcc/builtins.c b/gcc/builtins.c
index 9a2bf8c7d38..a8ad00e8016 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -10689,14 +10689,17 @@ is_inexpensive_builtin (tree decl)
   case BUILT_IN_CLZIMAX:
   case BUILT_IN_CLZL:
   case BUILT_IN_CLZLL:
+  case BUILT_IN_CLZS:
   case BUILT_IN_CTZ:
   case BUILT_IN_CTZIMAX:
   case BUILT_IN_CTZL:
   case BUILT_IN_CTZLL:
+  case BUILT_IN_CTZS:
   case BUILT_IN_FFS:
   case BUILT_IN_FFSIMAX:
   case BUILT_IN_FFSL:
   case BUILT_IN_FFSLL:
+  case BUILT_IN_FFSS:
   case BUILT_IN_IMAXABS:
   case BUILT_IN_FINITE:
   case BUILT_IN_FINITEF:
@@ -10734,10 +10737,12 @@ is_inexpensive_builtin (tree decl)
   case BUILT_IN_POPCOUNTL:
   case BUILT_IN_POPCOUNTLL:
   case BUILT_IN_POPCOUNTIMAX:
+  case BUILT_IN_POPCOUNTS:
   case BUILT_IN_POPCOUNT:
   case BUILT_IN_PARITYL:
   case BUILT_IN_PARITYLL:
   case BUILT_IN_PARITYIMAX:
+  case BUILT_IN_PARITYS:
   case BUILT_IN_PARITY:
   case BUILT_IN_LABS:
   case BUILT_IN_LLABS:
diff --git a/gcc/builtins.def b/gcc/builtins.def
index 449d08d682f..618ee798767 100644
--- a/gcc/builtins.def
+++ b/gcc/builtins.def
@@ -848,15 +848,18 @@ DEF_GCC_BUILTIN(BUILT_IN_CLZ, "clz", 
BT_FN_INT_UINT, ATTR_CONST_NOTHROW_
 DEF_GCC_BUILTIN(BUILT_IN_CLZIMAX, "clzimax", BT_FN_INT_UINTMAX, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GCC_BUILTIN(BUILT_IN_CLZL, "clzl", BT_FN_INT_ULONG, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GCC_BUILTIN(BUILT_IN_CLZLL, "clzll", BT_FN_INT_ULONGLONG, 
ATTR_CONST_NOTHROW_LEAF_LIST)
+DEF_GCC_BUILTIN(BUILT_IN_CLZS, "clzs", BT_FN_INT_UINT16, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GCC_BUILTIN(BUILT_IN_CONSTANT_P, "constant_p", BT_FN_INT_VAR, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GCC_BUILTIN(BUILT_IN_CTZ, "ctz", BT_FN_INT_UINT, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GCC_BUILTIN(BUILT_IN_CTZIMAX, "ctzimax", BT_FN_INT_UINTMAX, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GCC_BUILTIN(BUILT_IN_CTZL, "ctzl", BT_FN_INT_ULONG, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GCC_BUILTIN(BUILT_IN_CTZLL, "ctzll", BT_FN_INT_ULONGLONG, 
ATTR_CONST_NOTHROW_LEAF_LIST)
+DEF_GCC_BUILTIN(BUILT_IN_CTZS, "ctzs", BT_FN_INT_UINT16, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GCC_BUILTIN(BUILT_IN_CLRSB, "clrsb", BT_FN_INT_INT, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GCC_BUILTIN(BUILT_IN_CLRSBIMAX, "clrsbimax", BT_FN_INT_INTMAX, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GCC_BUILTIN(BUILT_IN_CLRSBL, "clrsbl", BT_FN_INT_LONG, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GCC_BUILTIN(BUILT_IN_CLRSBLL, "clrsbll", BT_FN_INT_LONGLONG, 
ATTR_CONST_NOTHROW_LEAF_LIST)
+DEF_GCC_BUILTIN(BUILT_IN_CLRSBS, "clrsbs", BT_FN_INT_INT16, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_EXT_LIB_BUILTIN(BUILT_IN_DCGETTEXT, "dcgettext", 
BT_FN_STRING_CONST_STRING_CONST_STRING_INT, ATTR_FORMAT_ARG_2)
 DEF_EXT_LIB_BUILTIN(BUILT_IN_DGETTEXT, "dgettext", 
BT_FN_STRING_CONST_STRING_CONST_STRING, ATTR_FORMAT_ARG_2)
 DEF_GCC_BUILTIN(BUILT_IN_DWARF_CFA, "dwarf_cfa", BT_FN_PTR, 
ATTR_NULL)
@@ -878,6 +881,7 @@ DEF_EXT_LIB_BUILTIN(BUILT_IN_FFS, "ffs", 
BT_FN_INT_INT, ATTR_CONST_NOTHROW_L
 DEF_EXT_LIB_BUILTIN(BUILT_IN_FFSIMAX, "ffsimax", BT_FN_INT_INTMAX, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_EXT_LIB_BUILTIN(BUILT_IN_FFSL, "ffsl", BT_FN_INT_LONG, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_EXT_LIB_BUILTIN(BUILT_IN_FFSLL, "ffsll", BT_FN_INT_LONGLONG, 
ATTR_CONST_NOTHROW_LEAF_LIST)
+DEF_EXT_LIB_BUILTIN(BUILT_IN_FFSS, "ffss", BT_FN_INT_INT16, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_EXT_LIB_BUILTIN(BUILT_IN_FORK, "fork", BT_FN_PID, 

Re: [PATCH] Use two source permute for vector initialization (PR 85692, take 2)

2018-05-10 Thread Allan Sandfeld Jensen
On Donnerstag, 10. Mai 2018 09:57:22 CEST Jakub Jelinek wrote:
> On Wed, May 09, 2018 at 04:53:19PM +0200, Allan Sandfeld Jensen wrote:
> > > > @@ -2022,8 +2022,9 @@ simplify_vector_constructor
> > > > (gimple_stmt_iterator
> > > > *gsi)>
> > > > 
> > > >elem_type = TREE_TYPE (type);
> > > >elem_size = TREE_INT_CST_LOW (TYPE_SIZE (elem_type));
> > > > 
> > > > -  vec_perm_builder sel (nelts, nelts, 1);
> > > > -  orig = NULL;
> > > > +  vec_perm_builder sel (nelts, 2, nelts);
> > > 
> > > Why this change?  I admit the vec_parm_builder arguments are confusing,
> > > but
> > > I think the second times third is the number of how many indices are
> > > being
> > > pushed into the vector, so I think (nelts, nelts, 1) is right.
> > 
> > I had the impression it was what was selected from. In any case, I changed
> > it because without it I get a crash when vec_perm_indices is created later
> > with a possible nparms of 2.
> 
> The documentation is apparently in vector-builder.h:
>This class is a wrapper around auto_vec for building vectors of T.
>It aims to encode each vector as npatterns interleaved patterns,
>where each pattern represents a sequence:
> 
>  { BASE0, BASE1, BASE1 + STEP, BASE1 + STEP*2, BASE1 + STEP*3, ... }
> 
>The first three elements in each pattern provide enough information
>to derive the other elements.  If all patterns have a STEP of zero,
>we only need to encode the first two elements in each pattern.
>If BASE1 is also equal to BASE0 for all patterns, we only need to
>encode the first element in each pattern.  The number of encoded
>elements per pattern is given by nelts_per_pattern.
> 
>The class can be used in two ways:
> 
>1. It can be used to build a full image of the vector, which is then
>   canonicalized by finalize ().  In this case npatterns is initially
>   the number of elements in the vector and nelts_per_pattern is
>   initially 1.
> 
>2. It can be used to build a vector that already has a known encoding.
>   This is preferred since it is more efficient and copes with
>   variable-length vectors.  finalize () then canonicalizes the encoding
>   to a simpler form if possible.
> 
> As the vector is constant width and we are building the full image of the
> vector, the right arguments are (nelts, nelts, 1) as per 1. above, and the
> finalization can perhaps change it to something more compact.
> 
> > > (and sorry for missing your patch first, the PR wasn't ASSIGNED and
> > > there
> > > was no link to gcc-patches for it).
> > 
> > It is okay. You are welcome to take it over. I am not a regular gcc
> > contributor and thus not well-versed in the details, only the basic logic
> > of how things work.
> 
> Ok, here is my version of the patch.  Bootstrapped/regtested on x86_64-linux
> and i686-linux, ok for trunk?
> 
Looks good to me if that counts for anything.

'Allan




Re: [Patch] Use two source permute for vector initialization (PR 85692)

2018-05-09 Thread Allan Sandfeld Jensen
On Mittwoch, 9. Mai 2018 11:08:02 CEST Jakub Jelinek wrote:
> On Tue, May 08, 2018 at 01:25:35PM +0200, Allan Sandfeld Jensen wrote:
> > 2018-05-08 Allan Sandfeld Jensen <allan.jen...@qt.io>
> 
> 2 spaces between date and name and two spaces between name and email
> address.
> 
> > gcc/
> > 
> > PR tree-optimization/85692
> > * tree-ssa-forwprop.c (simplify_vector_constructor): Try two
> > source permute as well.
> > 
> > gcc/testsuite
> > 
> > * gcc.target/i386/pr85692.c: Test two simply constructions are
> > detected as permute instructions.
> 
> Just
>   * gcc.target/i386/pr85692.c: New test.
> 
> > diff --git a/gcc/testsuite/gcc.target/i386/pr85692.c
> > b/gcc/testsuite/gcc.target/i386/pr85692.c new file mode 100644
> > index 000..322c1050161
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr85692.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O2 -msse4.1" } */
> > +/* { dg-final { scan-assembler "unpcklps" } } */
> > +/* { dg-final { scan-assembler "blendps" } } */
> > +/* { dg-final { scan-assembler-not "shufps" } } */
> > +/* { dg-final { scan-assembler-not "unpckhps" } } */
> > +
> > +typedef float v4sf __attribute__ ((vector_size (16)));
> > +
> > +v4sf unpcklps(v4sf a, v4sf b)
> > +{
> > +return v4sf{a[0],b[0],a[1],b[1]};
> 
> Though, not really sure if this has been tested at all.
> The above is valid only in C++ (and only C++11 and above), while the
> test is compiled as C and thus has to fail.
>
Yes, I thought it had been tested, but it wasn't. The first line also needs to 
change to make it a compile test rather than a run test.

> > @@ -2022,8 +2022,9 @@ simplify_vector_constructor (gimple_stmt_iterator
> > *gsi)> 
> >elem_type = TREE_TYPE (type);
> >elem_size = TREE_INT_CST_LOW (TYPE_SIZE (elem_type));
> > 
> > -  vec_perm_builder sel (nelts, nelts, 1);
> > -  orig = NULL;
> > +  vec_perm_builder sel (nelts, 2, nelts);
> 
> Why this change?  I admit the vec_parm_builder arguments are confusing, but
> I think the second times third is the number of how many indices are being
> pushed into the vector, so I think (nelts, nelts, 1) is right.
> 
I had the impression it was what was selected from. In any case, I changed it 
because without it I get a crash when vec_perm_indices is created later with a 
possible nparms of 2.

> > @@ -2063,10 +2064,26 @@ simplify_vector_constructor (gimple_stmt_iterator
> > *gsi)> 
> > return false;
> > 
> >op1 = gimple_assign_rhs1 (def_stmt);
> >ref = TREE_OPERAND (op1, 0);
> > 
> > -  if (orig)
> > +  if (orig1)
> > 
> > {
> > 
> > - if (ref != orig)
> > -   return false;
> > + if (ref == orig1 || orig2)
> > +   {
> > + if (ref != orig1 && ref != orig2)
> > +   return false;
> > +   }
> > + else
> > +   {
> > + if (TREE_CODE (ref) != SSA_NAME)
> > +   return false;
> > + if (! VECTOR_TYPE_P (TREE_TYPE (ref))
> > + || ! useless_type_conversion_p (TREE_TYPE (op1),
> > > + TREE_TYPE (TREE_TYPE (ref))))
> > +   return false;
> > + if (TREE_TYPE (orig1) != TREE_TYPE (ref))
> > +   return false;
> 
> I think even different type is acceptable here, as long as its conversion to
> orig1's type is useless.
> 
> Furthermore, I think the way you wrote the patch with 2 variables rather
> than an array of 2 elements means too much duplication, this else block
> is a duplication of the else block below.  See the patch I've added to the
> PR
It seemed to me it was clearer like this, but I can see your point.

> (and sorry for missing your patch first, the PR wasn't ASSIGNED and there
> was no link to gcc-patches for it).
> 
It is okay. You are welcome to take it over. I am not a regular gcc 
contributor and thus not well-versed in the details, only the basic logic of 
how things work.

'Allan





Re: [Patch] Use two source permute for vector initialization (PR 85692)

2018-05-08 Thread Allan Sandfeld Jensen
On Dienstag, 8. Mai 2018 12:42:33 CEST Richard Biener wrote:
> On Tue, May 8, 2018 at 12:37 PM, Allan Sandfeld Jensen
> 
> <li...@carewolf.com> wrote:
> > I have tried to fix PR85692 that I opened.
> 
> Please add a testcase as well.  It also helps if you shortly tell what
> the patch does
> in your mail.
> 
Okay. I have updated the patch with a test case based on my motivating 
examples. The patch extends matching a vector construction from a single-source 
permute instruction to also cover a two-source permute instruction.
commit 15c0f6a933d60b085416a59221851b604b955958
Author: Allan Sandfeld Jensen <allan.jen...@qt.io>
Date:   Tue May 8 13:16:18 2018 +0200

Try two source permute for vector construction

simplify_vector_constructor() was detecting when vector construction could
be implemented as a single source permute, but was not detecting when
it could be implemented as a double source permute. This patch adds the
    second case.

2018-05-08 Allan Sandfeld Jensen <allan.jen...@qt.io>

gcc/

PR tree-optimization/85692
* tree-ssa-forwprop.c (simplify_vector_constructor): Try two
source permute as well.

gcc/testsuite

* gcc.target/i386/pr85692.c: Test two simply constructions are
detected as permute instructions.

diff --git a/gcc/testsuite/gcc.target/i386/pr85692.c b/gcc/testsuite/gcc.target/i386/pr85692.c
new file mode 100644
index 000..322c1050161
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr85692.c
@@ -0,0 +1,18 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -msse4.1" } */
+/* { dg-final { scan-assembler "unpcklps" } } */
+/* { dg-final { scan-assembler "blendps" } } */
+/* { dg-final { scan-assembler-not "shufps" } } */
+/* { dg-final { scan-assembler-not "unpckhps" } } */
+
+typedef float v4sf __attribute__ ((vector_size (16)));
+
+v4sf unpcklps(v4sf a, v4sf b)
+{
+return v4sf{a[0],b[0],a[1],b[1]};
+}
+
+v4sf blendps(v4sf a, v4sf b)
+{
+return v4sf{a[0],b[1],a[2],b[3]};
+}
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index 58ec6b47a5b..fbee8064160 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -2004,7 +2004,7 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
 {
   gimple *stmt = gsi_stmt (*gsi);
   gimple *def_stmt;
-  tree op, op2, orig, type, elem_type;
+  tree op, op2, orig1, orig2, type, elem_type;
   unsigned elem_size, i;
   unsigned HOST_WIDE_INT nelts;
   enum tree_code code, conv_code;
@@ -2022,8 +2022,9 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
   elem_type = TREE_TYPE (type);
   elem_size = TREE_INT_CST_LOW (TYPE_SIZE (elem_type));
 
-  vec_perm_builder sel (nelts, nelts, 1);
-  orig = NULL;
+  vec_perm_builder sel (nelts, 2, nelts);
+  orig1 = NULL;
+  orig2 = NULL;
   conv_code = ERROR_MARK;
   maybe_ident = true;
   FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (op), i, elt)
@@ -2063,10 +2064,26 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
 	return false;
   op1 = gimple_assign_rhs1 (def_stmt);
   ref = TREE_OPERAND (op1, 0);
-  if (orig)
+  if (orig1)
 	{
-	  if (ref != orig)
-	return false;
+	  if (ref == orig1 || orig2)
+	{
+	  if (ref != orig1 && ref != orig2)
+	return false;
+	}
+	  else
+	{
+	  if (TREE_CODE (ref) != SSA_NAME)
+		return false;
+	  if (! VECTOR_TYPE_P (TREE_TYPE (ref))
+		  || ! useless_type_conversion_p (TREE_TYPE (op1),
+		  TREE_TYPE (TREE_TYPE (ref))))
+		return false;
+	  if (TREE_TYPE (orig1) != TREE_TYPE (ref))
+		return false;
+	  orig2 = ref;
+	  maybe_ident = false;
+	  }
 	}
   else
 	{
@@ -2076,12 +2093,14 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
 	  || ! useless_type_conversion_p (TREE_TYPE (op1),
	  TREE_TYPE (TREE_TYPE (ref))))
 	return false;
-	  orig = ref;
+	  orig1 = ref;
 	}
   unsigned int elt;
   if (maybe_ne (bit_field_size (op1), elem_size)
	  || !constant_multiple_p (bit_field_offset (op1), elem_size, &elt))
 	return false;
+  if (orig2 && ref == orig2)
+	elt += nelts;
   if (elt != i)
 	maybe_ident = false;
   sel.quick_push (elt);
@@ -2089,14 +2108,17 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
   if (i < nelts)
 return false;
 
-  if (! VECTOR_TYPE_P (TREE_TYPE (orig))
+  if (! VECTOR_TYPE_P (TREE_TYPE (orig1))
   || maybe_ne (TYPE_VECTOR_SUBPARTS (type),
-		   TYPE_VECTOR_SUBPARTS (TREE_TYPE (orig
+		   TYPE_VECTOR_SUBPARTS (TREE_TYPE (orig1
 return false;
 
+  if (!orig2)
+orig2 = orig1;
+
   tree tem;
   if (conv_code != ERROR_MARK
-  && (! supportable_convert_operation (conv_code, type, TREE_TYPE (orig),
+  && (! supportable_convert_operation (conv_code, type, TREE_TYPE (orig1),
	   &tem, &conv_code)
	  || conv_code == CALL_EXPR))

[Patch] Use two source permute for vector initialization (PR 85692)

2018-05-08 Thread Allan Sandfeld Jensen
I have tried to fix PR85692 that I opened.

2018-05-08  Allan Sandfeld Jensen 

PR tree-optimization/85692
* tree-ssa-forwprop.c (simplify_vector_constructor): Detect
  two source permute operations as well.
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index 58ec6b47a5b..fbee8064160 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -2004,7 +2004,7 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
 {
   gimple *stmt = gsi_stmt (*gsi);
   gimple *def_stmt;
-  tree op, op2, orig, type, elem_type;
+  tree op, op2, orig1, orig2, type, elem_type;
   unsigned elem_size, i;
   unsigned HOST_WIDE_INT nelts;
   enum tree_code code, conv_code;
@@ -2022,8 +2022,9 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
   elem_type = TREE_TYPE (type);
   elem_size = TREE_INT_CST_LOW (TYPE_SIZE (elem_type));
 
-  vec_perm_builder sel (nelts, nelts, 1);
-  orig = NULL;
+  vec_perm_builder sel (nelts, 2, nelts);
+  orig1 = NULL;
+  orig2 = NULL;
   conv_code = ERROR_MARK;
   maybe_ident = true;
   FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (op), i, elt)
@@ -2063,10 +2064,26 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
 	return false;
   op1 = gimple_assign_rhs1 (def_stmt);
   ref = TREE_OPERAND (op1, 0);
-  if (orig)
+  if (orig1)
 	{
-	  if (ref != orig)
-	return false;
+	  if (ref == orig1 || orig2)
+	{
+	  if (ref != orig1 && ref != orig2)
+	return false;
+	}
+	  else
+	{
+	  if (TREE_CODE (ref) != SSA_NAME)
+		return false;
+	  if (! VECTOR_TYPE_P (TREE_TYPE (ref))
+		  || ! useless_type_conversion_p (TREE_TYPE (op1),
+		  TREE_TYPE (TREE_TYPE (ref))))
+		return false;
+	  if (TREE_TYPE (orig1) != TREE_TYPE (ref))
+		return false;
+	  orig2 = ref;
+	  maybe_ident = false;
+	  }
 	}
   else
 	{
@@ -2076,12 +2093,14 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
 	  || ! useless_type_conversion_p (TREE_TYPE (op1),
	  TREE_TYPE (TREE_TYPE (ref))))
 	return false;
-	  orig = ref;
+	  orig1 = ref;
 	}
   unsigned int elt;
   if (maybe_ne (bit_field_size (op1), elem_size)
	  || !constant_multiple_p (bit_field_offset (op1), elem_size, &elt))
 	return false;
+  if (orig2 && ref == orig2)
+	elt += nelts;
   if (elt != i)
 	maybe_ident = false;
   sel.quick_push (elt);
@@ -2089,14 +2108,17 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
   if (i < nelts)
 return false;
 
-  if (! VECTOR_TYPE_P (TREE_TYPE (orig))
+  if (! VECTOR_TYPE_P (TREE_TYPE (orig1))
   || maybe_ne (TYPE_VECTOR_SUBPARTS (type),
-		   TYPE_VECTOR_SUBPARTS (TREE_TYPE (orig
+		   TYPE_VECTOR_SUBPARTS (TREE_TYPE (orig1
 return false;
 
+  if (!orig2)
+orig2 = orig1;
+
   tree tem;
   if (conv_code != ERROR_MARK
-  && (! supportable_convert_operation (conv_code, type, TREE_TYPE (orig),
+  && (! supportable_convert_operation (conv_code, type, TREE_TYPE (orig1),
	   &tem, &conv_code)
 	  || conv_code == CALL_EXPR))
 return false;
@@ -2104,16 +2126,16 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
   if (maybe_ident)
 {
   if (conv_code == ERROR_MARK)
-	gimple_assign_set_rhs_from_tree (gsi, orig);
+	gimple_assign_set_rhs_from_tree (gsi, orig1);
   else
-	gimple_assign_set_rhs_with_ops (gsi, conv_code, orig,
+	gimple_assign_set_rhs_with_ops (gsi, conv_code, orig1,
 	NULL_TREE, NULL_TREE);
 }
   else
 {
   tree mask_type;
 
-  vec_perm_indices indices (sel, 1, nelts);
+  vec_perm_indices indices (sel, 2, nelts);
   if (!can_vec_perm_const_p (TYPE_MODE (type), indices))
 	return false;
   mask_type
@@ -2125,15 +2147,14 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
 	return false;
   op2 = vec_perm_indices_to_tree (mask_type, indices);
   if (conv_code == ERROR_MARK)
-	gimple_assign_set_rhs_with_ops (gsi, VEC_PERM_EXPR, orig, orig, op2);
+	gimple_assign_set_rhs_with_ops (gsi, VEC_PERM_EXPR, orig1, orig2, op2);
   else
 	{
 	  gimple *perm
-	= gimple_build_assign (make_ssa_name (TREE_TYPE (orig)),
-   VEC_PERM_EXPR, orig, orig, op2);
-	  orig = gimple_assign_lhs (perm);
+	= gimple_build_assign (make_ssa_name (TREE_TYPE (orig1)),
+   VEC_PERM_EXPR, orig1, orig2, op2);
 	  gsi_insert_before (gsi, perm, GSI_SAME_STMT);
-	  gimple_assign_set_rhs_with_ops (gsi, conv_code, orig,
+	  gimple_assign_set_rhs_with_ops (gsi, conv_code, gimple_assign_lhs (perm),
 	  NULL_TREE, NULL_TREE);
 	}
 }


Re: Resolving LTO symbols for static library boundary

2018-02-07 Thread Allan Sandfeld Jensen
On Dienstag, 6. Februar 2018 01:01:29 CET Jan Hubicka wrote:
> Dne 2018-02-05 18:44, Richard Biener napsal:
> > On February 5, 2018 12:26:58 PM GMT+01:00, Allan Sandfeld Jensen
> > 
> > <li...@carewolf.com> wrote:
> >> Hello GCC
> >> 
> >> In trying to make it possible to use LTO for distro-builds of Qt, I
> >> have again
> >> hit the problem of static libraries. In Qt in general we for LTO rely
> >> on a
> >> library boundary, where LTO gets resolved when generating the library
> >> but no
> >> LTO-symbols are exported in the shared library. This ensures the
> >> library
> >> has a
> >> well defined binary compatible interface and gets LTO optimizations
> >> within
> >> each library. For some private libraries we use static libraries
> >> however,
> >> because we don't need binary compatibility, but though we don't need
> >> BC
> >> 
> >> between Qt versions, the libraries should still be linkable with
> >> different gcc
> >> versions (and with different compilers). However when LTO is enabled
> >> the
> >> static libraries will contain definitions that depend on a single gcc
> >> version
> >> making it unsuitable for distribution.
> >> 
> >> One solution is to enable fat-lto object files for static libraries
> >> but
> >> that
> >> is both a waste of space and compile time, and disables any LTO
> >> optimization
> >> within the library. Ideally I would like to have the static library do
> >> LTO
> >> optimizations internally just like a shared library, but then exported
> >> as
> >> static library.
> >> 
> >> I suspect this is more of gcc task than ar/ld task, as it basically
> >> boils down
> >> to gcc doing for a static library what it does for shared library, but
> >> maybe
> >> export the result as a single combined .o file, that can then be ar'ed
> >> into a
> >> compatible static library.
> >> 
> >> Is this possible?
> > 
> > Hmm. I think you could partially link the static archive contents into
> > a single relocatable object. Or we could add a mode where you do a
> > 1to1 LTO link of the objects and stop at the ltrans object files. You
> > could stuff those into an archive again.
> > 
> > I'm not sure how far Honza got partial LTO linking to work?
> 
> Partial linking of lto .o files into a single non-lto .o file should work
> and it will get you cross-module optimization. The problem is that
> without resolution info the compiler needs to assume that all symbols
> exported by the object files may be referenced by the later
> incremental link, and thus the code quality will definitely not be
> comparable with what you get for LTO on a final binary or DSO. Still, it
> should be better than a non-lto build.
> I would be curious if it is useful for you in practice.
> 
How would I do that partial link, and what are the requirements?

Best regards
'Allan




Resolving LTO symbols for static library boundary

2018-02-05 Thread Allan Sandfeld Jensen
Hello GCC

In trying to make it possible to use LTO for distro-builds of Qt, I have again 
hit the problem of static libraries. In Qt we generally rely on a library 
boundary for LTO, where LTO gets resolved when generating the library but no 
LTO symbols are exported in the shared library. This ensures the library has a 
well-defined, binary-compatible interface and gets LTO optimizations within 
each library. For some private libraries we use static libraries however, 
because we don't need binary compatibility. But though we don't need BC 
between Qt versions, the libraries should still be linkable with different gcc 
versions (and with different compilers). However, when LTO is enabled the 
static libraries will contain definitions that depend on a single gcc version, 
making them unsuitable for distribution.

One solution is to enable fat LTO object files for static libraries, but that 
is both a waste of space and compile time, and disables any LTO optimization 
within the library. Ideally I would like the static library to do LTO 
optimizations internally, just like a shared library, but then be exported as 
a static library.

I suspect this is more of a gcc task than an ar/ld task, as it basically boils 
down to gcc doing for a static library what it does for a shared library, but 
perhaps exporting the result as a single combined .o file that can then be 
ar'ed into a compatible static library.

Is this possible?

Best regards
'Allan Jensen


Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Allan Sandfeld Jensen
On Wednesday, 13 September 2017 15:46:09 CEST Jakub Jelinek wrote:
> On Wed, Sep 13, 2017 at 03:41:19PM +0200, Richard Biener wrote:
> > On its own -O3 doesn't add much (some loop opts and slightly more
> > aggressive inlining/unrolling), so whatever it does we
> > should consider doing at -O2 eventually.
> 
> Well, -O3 adds vectorization, which we don't enable at -O2 by default.
> 
Would it be possible to enable basic-block vectorization at -O2? I assume that 
doesn't increase binary size, since it doesn't unroll loops.

'Allan



Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Allan Sandfeld Jensen
On Tuesday, 12 September 2017 23:27:22 CEST Michael Clark wrote:
> > On 13 Sep 2017, at 1:57 AM, Wilco Dijkstra  wrote:
> > 
> > Hi all,
> > 
> > At the GNU Cauldron I was inspired by several interesting talks about
> > improving GCC in various ways. While GCC has many great optimizations, a
> > common theme is that its default settings are rather conservative. As a
> > result users are required to enable several additional optimizations by
> > hand to get good code. Other compilers enable more optimizations at -O2
> > (loop unrolling in LLVM was mentioned repeatedly) which GCC could/should
> > do as well.
> 
> There are some nuances to -O2. Please consider -O2 users who wish to use it
> like Clang/LLVM’s -Os (-O2 without loop vectorisation IIRC).
> 
> Clang/LLVM has an -Os that is like -O2 so adding optimisations that increase
> code size can be skipped from -Os without drastically effecting
> performance.
> 
> This is not the case with GCC, where -Os is a size-at-all-costs optimisation
> mode. GCC users' option for size, but not at the expense of speed, is -O2.
> 
> Clang GCC
> -Oz   ~=  -Os
> -Os   ~=  -O2
> 
No. Clang's -Os is somewhat limited compared to gcc's, just like clang's -Og 
is just -O1. AFAIK -Oz is a proprietary Apple clang parameter, and not in 
clang proper.

'Allan


Re: Quantitative analysis of -Os vs -O3

2017-08-26 Thread Allan Sandfeld Jensen
On Saturday, 26 August 2017 12:59:06 CEST Markus Trippelsdorf wrote:
> On 2017.08.26 at 12:40 +0200, Allan Sandfeld Jensen wrote:
> > On Saturday, 26 August 2017 10:56:16 CEST Markus Trippelsdorf wrote:
> > > On 2017.08.26 at 01:39 -0700, Andrew Pinski wrote:
> > > > First let me put into some perspective on -Os usage and some history:
> > > > 1) -Os is not useful for non-embedded users
> > > > 2) the embedded folks really need the smallest code possible and
> > > > usually will be willing to afford the performance hit
> > > > 3) -Os was a mistake for Apple to use in the first place; they used it
> > > > and then GCC got better for PowerPC to use the string instructions
> > > > which is why -Oz was added :)
> > > > 4) -Os is used heavily by the arm/thumb2 folks in bare metal
> > > > applications.
> > > > 
> > > > Comparing -O3 to -Os is not totally fair on x86 due to the many
> > > > different instructions and encodings.
> > > > Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
> > > > big issue.
> > > > I soon have a need to keep overall (bare-metal) application size down
> > > > to just 256k.
> > > > Micro-controllers are places where -Os matters the most.
> > > > 
> > > > This comment does not help my application usage.  It rather hurts it
> > > > and goes against what -Os is really about.  It is not about reducing
> > > > icache pressure but overall application code size.  I really need the
> > > > code to fit into a specific size.
> > > 
> > > For many applications using -flto does reduce code size more than just
> > > going from -O2 to -Os.
> > 
> > I added the option to optimize with -Os in Qt, and it gives an average 15%
> > reduction in binary size, somtimes as high as 25%. Using lto gives almost
> > the same (slightly less), but the two options combine perfectly and using
> > both can reduce binary size from 20 to 40%. And that is on a shared
> > library, not even a statically linked binary.
> > 
> > Only real minus is that some of the libraries especially QtGui would
> > benefit from a auto-vectorization, so it would be nice if there existed
> > an -O3s version which vectorized the most obvious vectorizable functions,
> > a few hundred bytes for an additional version here and there would do
> > good. Fortunately it doesn't too much damage as we have manually
> > vectorized routines for to have good performance also on MSVC, if we
> > relied more on auto- vectorization it would be worse.
> 
> In that case using profile guided optimizations will help. It will
> optimize cold functions with -Os and hot functions with -O3 (when using
> e.g.: "-flto -O3 -fprofile-use"). Of course you will have to compile
> twice and also collect training data from your library in between.

Yeah. That is just more problematic in practice, though I do believe we have 
support for it. It is good to know it will automatically upgrade optimizations 
like that. I just wish there was a way to distribute pre-generated, 
architecture-independent training data.

`Allan 



Re: Quantitative analysis of -Os vs -O3

2017-08-26 Thread Allan Sandfeld Jensen
On Saturday, 26 August 2017 10:56:16 CEST Markus Trippelsdorf wrote:
> On 2017.08.26 at 01:39 -0700, Andrew Pinski wrote:
> > First let me put into some perspective on -Os usage and some history:
> > 1) -Os is not useful for non-embedded users
> > 2) the embedded folks really need the smallest code possible and
> > usually will be willing to afford the performance hit
> > 3) -Os was a mistake for Apple to use in the first place; they used it
> > and then GCC got better for PowerPC to use the string instructions
> > which is why -Oz was added :)
> > 4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.
> > 
> > Comparing -O3 to -Os is not totally fair on x86 due to the many
> > different instructions and encodings.
> > Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
> > big issue.
> > I soon have a need to keep overall (bare-metal) application size down
> > to just 256k.
> > Micro-controllers are places where -Os matters the most.
> > 
> > This comment does not help my application usage.  It rather hurts it
> > and goes against what -Os is really about.  It is not about reducing
> > icache pressure but overall application code size.  I really need the
> > code to fit into a specific size.
> 
> For many applications using -flto does reduce code size more than just
> going from -O2 to -Os.

I added the option to optimize with -Os in Qt, and it gives an average 15% 
reduction in binary size, sometimes as high as 25%. Using LTO gives almost the 
same (slightly less), but the two options combine perfectly, and using both 
can reduce binary size by 20 to 40%. And that is on a shared library, not even 
a statically linked binary.

The only real minus is that some of the libraries, especially QtGui, would 
benefit from auto-vectorization, so it would be nice if there existed an -O3s 
version which vectorized only the most obviously vectorizable functions; a few 
hundred bytes for an additional version here and there would do good. 
Fortunately it doesn't do too much damage, as we have manually vectorized 
routines to get good performance also on MSVC; if we relied more on 
auto-vectorization it would be worse.

`Allan



Re: [RFC] GCC 8 Project proposal: Extensions supporting C Metaprogramming, pseudo-templates

2017-05-09 Thread Allan Sandfeld Jensen
On Tuesday 09 May 2017, Daniel Santos wrote:
> The primary aim is to facilitate high-performance generic C
> libraries for software where C++ is not suitable, but the cost of
> run-time abstraction is unacceptable. A good example is the Linux
> kernel, where the source tree is littered with more than 100 hand-coded
> or boiler-plate (copy, paste and edit) search cores required to use the
> red-black tree library.
That is not a good excuse; they could just use a defined subset of C++. The 
cost of C++ abstractions is zero if you don't use them.

> 
> To further the usefulness of such techniques, I propose the addition of
> a c-family attribute to declare a parameter, variable (and possibly
> other declarations) as "constprop" or some similar word. The purpose of
> the attribute is to:
> 
> 1.) Emit a warning or error when the value is not optimized away, and
> 2.) Direct various optimization passes to prefer (or force) either
> cloning or inlining of a function with such a parameter.
> 
This I can get more behind; I have wanted a constexpr attribute for C for some 
time. If for nothing else, then to be able to use it in shared/system headers 
that are consumed by both C and C++, where on the C++ side it would be usable 
in constexpr expressions. If you can find a use for it in pure C as well, so 
much the better.

`Allan


Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts

2017-05-02 Thread Allan Sandfeld Jensen
On Tuesday 02 May 2017, Jakub Jelinek wrote:
> On Mon, Apr 24, 2017 at 03:15:11PM +0200, Allan Sandfeld Jensen wrote:
> > Okay, I have tried that, and I also made it more obvious how the
> > intrinsics can become non-immediate shift.
> > 
> > 
> > diff --git a/gcc/ChangeLog b/gcc/ChangeLog
> > index b58f5050db0..b9406550fc5 100644
> > --- a/gcc/ChangeLog
> > +++ b/gcc/ChangeLog
> > @@ -1,3 +1,10 @@
> > +2017-04-22  Allan Sandfeld Jensen  <sandf...@kde.org>
> > +
> > +   * config/i386/emmintrin.h (_mm_slli_*, _mm_srli_*):
> > +   Use vector intrinstics instead of builtins.
> > +   * config/i386/avx2intrin.h (_mm256_slli_*, _mm256_srli_*):
> > +   Use vector intrinstics instead of builtins.
> > +
> > 
> >  2017-04-21  Uros Bizjak  <ubiz...@gmail.com>
> >  
> > * config/i386/i386.md (*extzvqi_mem_rex64): Move above *extzv.
> > 
> > diff --git a/gcc/config/i386/avx2intrin.h b/gcc/config/i386/avx2intrin.h
> > index 82f170a3d61..64ba52b244e 100644
> > --- a/gcc/config/i386/avx2intrin.h
> > +++ b/gcc/config/i386/avx2intrin.h
> > @@ -665,13 +665,6 @@ _mm256_slli_si256 (__m256i __A, const int __N)
> > 
> >  extern __inline __m256i
> >  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> > 
> > -_mm256_slli_epi16 (__m256i __A, int __B)
> > -{
> > -  return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)__A, __B);
> > -}
> > -
> > -extern __inline __m256i
> > -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> > 
> >  _mm256_sll_epi16 (__m256i __A, __m128i __B)
> >  {
> >  
> >return (__m256i)__builtin_ia32_psllw256((__v16hi)__A, (__v8hi)__B);
> > 
> > @@ -679,9 +672,11 @@ _mm256_sll_epi16 (__m256i __A, __m128i __B)
> > 
> >  extern __inline __m256i
> >  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> > 
> > -_mm256_slli_epi32 (__m256i __A, int __B)
> > +_mm256_slli_epi16 (__m256i __A, int __B)
> > 
> >  {
> > 
> > -  return (__m256i)__builtin_ia32_pslldi256 ((__v8si)__A, __B);
> > +  if (__builtin_constant_p(__B))
> > +return ((unsigned int)__B < 16) ? (__m256i)((__v16hi)__A << __B) :
> > _mm256_setzero_si256(); +  return _mm256_sll_epi16(__A,
> > _mm_cvtsi32_si128(__B));
> > 
> >  }
> 
> The formatting is wrong, missing spaces before function names and opening
> (, too long lines.  Also, you've removed some builtin uses like
> __builtin_ia32_psllwi256 above, but haven't removed those builtins from the
> compiler (unlike the intrinsics, the builtins are not supported and can be
> removed).  But I guess the primary question is on Uros, do we
> want to handle this in the *intrin.h headers and thus increase the size
> of those (and their parsing time etc.), or do we want to handle this
> in the target folders (tree as well as gimple one), where we'd convert
> e.g. __builtin_ia32_psllwi256 to the shift if the shift count is constant.
> 
Ok. I will await what you decide.

Btw. I thought of an alternative idea: make a new set of built-ins, called for 
instance __builtin_lshift and __builtin_rshift, that translate simply to 
GIMPLE shifts, just like cpp_shifts currently do, the only difference being 
that the new shifts (unlike C/C++ shifts) are defined for all shift sizes and 
on negative values.  With this, the variable-shift intrinsics could also be 
written without builtins. Though doing this would mean making a whole set of 
them for all integer types, and it would need to be implemented in the 
c-parser like __builtin_shuffle, not with the other generic builtins.

Would that make sense?

Best regards
`Allan



Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts

2017-04-24 Thread Allan Sandfeld Jensen
On Monday 24 April 2017, Jakub Jelinek wrote:
> On Mon, Apr 24, 2017 at 09:33:09AM +0200, Allan Sandfeld Jensen wrote:
> > --- a/gcc/config/i386/avx2intrin.h
> > +++ b/gcc/config/i386/avx2intrin.h
> > @@ -667,7 +667,7 @@ extern __inline __m256i
> > 
> >  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> >  _mm256_slli_epi16 (__m256i __A, int __B)
> >  {
> > 
> > -  return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)__A, __B);
> > +  return ((__B & 0xff) < 16) ? (__m256i)((__v16hi)__A << (__B & 0xff)) :
> > _mm256_setzero_si256();
> > 
> >  }
> 
> What is the advantage of doing that when you replace one operation with
> several (&, <, ?:, <<)?
> I'd say instead we should fold the builtins if in the gimple fold target
> hook we see the shift count constant and can decide based on that.
> Or we could use __builtin_constant_p (__B) to decide whether to use
> the generic vector shifts or builtin, but that means larger IL.
> 
Okay, I have tried that, and I also made it more obvious how the intrinsics 
can become non-immediate shifts.

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index b58f5050db0..b9406550fc5 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2017-04-22  Allan Sandfeld Jensen  <sandf...@kde.org>
+
+	* config/i386/emmintrin.h (_mm_slli_*, _mm_srli_*):
+	Use vector intrinsics instead of builtins.
+	* config/i386/avx2intrin.h (_mm256_slli_*, _mm256_srli_*):
+	Use vector intrinsics instead of builtins.
+
 2017-04-21  Uros Bizjak  <ubiz...@gmail.com>
 
 	* config/i386/i386.md (*extzvqi_mem_rex64): Move above *extzv.
diff --git a/gcc/config/i386/avx2intrin.h b/gcc/config/i386/avx2intrin.h
index 82f170a3d61..64ba52b244e 100644
--- a/gcc/config/i386/avx2intrin.h
+++ b/gcc/config/i386/avx2intrin.h
@@ -665,13 +665,6 @@ _mm256_slli_si256 (__m256i __A, const int __N)
 
 extern __inline __m256i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm256_slli_epi16 (__m256i __A, int __B)
-{
-  return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)__A, __B);
-}
-
-extern __inline __m256i
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm256_sll_epi16 (__m256i __A, __m128i __B)
 {
   return (__m256i)__builtin_ia32_psllw256((__v16hi)__A, (__v8hi)__B);
@@ -679,9 +672,11 @@ _mm256_sll_epi16 (__m256i __A, __m128i __B)
 
 extern __inline __m256i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm256_slli_epi32 (__m256i __A, int __B)
+_mm256_slli_epi16 (__m256i __A, int __B)
 {
-  return (__m256i)__builtin_ia32_pslldi256 ((__v8si)__A, __B);
+  if (__builtin_constant_p(__B))
+return ((unsigned int)__B < 16) ? (__m256i)((__v16hi)__A << __B) : _mm256_setzero_si256();
+  return _mm256_sll_epi16(__A, _mm_cvtsi32_si128(__B));
 }
 
 extern __inline __m256i
@@ -693,9 +688,11 @@ _mm256_sll_epi32 (__m256i __A, __m128i __B)
 
 extern __inline __m256i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm256_slli_epi64 (__m256i __A, int __B)
+_mm256_slli_epi32 (__m256i __A, int __B)
 {
-  return (__m256i)__builtin_ia32_psllqi256 ((__v4di)__A, __B);
+  if (__builtin_constant_p(__B))
+return ((unsigned int)__B < 32) ? (__m256i)((__v8si)__A << __B) : _mm256_setzero_si256();
+  return _mm256_sll_epi32(__A, _mm_cvtsi32_si128(__B));
 }
 
 extern __inline __m256i
@@ -707,6 +704,15 @@ _mm256_sll_epi64 (__m256i __A, __m128i __B)
 
 extern __inline __m256i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_slli_epi64 (__m256i __A, int __B)
+{
+  if (__builtin_constant_p(__B))
+return ((unsigned int)__B < 64) ? (__m256i)((__v4di)__A << __B) : _mm256_setzero_si256();
+  return _mm256_sll_epi64(__A, _mm_cvtsi32_si128(__B));
+}
+
+extern __inline __m256i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm256_srai_epi16 (__m256i __A, int __B)
 {
   return (__m256i)__builtin_ia32_psrawi256 ((__v16hi)__A, __B);
@@ -756,13 +762,6 @@ _mm256_srli_si256 (__m256i __A, const int __N)
 
 extern __inline __m256i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm256_srli_epi16 (__m256i __A, int __B)
-{
-  return (__m256i)__builtin_ia32_psrlwi256 ((__v16hi)__A, __B);
-}
-
-extern __inline __m256i
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm256_srl_epi16 (__m256i __A, __m128i __B)
 {
   return (__m256i)__builtin_ia32_psrlw256((__v16hi)__A, (__v8hi)__B);
@@ -770,9 +769,11 @@ _mm256_srl_epi16 (__m256i __A, __m128i __B)
 
 extern __inline __m256i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm256_srli_epi32 (__m256i __A, int __B)
+_mm256_srli_epi16 (__m256i __A, int __B)
 {
-  return (__m256i)__builtin_ia32_psrldi256 ((__v8si)__A, __B);
+  if (__builtin_constant_p(__B))
+return ((unsigned int)__B < 16) ? (__m256i)((__v16hu)__A >>

Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts

2017-04-24 Thread Allan Sandfeld Jensen
On Monday 24 April 2017, Jakub Jelinek wrote:
> On Mon, Apr 24, 2017 at 11:01:29AM +0200, Allan Sandfeld Jensen wrote:
> > On Monday 24 April 2017, Jakub Jelinek wrote:
> > > On Mon, Apr 24, 2017 at 10:34:58AM +0200, Allan Sandfeld Jensen wrote:
> > > > That is a different instruction. That is the vpsllw not vpsllwi
> > > > 
> > > > The intrinsics I changed is the immediate version, I didn't change
> > > > the non- immediate version. It is probably a bug if you can give
> > > > non-immediate values to the immediate only intrinsic. At least both
> > > > versions handles it, if in different ways, but is is illegal
> > > > arguments.
> > > 
> > > The documentation is unclear on that and I've only recently fixed up
> > > some cases where these intrinsics weren't able to handle non-constant
> > > arguments in some cases, while both ICC and clang coped with that
> > > fine.
> > > So it is clearly allowed and handled by all the compilers and needs to
> > > be supported, people use that in real-world code.
> > 
> > Undoubtedly it happens. I just make a mistake myself that created that
> > case. But it is rather unfortunate, and means we make wrong code
> > currently for corner case values.
> 
> The intrinsic documentation is poor, usually you have a good documentation
> on what the instructions do, and then you just have to guess what the
> intrinsics do.  You can of course ask Intel for clarification.
> 
> If you try:
> #include 
> 
> __m128i
> foo (__m128i a, int b)
> {
>   return _mm_slli_epi16 (a, b);
> }
> and call it with 257 from somewhere else, you can see that all the
> compilers will give you zero vector.  And similarly if you use 257
> literally instead of b.  So what the intrinsic (unlike the instruction)
> actually does is that it compares all bits of the imm8 argument (supposedly
> using unsigned comparison) and if it is bigger than 15 (or 7 or 31 or 63
> depending on the bitsize of element) it yields 0 vector.
> 
Good point. I was using Intel's documentation at 
https://software.intel.com/sites/landingpage/IntrinsicsGuide/, but if all 
compilers, including us, do something else, practicality wins.

It did make me curious, so I tested what _mm_slli_epi16(v, -250) compiles to. 
For some reason that becomes an undefined shift using the non-immediate sll in 
gcc, but returns the zero vector in clang. With my patch it was a 6-bit shift 
(-250 & 0xff == 6), but that is apparently not the de-facto standard.


`Allan


Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts

2017-04-24 Thread Allan Sandfeld Jensen
On Monday 24 April 2017, Jakub Jelinek wrote:
> On Mon, Apr 24, 2017 at 10:34:58AM +0200, Allan Sandfeld Jensen wrote:
> > That is a different instruction. That is the vpsllw not vpsllwi
> > 
> > The intrinsics I changed is the immediate version, I didn't change the
> > non- immediate version. It is probably a bug if you can give
> > non-immediate values to the immediate only intrinsic. At least both
> > versions handles it, if in different ways, but is is illegal arguments.
> 
> The documentation is unclear on that and I've only recently fixed up some
> cases where these intrinsics weren't able to handle non-constant arguments
> in some cases, while both ICC and clang coped with that fine.
> So it is clearly allowed and handled by all the compilers and needs to be
> supported, people use that in real-world code.
> 
Undoubtedly it happens. I just made that mistake myself, which created that 
case. But it is rather unfortunate, and means we currently generate wrong code 
for corner-case values.

Note the difference in definition between the two intrinsics:
_mm_sll_epi16:
FOR j := 0 to 7
  i := j*16
  IF count[63:0] > 15
    dst[i+15:i] := 0
  ELSE
    dst[i+15:i] := ZeroExtend(a[i+15:i] << count[63:0])
  FI
ENDFOR

_mm_slli_epi16:
FOR j := 0 to 7
  i := j*16
  IF imm8[7:0] > 15
    dst[i+15:i] := 0
  ELSE
    dst[i+15:i] := ZeroExtend(a[i+15:i] << imm8[7:0])
  FI
ENDFOR

For a value such as 257, the immediate version does a 1-bit shift, while the 
non-immediate version returns a zero vector. A simple function using the 
immediate intrinsic has to gain an if-statement if transformed to use the 
non-immediate instruction.

`Allan


Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts

2017-04-24 Thread Allan Sandfeld Jensen
On Monday 24 April 2017, Allan Sandfeld Jensen wrote:
> On Monday 24 April 2017, Jakub Jelinek wrote:
> > On Mon, Apr 24, 2017 at 10:02:40AM +0200, Allan Sandfeld Jensen wrote:
> > > > That said, both the options I've mentioned above provide the same
> > > > advantages and don't have the disadvantages of pessimizing normal
> > > > code.
> > > 
> > > What pessimizing? This produce the same or better code for all legal
> > > arguments. The only difference besides better generated code is that it
> > > allows
> > 
> > No.  Have you really tried that?
> > 
> > > the intrinsics to be used incorrectly with non-literal arguments
> > > because we lack the C-extension for constexp to prevent that.
> > 
> > Consider e.g. -O2 -mavx2 -mtune=intel:
> > #include 
> > 
> > __m256i
> > foo (__m256i x, int s)
> > {
> > 
> >   return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)x, s);
> > 
> > }
> > 
> > __m256i
> > bar (__m256i x, int s)
> > {
> > 
> >   return ((s & 0xff) < 16) ? (__m256i)((__v16hi)x << (s & 0xff)) :
> > _mm256_setzero_si256 (); }
> > 
> > The first one generates
> > 
> > movl%edi, %edi
> > vmovq   %rdi, %xmm1
> > vpsllw  %xmm1, %ymm0, %ymm0
> > ret
> > 
> > (because that is actually what the instruction does), the second one
> 
> That is a different instruction. That is the vpsllw not vpsllwi
> 
> The intrinsics I changed is the immediate version, I didn't change the non-
> immediate version. It is probably a bug if you can give non-immediate
> values to the immediate only intrinsic. At least both versions handles it,
> if in different ways, but is is illegal arguments.
> 
Though now that I think about it, this means my change to the existing 
sse-psslw-1.c test and friends is wrong, because it uses variable input.

`Allan


Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts

2017-04-24 Thread Allan Sandfeld Jensen
On Monday 24 April 2017, Jakub Jelinek wrote:
> On Mon, Apr 24, 2017 at 10:02:40AM +0200, Allan Sandfeld Jensen wrote:
> > > That said, both the options I've mentioned above provide the same
> > > advantages and don't have the disadvantages of pessimizing normal code.
> > 
> > What pessimizing? This produce the same or better code for all legal
> > arguments. The only difference besides better generated code is that it
> > allows
> 
> No.  Have you really tried that?
> 
> > the intrinsics to be used incorrectly with non-literal arguments because
> > we lack the C-extension for constexp to prevent that.
> 
> Consider e.g. -O2 -mavx2 -mtune=intel:
> #include 
> 
> __m256i
> foo (__m256i x, int s)
> {
>   return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)x, s);
> }
> 
> __m256i
> bar (__m256i x, int s)
> {
>   return ((s & 0xff) < 16) ? (__m256i)((__v16hi)x << (s & 0xff)) :
> _mm256_setzero_si256 (); }
> 
> The first one generates
> movl%edi, %edi
> vmovq   %rdi, %xmm1
> vpsllw  %xmm1, %ymm0, %ymm0
> ret
> (because that is actually what the instruction does), the second one
That is a different instruction. That is vpsllw, not vpsllwi.

The intrinsics I changed are the immediate versions; I didn't change the 
non-immediate versions. It is probably a bug if you can give non-immediate 
values to the immediate-only intrinsic. At least both versions handle it, if 
in different ways, but those are illegal arguments.

`Allan


Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts

2017-04-24 Thread Allan Sandfeld Jensen
On Monday 24 April 2017, Jakub Jelinek wrote:
> On Mon, Apr 24, 2017 at 09:51:29AM +0200, Allan Sandfeld Jensen wrote:
> > On Monday 24 April 2017, Jakub Jelinek wrote:
> > > On Mon, Apr 24, 2017 at 09:33:09AM +0200, Allan Sandfeld Jensen wrote:
> > > > --- a/gcc/config/i386/avx2intrin.h
> > > > +++ b/gcc/config/i386/avx2intrin.h
> > > > @@ -667,7 +667,7 @@ extern __inline __m256i
> > > > 
> > > >  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> > > >  _mm256_slli_epi16 (__m256i __A, int __B)
> > > >  {
> > > > 
> > > > -  return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)__A, __B);
> > > > +  return ((__B & 0xff) < 16) ? (__m256i)((__v16hi)__A << (__B &
> > > > 0xff)) : _mm256_setzero_si256();
> > > > 
> > > >  }
> > > 
> > > What is the advantage of doing that when you replace one operation with
> > > several (&, <, ?:, <<)?
> > > I'd say instead we should fold the builtins if in the gimple fold
> > > target hook we see the shift count constant and can decide based on
> > > that. Or we could use __builtin_constant_p (__B) to decide whether to
> > > use the generic vector shifts or builtin, but that means larger IL.
> > 
> > The advantage is that in this builtin, the __B is always a literal (or
> > constexpr), so the if statement is resolved at compile time.
> 
> Do we really want to support all the thousands _mm* intrinsics in constexpr
> contexts?  People can just use generic vectors instead.
> 
I would love to support it, but first we need a C extension attribute matching 
constexpr, and I consider that a separate issue.

> That said, both the options I've mentioned above provide the same
> advantages and don't have the disadvantages of pessimizing normal code.
> 
What pessimizing? This produces the same or better code for all legal 
arguments. The only difference besides better generated code is that it allows 
the intrinsics to be used incorrectly with non-literal arguments, because we 
lack the C extension for constexpr to prevent that.

`Allan


Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts

2017-04-24 Thread Allan Sandfeld Jensen
On Monday 24 April 2017, Jakub Jelinek wrote:
> On Mon, Apr 24, 2017 at 09:33:09AM +0200, Allan Sandfeld Jensen wrote:
> > --- a/gcc/config/i386/avx2intrin.h
> > +++ b/gcc/config/i386/avx2intrin.h
> > @@ -667,7 +667,7 @@ extern __inline __m256i
> > 
> >  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> >  _mm256_slli_epi16 (__m256i __A, int __B)
> >  {
> > 
> > -  return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)__A, __B);
> > +  return ((__B & 0xff) < 16) ? (__m256i)((__v16hi)__A << (__B & 0xff)) :
> > _mm256_setzero_si256();
> > 
> >  }
> 
> What is the advantage of doing that when you replace one operation with
> several (&, <, ?:, <<)?
> I'd say instead we should fold the builtins if in the gimple fold target
> hook we see the shift count constant and can decide based on that.
> Or we could use __builtin_constant_p (__B) to decide whether to use
> the generic vector shifts or builtin, but that means larger IL.

The advantage is that in this builtin, __B is always a literal (or constexpr), 
so the if-statement is resolved at compile time.

`Allan


Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts

2017-04-24 Thread Allan Sandfeld Jensen
On Saturday 22 April 2017, Allan Sandfeld Jensen wrote:
> Replaces definitions of immediate logical shift intrinsics with GCC
> extension syntax. Tests are added to ensure the intrinsics still produce
> the right instructions and that a few basic optimizations now work.
> 
> Compared to the earlier version of the patch, all potentially undefined
> shifts are now avoided, which also means no variable shifts or arithmetic
> right shifts.

Fixed 2 errors in the tests.
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index b58f5050db0..b9406550fc5 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2017-04-22  Allan Sandfeld Jensen  <sandf...@kde.org>
+
+	* config/i386/emmintrin.h (_mm_slli_*, _mm_srli_*):
+	Use vector intrinsics instead of builtins.
+	* config/i386/avx2intrin.h (_mm256_slli_*, _mm256_srli_*):
+	Use vector intrinsics instead of builtins.
+
 2017-04-21  Uros Bizjak  <ubiz...@gmail.com>
 
 	* config/i386/i386.md (*extzvqi_mem_rex64): Move above *extzv.
diff --git a/gcc/config/i386/avx2intrin.h b/gcc/config/i386/avx2intrin.h
index 82f170a3d61..acb49734131 100644
--- a/gcc/config/i386/avx2intrin.h
+++ b/gcc/config/i386/avx2intrin.h
@@ -667,7 +667,7 @@ extern __inline __m256i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm256_slli_epi16 (__m256i __A, int __B)
 {
-  return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)__A, __B);
+  return ((__B & 0xff) < 16) ? (__m256i)((__v16hi)__A << (__B & 0xff)) : _mm256_setzero_si256();
 }
 
 extern __inline __m256i
@@ -681,7 +681,7 @@ extern __inline __m256i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm256_slli_epi32 (__m256i __A, int __B)
 {
-  return (__m256i)__builtin_ia32_pslldi256 ((__v8si)__A, __B);
+  return ((__B & 0xff) < 32) ? (__m256i)((__v8si)__A << (__B & 0xff)) : _mm256_setzero_si256();
 }
 
 extern __inline __m256i
@@ -695,7 +695,7 @@ extern __inline __m256i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm256_slli_epi64 (__m256i __A, int __B)
 {
-  return (__m256i)__builtin_ia32_psllqi256 ((__v4di)__A, __B);
+  return ((__B & 0xff) < 64) ? (__m256i)((__v4di)__A << (__B & 0xff)) : _mm256_setzero_si256();
 }
 
 extern __inline __m256i
@@ -758,7 +758,7 @@ extern __inline __m256i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm256_srli_epi16 (__m256i __A, int __B)
 {
-  return (__m256i)__builtin_ia32_psrlwi256 ((__v16hi)__A, __B);
+  return ((__B & 0xff) < 16) ? (__m256i) ((__v16hu)__A >> (__B & 0xff)) : _mm256_setzero_si256();
 }
 
 extern __inline __m256i
@@ -772,7 +772,7 @@ extern __inline __m256i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm256_srli_epi32 (__m256i __A, int __B)
 {
-  return (__m256i)__builtin_ia32_psrldi256 ((__v8si)__A, __B);
+  return ((__B & 0xff) < 32) ? (__m256i) ((__v8su)__A >> (__B & 0xff)) : _mm256_setzero_si256();
 }
 
 extern __inline __m256i
@@ -786,7 +786,7 @@ extern __inline __m256i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm256_srli_epi64 (__m256i __A, int __B)
 {
-  return (__m256i)__builtin_ia32_psrlqi256 ((__v4di)__A, __B);
+  return ((__B & 0xff) < 64) ? (__m256i) ((__v4du)__A >> (__B & 0xff)) : _mm256_setzero_si256();
 }
 
 extern __inline __m256i
diff --git a/gcc/config/i386/emmintrin.h b/gcc/config/i386/emmintrin.h
index 828f4a07a9b..5c048d9fd0d 100644
--- a/gcc/config/i386/emmintrin.h
+++ b/gcc/config/i386/emmintrin.h
@@ -1140,19 +1140,19 @@ _mm_mul_epu32 (__m128i __A, __m128i __B)
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_slli_epi16 (__m128i __A, int __B)
 {
-  return (__m128i)__builtin_ia32_psllwi128 ((__v8hi)__A, __B);
+  return ((__B & 0xff) < 16) ? (__m128i)((__v8hi)__A << (__B & 0xff)) : _mm_setzero_si128();
 }
 
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_slli_epi32 (__m128i __A, int __B)
 {
-  return (__m128i)__builtin_ia32_pslldi128 ((__v4si)__A, __B);
+  return ((__B & 0xff) < 32) ? (__m128i)((__v4si)__A << (__B & 0xff)) : _mm_setzero_si128();
 }
 
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_slli_epi64 (__m128i __A, int __B)
 {
-  return (__m128i)__builtin_ia32_psllqi128 ((__v2di)__A, __B);
+  return ((__B & 0xff) < 64) ? (__m128i)((__v2di)__A << (__B & 0xff)) : _mm_setzero_si128();
 }
 
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
@@ -1205,19 +1205,19 @@ _mm_slli_si128 (__m128i __A, const int __N)
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_srli_epi16 (__m128i __A, int __B)
 {
-  return (__m128i)__builtin_ia32_psrlwi128 ((__v8hi)__A, __B);
+  return ((__B & 0xff) < 16) ? (__m128i) ((__v8hu)__A >> (__B & 0xff)) : _mm_setzero_si128();

Re: C++17 by default in gcc-8

2017-03-26 Thread Allan Sandfeld Jensen
On Sunday 26 March 2017, Egor Pugin wrote:
> Hi,
> 
> It's a bit early question, but still.
> C++ releases became more predictive and regular.
> What do you think about settings -std=c++17 (or gnu++17) for gcc-8 as
> default c++ mode?
> What is your policy regarding default c++ standards in gcc?

It would make sense, since GCC 8 would be the second release with full C++17 
support, so the support could be expected to be mature by then. I am a little 
sceptical personally though, because I recently got burned by a C++17 change 
where a number of uses of functor classes broke, because a pointer to a 
function with noexcept is now a separate type. So code that worked in C++14 is 
now ill-formed in C++17 and requires doubling the number of specialized 
functor templates, and it still isn't fully source compatible with C++14 
(though all nice clean code will be compatible)

Best regards
Allan


Re: [Patch][i386] PR 70118: Fix ubsan warning on SSE2 loadl_epi64 and storel_epi64

2016-12-09 Thread Allan Sandfeld Jensen
On Tuesday 06 December 2016, Uros Bizjak wrote:
> Hello!
> 
> > 2016-11-30  Allan Sandfeld Jensen  <allan.jen...@qt.io>
> > 
> >PR target/70118
> >* gcc/config/i386/mmintrin.h (__m64_u): New type
> >* gcc/config/i386/emmintrin.h (_mm_loadl_epi64, _mm_storel_epi64):
> >Make the allowed unaligned memory access explicit.
> 
> OK.
> 
Thanks

Note I don't have write access.

Best regards
Allan



Re: [Patch][i386] PR 70118: Fix ubsan warning on SSE2 loadl_epi64 and storel_epi64

2016-12-05 Thread Allan Sandfeld Jensen
Trying again, this time with changelog.

gcc/

2016-11-30  Allan Sandfeld Jensen  <allan.jen...@qt.io>

PR target/70118
* gcc/config/i386/mmintrin.h (__m64_u): New type
* gcc/config/i386/emmintrin.h (_mm_loadl_epi64, _mm_storel_epi64):
   Make the allowed unaligned memory access explicit.
Index: gcc/config/i386/emmintrin.h
===
--- gcc/config/i386/emmintrin.h	(revision 242892)
+++ gcc/config/i386/emmintrin.h	(working copy)
@@ -703,9 +703,9 @@
 }
 
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
-_mm_loadl_epi64 (__m128i const *__P)
+_mm_loadl_epi64 (__m128i_u const *__P)
 {
-  return _mm_set_epi64 ((__m64)0LL, *(__m64 *)__P);
+  return _mm_set_epi64 ((__m64)0LL, *(__m64_u *)__P);
 }
 
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
@@ -721,9 +721,9 @@
 }
 
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
-_mm_storel_epi64 (__m128i *__P, __m128i __B)
+_mm_storel_epi64 (__m128i_u *__P, __m128i __B)
 {
-  *(long long *)__P = ((__v2di)__B)[0];
+  *(__m64_u *)__P = (__m64) ((__v2di)__B)[0];
 }
 
 extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
Index: gcc/config/i386/mmintrin.h
===
--- gcc/config/i386/mmintrin.h	(revision 242892)
+++ gcc/config/i386/mmintrin.h	(working copy)
@@ -37,6 +37,9 @@
vector types, and their scalar components.  */
 typedef int __m64 __attribute__ ((__vector_size__ (8), __may_alias__));
 
+/* Unaligned version of the same type  */
+typedef int __m64_u __attribute__ ((__vector_size__ (8), __may_alias__, __aligned__ (1)));
+
 /* Internal data types for implementing the intrinsics.  */
 typedef int __v2si __attribute__ ((__vector_size__ (8)));
 typedef short __v4hi __attribute__ ((__vector_size__ (8)));


Re: [PING][RFC] Assertions as optimization hints

2016-11-27 Thread Allan Sandfeld Jensen
On Wednesday 23 November 2016, Richard Biener wrote:
> On Tue, Nov 22, 2016 at 6:52 PM, Yuri Gribov  wrote:
> > Hi all,
> > 
> > I've recently revisited an ancient patch from Paolo
> > (https://gcc.gnu.org/ml/gcc-patches/2004-04/msg00551.html) which uses
> > asserts as optimization hints. I've rewritten the patch to be more
> > stable under expressions with side-effects and did some basic
> > investigation of it's efficacy.
> > 
> > Optimization is hidden under !defined NDEBUG && defined
> > __ASSUME_ASSERTS__. !NDEBUG-part is necessary because assertions often
> > rely on special !NDEBUG-protected support code outside of assert
> > (dedicated fields in structures and similar stuff, collectively called
> > "ghost variables"). __ASSUME_ASSERTS__ gives user a choice whether to
> > enable optimization or not (should probably be hidden under a friendly
> > compiler switch e.g. -fassume-asserts).
> > 
> > I do not have access to a good machine for speed benchmarks so I only
> > looked at size improvements in few popular projects. There are no
> > revolutionary changes (0.1%-1%) but some functions see good reductions
> > which may result in noticeable runtime improvements in practice. One
> > good  example is MariaDB where you frequently find the following
> > 
> > pattern:
> >   struct A {
> >   
> > virtual void foo() { assert(0); }
> >   
> >   };
> >   ...
> >   A *a;
> >   a->foo();
> > 
> > Here the patch will prevent GCC from inlining A::foo (as it'll figure
> > out that it's impossible to occur at runtime) thus saving code size.
> > 
> > Does this approach make sense in general? If it does I can probably
> > come up with more measurements.
> > 
> > As a side note, at least some users may consider this a useful feature:
> > http://www.nntp.perl.org/group/perl.perl5.porters/2013/11/msg209482.html
> 
> You should CC relevant maintainers or annotate the subject -- this is
> a C/C++ frontend patch introducing __builtin_has_side_effects_p
> plus a patch adding a GCC supplied assert.h header.
> 
> Note that from a distribution point of view I wouldn't enable
> assume-asserts for a distro-build given the random behavior of
> __builtin_unreachable in case of assert failure.
> 
One option could be to provide such behaviour as new builtins, to be used for 
GSL implementations of Expects() and Ensures(). See 
https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md 
and https://github.com/Microsoft/GSL/blob/master/gsl/gsl_assert

But I think using the data from asserts should be safe and useful too, though 
there might be problems here and there with asserts that only make sense in 
#ifndef NDEBUG builds.

Allan



Re: [Patch][i386] PR 70118: Fix ubsan warning on SSE2 loadl_epi64 and storel_epi64

2016-11-27 Thread Allan Sandfeld Jensen
On Sunday 27 November 2016, Marc Glisse wrote:
> On Sat, 26 Nov 2016, Allan Sandfeld Jensen wrote:
> > Use the recently introduced unaligned variant of __m128i and add a
> > similar __m64 and use those to make it clear these two intrinsics
> > require neither 128- bit nor 64-bit alignment.
> 
> Thanks for doing this. You'll want Uros or Kirill to review your patch.
> There are probably several more places that could do with an unaligned
> fix, but we don't have to find them all at once.
> First I found it strange to use __m64, but then it actually seems like a
> good call to use a type that is not just aligned(1) but also may_alias.
> 
> +  *(__m64_u *)__P = __m64(((__v2di)__B)[0]);
> 
> gcc complains about this syntax for me, it wants parentheses around
> __m64... Did it pass the testsuite for you?

Fixed, it now matches the move just below.
Index: gcc/config/i386/emmintrin.h
===
--- gcc/config/i386/emmintrin.h	(revision 242892)
+++ gcc/config/i386/emmintrin.h	(working copy)
@@ -703,9 +703,9 @@
 }
 
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
-_mm_loadl_epi64 (__m128i const *__P)
+_mm_loadl_epi64 (__m128i_u const *__P)
 {
-  return _mm_set_epi64 ((__m64)0LL, *(__m64 *)__P);
+  return _mm_set_epi64 ((__m64)0LL, *(__m64_u *)__P);
 }
 
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
@@ -721,9 +721,9 @@
 }
 
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
-_mm_storel_epi64 (__m128i *__P, __m128i __B)
+_mm_storel_epi64 (__m128i_u *__P, __m128i __B)
 {
-  *(long long *)__P = ((__v2di)__B)[0];
+  *(__m64_u *)__P = (__m64) ((__v2di)__B)[0];
 }
 
 extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
Index: gcc/config/i386/mmintrin.h
===
--- gcc/config/i386/mmintrin.h	(revision 242892)
+++ gcc/config/i386/mmintrin.h	(working copy)
@@ -37,6 +37,9 @@
vector types, and their scalar components.  */
 typedef int __m64 __attribute__ ((__vector_size__ (8), __may_alias__));
 
+/* Unaligned version of the same type  */
+typedef int __m64_u __attribute__ ((__vector_size__ (8), __may_alias__, __aligned__ (1)));
+
 /* Internal data types for implementing the intrinsics.  */
 typedef int __v2si __attribute__ ((__vector_size__ (8)));
 typedef short __v4hi __attribute__ ((__vector_size__ (8)));


[Patch][i386] PR 70118: Fix ubsan warning on SSE2 loadl_epi64 and storel_epi64

2016-11-26 Thread Allan Sandfeld Jensen
Use the recently introduced unaligned variant of __m128i and add a similar 
__m64 and use those to make it clear these two intrinsics require neither 128-
bit nor 64-bit alignment.

Allan
Index: gcc/config/i386/emmintrin.h
===
--- gcc/config/i386/emmintrin.h	(revision 242753)
+++ gcc/config/i386/emmintrin.h	(working copy)
@@ -703,9 +703,9 @@
 }
 
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
-_mm_loadl_epi64 (__m128i const *__P)
+_mm_loadl_epi64 (__m128i_u const *__P)
 {
-  return _mm_set_epi64 ((__m64)0LL, *(__m64 *)__P);
+  return _mm_set_epi64 ((__m64)0LL, *(__m64_u *)__P);
 }
 
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
@@ -721,9 +721,9 @@
 }
 
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
-_mm_storel_epi64 (__m128i *__P, __m128i __B)
+_mm_storel_epi64 (__m128i_u *__P, __m128i __B)
 {
-  *(long long *)__P = ((__v2di)__B)[0];
+  *(__m64_u *)__P = __m64(((__v2di)__B)[0]);
 }
 
 extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
Index: gcc/config/i386/mmintrin.h
===
--- gcc/config/i386/mmintrin.h	(revision 242753)
+++ gcc/config/i386/mmintrin.h	(working copy)
@@ -37,6 +37,9 @@
vector types, and their scalar components.  */
 typedef int __m64 __attribute__ ((__vector_size__ (8), __may_alias__));
 
+/* Unaligned version of the same type  */
+typedef int __m64_u __attribute__ ((__vector_size__ (8), __may_alias__, __aligned__ (1)));
+
 /* Internal data types for implementing the intrinsics.  */
 typedef int __v2si __attribute__ ((__vector_size__ (8)));
 typedef short __v4hi __attribute__ ((__vector_size__ (8)));


Re: [PATCH] libiberty: Fix -Wimplicit-fallthrough warnings.

2016-11-18 Thread Allan Sandfeld Jensen


On Wednesday 02 November 2016, Mark Wielaard wrote:
> -case 11: c+=((hashval_t)k[10]<<24);
> -case 10: c+=((hashval_t)k[9]<<16);
> -case 9 : c+=((hashval_t)k[8]<<8);
> +case 11: c+=((hashval_t)k[10]<<24);  /* fall through */
> +case 10: c+=((hashval_t)k[9]<<16);   /* fall through */
> +case 9 : c+=((hashval_t)k[8]<<8);/* fall through */
>/* the first byte of c is reserved for the length */

This really highlights another exception -Wimplicit-fallthrough should 
tolerate, at least at level 1: a single line of code.

case X: my_statement();
case y: my_statement();

and 
case X:
my_statement();
case y:
my_statement();

In both cases, the lack of a break is obvious at a glance, so a comment 
highlighting it has never been needed and shouldn't be enforced.

Well, at least in my opinion; it would also take away all the remaining 
false positives I run into.

Best regards
Allan


Re: [PATCH] Introduce -Wimplicit-fallthrough={0,1,2,3,4,5}

2016-10-13 Thread Allan Sandfeld Jensen
On Tuesday 11 October 2016, Jakub Jelinek wrote:
> Hi!
> 
> The following patch introduces difference warning levels for
> -Wimplicit-fallthrough warning, so projects can choose if they want to
> honor only attributes (-Wimplicit-fallthrough=5), or what kind of comments.
> =4 is very picky and accepts only very small amount of comments, =3 is what
> we had before this patch, =2 looks case insensitively for falls?[
> \t-]*thr(u|ough) anywhere in the comment, =1 accepts any comment, =0 is
> the same as -Wno-implicit-fallthrough - disables the warning.

I would suggest also looking for comments with "no break" in them, as that is 
another common way to annotate the intentional lack of a 'break'.

If you want another example besides the linux kernel, I unified all our fall 
through comments in qtbase in August: 
https://codereview.qt-project.org/#/c/163595/

Though Qt is far from -Wimplicit-fallthrough clean after that; another 
colleague is working on it, since we have traditionally aimed for -Wall 
-Wextra -Werror, though it will seriously wreak havoc with readability in 
several places with unrolled loops and switches on integers.

Best regards
Allan



Re: Constexpr in intrinsics?

2016-03-27 Thread Allan Sandfeld Jensen
On Sunday 27 March 2016, Marc Glisse wrote:
> On Sun, 27 Mar 2016, Allan Sandfeld Jensen wrote:
> > Would it be possible to add constexpr to the intrinsics headers?
> > 
> > For instance _mm_set_XX and _mm_setzero intrinsics.
> 
> Already suggested here:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65197
> 
> A patch would be welcome (I started doing it at some point, I don't
> remember if it was functional, the patch is attached).
> 
That looks very similar to the patch I experimented with, and that at least 
works for using them in C++11 constexpr functions.

> > Ideally it could also be added all intrinsics that can be evaluated at
> > compile time, but it is harder to tell which those are.
> > 
> > Does gcc have a C extension we can use to set constexpr?
> 
> What for?

To have similar functionality in C. For instance, to explicitly allow those 
functions to be evaluated at compile time, and values with similar attributes 
to be optimized out completely. And of course to avoid preprocessor noise in 
shared C/C++ headers like these.

Best regards
Allan


Constexpr in intrinsics?

2016-03-27 Thread Allan Sandfeld Jensen
Would it be possible to add constexpr to the intrinsics headers?

For instance _mm_set_XX and _mm_setzero intrinsics.

Ideally it could also be added all intrinsics that can be evaluated at compile 
time, but it is harder to tell which those are.

Does gcc have a C extension we can use to set constexpr?

Best regards
Allan


Re: Change to C++11 by default?

2015-05-08 Thread Allan Sandfeld Jensen
On Thursday 07 May 2015, Jason Merrill wrote:
 I think it's time to switch to C++11 as the default C++ dialect for GCC
 6.  Any thoughts?
 
Would it be unrealistic to make C++14 the default? With it being a fixup of 
C++11, I would guess it could have longer staying power as the default.

Allan


Re: [Patch, i386] Support BMI and BMI2 targets in multiversioning

2015-01-26 Thread Allan Sandfeld Jensen
On Monday 26 January 2015, H.J. Lu wrote:
 On Mon, Jan 26, 2015 at 11:08 AM, Allan Sandfeld Jensen
 
 carew...@gmail.com wrote:
  On Monday 26 January 2015, H.J. Lu wrote:
  On Mon, Jan 26, 2015 at 10:53 AM, Allan Sandfeld Jensen
  
  carew...@gmail.com wrote:
 Committed with a bunch of fixes (e.g. missing fold_builtin_cpu
 part in gcc/config/i386/i386.c, and mv17.C test didn't compile
 at all due to missing parenthesis).
 
 ... and now with committed ChangeLog and patch.
 
 gcc/ChangeLog:
 * config/i386/i386.c (get_builtin_code_for_version): Add
 support for BMI and BMI2 multiversion functions.
 (fold_builtin_cpu): Add F_BMI and F_BMI2.
 
 libgcc/ChangeLog:
 * config/i386/cpuinfo.c (enum processor_features): Add
 FEATURE_BMI and FEATURE_BMI2.
 (get_available_features): Detect FEATURE_BMI and
 FEATURE_BMI2.
 
 testsuite/ChangeLog:
 * gcc.target/i386/funcspec-5.c: Test new multiversion
 targets. * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion
 dispatcher.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 9ec40cb..441911d 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -34289,15 +34289,18 @@ get_builtin_code_for_version (tree decl,
tree *predica te_list)

 P_PROC_SSE4_A,
 P_SSE4_1,
 P_SSE4_2,

-P_PROC_SSE4_2,

 P_POPCNT,

+P_PROC_SSE4_2,

 P_AVX,
 P_PROC_AVX,

+P_BMI,
+P_PROC_BMI,

 P_FMA4,
 P_XOP,
 P_PROC_XOP,
 P_FMA,
 P_PROC_FMA,

+P_BMI2,

 P_AVX2,
 P_PROC_AVX2,
 P_AVX512F,

This changed the priority of P_POPCNT and caused

FAIL: g++.dg/ext/mv1.C  -std=gnu++98 execution test
FAIL: g++.dg/ext/mv1.C  -std=gnu++11 execution test
FAIL: g++.dg/ext/mv1.C  -std=gnu++14 execution test

on Nehalem and Westmere machines:

mv1.exe:
/export/gnu/import/git/sources/gcc/gcc/testsuite/g++.dg/ext/mv1.C:
51: int main(): Assertion `val == 5' failed.

since val is 6 now.

Right. I am not sure why popcnt was prioritized below arch=corei7.
The logic is supposed to be that any target that includes an
extension is prioritized
   
   I don't understand your question.  popcnt feature is separate from
   -march. Its priority has nothing to do with -march=corei7.
   
   arch=corei7 implies popcnt. See PTA_NEHALEM in i386.c. The test would
   probably work with -march=core2.
   
   AFAIK The logic of the priorities in multiversioning is that
   architecture specific functions are chosen over feature specific,
   unless the feature is one that isn't required by the architecture.
  
  On SSE4.2 machines, we should choose
  
  int __attribute__ ((target(arch=corei7))) foo ();
  
  over
  
  int __attribute__ ((target(popcnt))) foo ();
  
  But we shouldn't choose
  
  int __attribute__ ((target(arch=corei7))) foo ();
  
  over
  
  int __attribute__ ((target(arch=corei7,popcnt))) foo ();
  
  I guess since they represent the exact same effective ISA, they would
  have equal priority, so that it would likely chose whatever comes last.
 
 I have no strong opinion on this.  But this is a user visible compiler
 behavior change.  We should issue a warning/note here.

Yes, or revert that part of the patch. It should have been a separate patch 
anyway.

Allan


Re: [Patch, i386] Support BMI and BMI2 targets in multiversioning

2015-01-26 Thread Allan Sandfeld Jensen
On Monday 26 January 2015, you wrote:
 On Mon, Jan 26, 2015 at 10:38 AM, Allan Sandfeld Jensen
 
 al...@carewolf.com wrote:
  On Monday 26 January 2015, H.J. Lu wrote:
  On Sun, Jan 25, 2015 at 10:37 AM, Uros Bizjak ubiz...@gmail.com wrote:
   On Sun, Jan 25, 2015 at 7:23 PM, Uros Bizjak ubiz...@gmail.com wrote:
   On Sat, Jan 24, 2015 at 11:49 AM, Allan Sandfeld Jensen
   
   al...@carewolf.com wrote:
   On Saturday 24 January 2015, Uros Bizjak wrote:
   On Mon, Jan 12, 2015 at 6:02 PM, Uros Bizjak ubiz...@gmail.com 
wrote:
Hello!

On Wed, Dec 31, 2014 at 01:28:47PM +0100, Allan Sandfeld Jensen
  
  wrote:
 I recently wanted to use multiversioning for BMI2 specific
 extensions PDEP/PEXT, and noticed it wasn't there. So I wrote
 this patch to add it, and also added AES, F16C and BMI1 for
 completeness.

AES nor F16C doesn't make any sense IMHO for multiversioning,
you need special intrinsics for that anyway and when you use
them, the function will fail to compile without those
features. Multiversioning only makes sense for ISA features
the compiler uses for normal C/C++ code without any
intrinsics.

Patch reduced to just adding BMI and BMI2 multiversioning:
+2014-12-29  Allan Sandfeld Jensen  sandf...@kde.org
+
+ * config/i386/i386.c (get_builtin_code_for_version): Add
+ support for BMI and BMI2 multiversion functions.

+2014-12-29  Allan Sandfeld Jensen  sandf...@kde.org
+
+ * gcc.target/i386/funcspec-5.c: Test new multiversion targets.
+ * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher.

+2014-12-29  Allan Sandfeld Jensen  sandf...@kde.org
+
+ * config/i386/cpuinfo.c (enum processor_features): Add
FEATURE_BMI and + FEATURE_BMI2.
+ (get_available_features): Detect FEATURE_BMI and FEATURE_BMI2.

OK for mainline
   
   Allan, did you commit the patch to mainline? I don't see it in SVN
   logs.
   
   (If you don't have SVN commit access, please mention it in the
   patch submission, so someone will commit the patch for you).
   
   Sorry. I don't have SVN commit access.
   
   Committed with a bunch of fixes (e.g. missing fold_builtin_cpu part
   in gcc/config/i386/i386.c, and mv17.C test didn't compile at all due
   to missing parenthesis).
   
   ... and now with committed ChangeLog and patch.
   
   gcc/ChangeLog:
   * config/i386/i386.c (get_builtin_code_for_version): Add
   support for BMI and BMI2 multiversion functions.
   (fold_builtin_cpu): Add F_BMI and F_BMI2.
   
   libgcc/ChangeLog:
   * config/i386/cpuinfo.c (enum processor_features): Add FEATURE_BMI
   and FEATURE_BMI2.
   (get_available_features): Detect FEATURE_BMI and FEATURE_BMI2.
   
   testsuite/ChangeLog:
   * gcc.target/i386/funcspec-5.c: Test new multiversion targets.
   * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher.
  
  diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
  index 9ec40cb..441911d 100644
  --- a/gcc/config/i386/i386.c
  +++ b/gcc/config/i386/i386.c
  @@ -34289,15 +34289,18 @@ get_builtin_code_for_version (tree decl, tree
  *predica te_list)
  
   P_PROC_SSE4_A,
   P_SSE4_1,
   P_SSE4_2,
  
  -P_PROC_SSE4_2,
  
   P_POPCNT,
  
  +P_PROC_SSE4_2,
  
   P_AVX,
   P_PROC_AVX,
  
  +P_BMI,
  +P_PROC_BMI,
  
   P_FMA4,
   P_XOP,
   P_PROC_XOP,
   P_FMA,
   P_PROC_FMA,
  
  +P_BMI2,
  
   P_AVX2,
   P_PROC_AVX2,
   P_AVX512F,
  
  This changed the priority of P_POPCNT and caused
  
  FAIL: g++.dg/ext/mv1.C  -std=gnu++98 execution test
  FAIL: g++.dg/ext/mv1.C  -std=gnu++11 execution test
  FAIL: g++.dg/ext/mv1.C  -std=gnu++14 execution test
  
  on Nehalem and Westmere machines:
  
  mv1.exe:
  /export/gnu/import/git/sources/gcc/gcc/testsuite/g++.dg/ext/mv1.C:51:
  int main(): Assertion `val == 5' failed.
  
  since val is 6 now.
  
  Right. I am not sure why popcnt was prioritized below arch=corei7. The
  logic is supposed to be that any target that includes an extension is
  prioritized
 
 I don't understand your question.  popcnt feature is separate from -march.
 Its priority has nothing to do with -march=corei7.
 
arch=corei7 implies popcnt. See PTA_NEHALEM in i386.c. The test would probably 
work with -march=core2.

AFAIK The logic of the priorities in multiversioning is that architecture 
specific functions are chosen over feature specific, unless the feature is one 
that isn't required by the architecture.

Allan


Re: [Patch, i386] Support BMI and BMI2 targets in multiversioning

2015-01-26 Thread Allan Sandfeld Jensen
On Monday 26 January 2015, H.J. Lu wrote:
 On Mon, Jan 26, 2015 at 10:53 AM, Allan Sandfeld Jensen
 
 carew...@gmail.com wrote:
Committed with a bunch of fixes (e.g. missing fold_builtin_cpu
part in gcc/config/i386/i386.c, and mv17.C test didn't compile at
all due to missing parenthesis).

... and now with committed ChangeLog and patch.

gcc/ChangeLog:
* config/i386/i386.c (get_builtin_code_for_version): Add
support for BMI and BMI2 multiversion functions.
(fold_builtin_cpu): Add F_BMI and F_BMI2.

libgcc/ChangeLog:
* config/i386/cpuinfo.c (enum processor_features): Add
FEATURE_BMI and FEATURE_BMI2.
(get_available_features): Detect FEATURE_BMI and FEATURE_BMI2.

testsuite/ChangeLog:
* gcc.target/i386/funcspec-5.c: Test new multiversion targets.
* g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher.
   
   diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
   index 9ec40cb..441911d 100644
   --- a/gcc/config/i386/i386.c
   +++ b/gcc/config/i386/i386.c
   @@ -34289,15 +34289,18 @@ get_builtin_code_for_version (tree decl,
   tree *predica te_list)
   
P_PROC_SSE4_A,
P_SSE4_1,
P_SSE4_2,
   
   -P_PROC_SSE4_2,
   
P_POPCNT,
   
   +P_PROC_SSE4_2,
   
P_AVX,
P_PROC_AVX,
   
   +P_BMI,
   +P_PROC_BMI,
   
P_FMA4,
P_XOP,
P_PROC_XOP,
P_FMA,
P_PROC_FMA,
   
   +P_BMI2,
   
P_AVX2,
P_PROC_AVX2,
P_AVX512F,
   
   This changed the priority of P_POPCNT and caused
   
   FAIL: g++.dg/ext/mv1.C  -std=gnu++98 execution test
   FAIL: g++.dg/ext/mv1.C  -std=gnu++11 execution test
   FAIL: g++.dg/ext/mv1.C  -std=gnu++14 execution test
   
   on Nehalem and Westmere machines:
   
   mv1.exe:
   /export/gnu/import/git/sources/gcc/gcc/testsuite/g++.dg/ext/mv1.C:51:
   int main(): Assertion `val == 5' failed.
   
   since val is 6 now.
   
   Right. I am not sure why popcnt was prioritized below arch=corei7. The
   logic is supposed to be that any target that includes an extension is
   prioritized
  
  I don't understand your question.  popcnt feature is separate from
  -march. Its priority has nothing to do with -march=corei7.
  
  arch=corei7 implies popcnt. See PTA_NEHALEM in i386.c. The test would
  probably work with -march=core2.
  
  AFAIK The logic of the priorities in multiversioning is that architecture
  specific functions are chosen over feature specific, unless the feature
  is one that isn't required by the architecture.
 
 On SSE4.2 machines, we should choose
 
 int __attribute__ ((target(arch=corei7))) foo ();
 
 over
 
 int __attribute__ ((target(popcnt))) foo ();
 
 But we shouldn't choose
 
 int __attribute__ ((target(arch=corei7))) foo ();
 
 over
 
 int __attribute__ ((target(arch=corei7,popcnt))) foo ();

I guess since they represent the exact same effective ISA, they would have 
equal priority, so it would likely choose whatever comes last. 

Allan


Re: [Patch, i386] Support BMI and BMI2 targets in multiversioning

2015-01-26 Thread Allan Sandfeld Jensen
On Monday 26 January 2015, H.J. Lu wrote:
 On Sun, Jan 25, 2015 at 10:37 AM, Uros Bizjak ubiz...@gmail.com wrote:
  On Sun, Jan 25, 2015 at 7:23 PM, Uros Bizjak ubiz...@gmail.com wrote:
  On Sat, Jan 24, 2015 at 11:49 AM, Allan Sandfeld Jensen
  
  al...@carewolf.com wrote:
  On Saturday 24 January 2015, Uros Bizjak wrote:
  On Mon, Jan 12, 2015 at 6:02 PM, Uros Bizjak ubiz...@gmail.com wrote:
   Hello!
   
   On Wed, Dec 31, 2014 at 01:28:47PM +0100, Allan Sandfeld Jensen 
wrote:
I recently wanted to use multiversioning for BMI2 specific
extensions PDEP/PEXT, and noticed it wasn't there. So I wrote
this patch to add it, and also added AES, F16C and BMI1 for
completeness.
   
   AES nor F16C doesn't make any sense IMHO for multiversioning, you
   need special intrinsics for that anyway and when you use them,
   the function will fail to compile without those features.
   Multiversioning only makes sense for ISA features the compiler
   uses for normal C/C++ code without any intrinsics.
   
   Patch reduced to just adding BMI and BMI2 multiversioning:
   +2014-12-29  Allan Sandfeld Jensen  sandf...@kde.org
   +
   + * config/i386/i386.c (get_builtin_code_for_version): Add
   + support for BMI and BMI2 multiversion functions.
   
   +2014-12-29  Allan Sandfeld Jensen  sandf...@kde.org
   +
   + * gcc.target/i386/funcspec-5.c: Test new multiversion targets.
   + * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher.
   
   +2014-12-29  Allan Sandfeld Jensen  sandf...@kde.org
   +
   + * config/i386/cpuinfo.c (enum processor_features): Add FEATURE_BMI
   and + FEATURE_BMI2.
   + (get_available_features): Detect FEATURE_BMI and FEATURE_BMI2.
   
   OK for mainline
  
  Allan, did you commit the patch to mainline? I don't see it in SVN
  logs.
  
  (If you don't have SVN commit access, please mention it in the patch
  submission, so someone will commit the patch for you).
  
  Sorry. I don't have SVN commit access.
  
  Committed with a bunch of fixes (e.g. missing fold_builtin_cpu part in
  gcc/config/i386/i386.c, and mv17.C test didn't compile at all due to
  missing parenthesis).
  
  ... and now with committed ChangeLog and patch.
  
  gcc/ChangeLog:
  * config/i386/i386.c (get_builtin_code_for_version): Add
  support for BMI and BMI2 multiversion functions.
  (fold_builtin_cpu): Add F_BMI and F_BMI2.
  
  libgcc/ChangeLog:
  * config/i386/cpuinfo.c (enum processor_features): Add FEATURE_BMI
  and FEATURE_BMI2.
  (get_available_features): Detect FEATURE_BMI and FEATURE_BMI2.
  
  testsuite/ChangeLog:
  * gcc.target/i386/funcspec-5.c: Test new multiversion targets.
  * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher.
 
 diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
 index 9ec40cb..441911d 100644
 --- a/gcc/config/i386/i386.c
 +++ b/gcc/config/i386/i386.c
 @@ -34289,15 +34289,18 @@ get_builtin_code_for_version (tree decl, tree
 *predica te_list)
  P_PROC_SSE4_A,
  P_SSE4_1,
  P_SSE4_2,
 -P_PROC_SSE4_2,
  P_POPCNT,
 +P_PROC_SSE4_2,
  P_AVX,
  P_PROC_AVX,
 +P_BMI,
 +P_PROC_BMI,
  P_FMA4,
  P_XOP,
  P_PROC_XOP,
  P_FMA,
  P_PROC_FMA,
 +P_BMI2,
  P_AVX2,
  P_PROC_AVX2,
  P_AVX512F,
 
 This changed the priority of P_POPCNT and caused
 
 FAIL: g++.dg/ext/mv1.C  -std=gnu++98 execution test
 FAIL: g++.dg/ext/mv1.C  -std=gnu++11 execution test
 FAIL: g++.dg/ext/mv1.C  -std=gnu++14 execution test
 
 on Nehalem and Westmere machines:
 
 mv1.exe:
 /export/gnu/import/git/sources/gcc/gcc/testsuite/g++.dg/ext/mv1.C:51: int
 main(): Assertion `val == 5' failed.
 
 since val is 6 now.

Right. I am not sure why popcnt was prioritized below arch=corei7. The logic 
is supposed to be that any target that includes an extension is prioritized 
above that extension. I included the change in the order in i386.c, but not 
the updated test. It is probably better to not change the order in i386.c, or 
at least change that in a separate commit.
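The selection logic at issue can be sketched in a few lines. Below is a hypothetical model (not GCC's actual code) of how the resolver picks a version: each version carries a priority from the enum above and the result of its runtime predicate, and the supported version with the highest priority wins, so reordering P_POPCNT relative to P_PROC_SSE4_2 changes which body a Nehalem CPU gets.

```c
#include <assert.h>

/* Hypothetical sketch (not GCC's actual resolver) of multiversion
   dispatch: among the versions whose runtime predicate holds on this
   CPU, the one with the highest priority value is chosen. */
struct version {
    int priority;   /* e.g. P_POPCNT or P_PROC_SSE4_2 from the enum above */
    int supported;  /* did the runtime check (__builtin_cpu_*) succeed? */
    int id;         /* which function body this version returns */
};

static int dispatch(const struct version *v, int n)
{
    int best_id = -1, best_priority = -1;
    for (int i = 0; i < n; ++i)
        if (v[i].supported && v[i].priority > best_priority) {
            best_priority = v[i].priority;
            best_id = v[i].id;
        }
    return best_id;
}
```

On a Nehalem machine both the target("arch=corei7") version (mv1.C's `return 5`) and the target("popcnt") version (`return 6`) are supported, so swapping the two priorities flips the result from 5 to 6, which is exactly the reported mv1.C failure.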

`Allan



Re: [Patch, i386] Support BMI and BMI2 targets in multiversioning

2015-01-24 Thread Allan Sandfeld Jensen
On Saturday 24 January 2015, Uros Bizjak wrote:
 On Mon, Jan 12, 2015 at 6:02 PM, Uros Bizjak ubiz...@gmail.com wrote:
  Hello!
  
  On Wed, Dec 31, 2014 at 01:28:47PM +0100, Allan Sandfeld Jensen wrote:
   I recently wanted to use multiversioning for BMI2 specific extensions
   PDEP/PEXT, and noticed it wasn't there. So I wrote this patch to add
   it, and also added AES, F16C and BMI1 for completeness.
  
  Neither AES nor F16C makes any sense IMHO for multiversioning; you need
  special intrinsics for those anyway, and when you use them, the function
  will fail to compile without those features.
  Multiversioning only makes sense for ISA features the compiler uses for
  normal C/C++ code without any intrinsics.
  
  Patch reduced to just adding BMI and BMI2 multiversioning:
  +2014-12-29  Allan Sandfeld Jensen  sandf...@kde.org
  +
  + * config/i386/i386.c (get_builtin_code_for_version): Add
  + support for BMI and BMI2 multiversion functions.
  
  +2014-12-29  Allan Sandfeld Jensen  sandf...@kde.org
  +
  + * gcc.target/i386/funcspec-5.c: Test new multiversion targets.
  + * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher.
  
  +2014-12-29  Allan Sandfeld Jensen  sandf...@kde.org
  +
  + * config/i386/cpuinfo.c (enum processor_features): Add FEATURE_BMI and
  + FEATURE_BMI2.
  + (get_available_features): Detect FEATURE_BMI and FEATURE_BMI2.
  
  OK for mainline
 
 Allan, did you commit the patch to mainline? I don't see it in SVN logs.
 
 (If you don't have SVN commit access, please mention it in the patch
 submission, so someone will commit the patch for you).
 
Sorry. I don't have SVN commit access.

`Allan


Re: [Patch, i386] Support BMI and BMI2 targets in multiversioning

2015-01-10 Thread Allan Sandfeld Jensen
On Wednesday 31 December 2014, Jakub Jelinek wrote:
 On Wed, Dec 31, 2014 at 01:28:47PM +0100, Allan Sandfeld Jensen wrote:
  I recently wanted to use multiversioning for BMI2 specific extensions
  PDEP/PEXT, and noticed it wasn't there. So I wrote this patch to add it,
  and also added AES, F16C and BMI1 for completeness.
 
 Neither AES nor F16C makes any sense IMHO for multiversioning; you need
 special intrinsics for those anyway, and when you use them, the function will
 fail to compile without those features.
 Multiversioning only makes sense for ISA features the compiler uses for
 normal C/C++ code without any intrinsics.
 
Patch reduced to just adding BMI and BMI2 multiversioning:

Best regards
`Allan
commit aa5b98407bcc3ae2f1fef70455b89447112c3e2c
Author: Allan Sandfeld Jensen allan.jen...@digia.com
Date:   Fri Dec 26 21:14:01 2014 +0100

BMI and BMI2 multiversion support

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index ff8a5e6..6afd58a 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2014-12-29  Allan Sandfeld Jensen  sandf...@kde.org
+
+	* config/i386/i386.c (get_builtin_code_for_version): Add
+	support for BMI and BMI2 multiversion functions.
+
 2014-12-27  H.J. Lu  hongjiu...@intel.com
 
 	PR target/64409
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index d693fdb..9ae9bca 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -34261,15 +34261,18 @@ get_builtin_code_for_version (tree decl, tree *predicate_list)
 P_PROC_SSE4_A,
 P_SSE4_1,
 P_SSE4_2,
-P_PROC_SSE4_2,
 P_POPCNT,
+P_PROC_SSE4_2,
 P_AVX,
 P_PROC_AVX,
+P_BMI,
+P_PROC_BMI,
 P_FMA4,
 P_XOP,
 P_PROC_XOP,
 P_FMA,
 P_PROC_FMA,
+P_BMI2,
 P_AVX2,
 P_PROC_AVX2,
 P_AVX512F,
@@ -34295,12 +34298,14 @@ get_builtin_code_for_version (tree decl, tree *predicate_list)
       {"sse4a", P_SSE4_A},
       {"ssse3", P_SSSE3},
       {"sse4.1", P_SSE4_1},
-      {"sse4.2", P_SSE4_2},
       {"popcnt", P_POPCNT},
+      {"sse4.2", P_SSE4_2},
       {"avx", P_AVX},
+      {"bmi", P_BMI},
       {"fma4", P_FMA4},
       {"xop", P_XOP},
       {"fma", P_FMA},
+      {"bmi2", P_BMI2},
       {"avx2", P_AVX2},
       {"avx512f", P_AVX512F}
     };
@@ -34395,7 +34400,7 @@ get_builtin_code_for_version (tree decl, tree *predicate_list)
 	  break;
 	case PROCESSOR_BTVER2:
 	  arg_str = "btver2";
-	  priority = P_PROC_AVX;
+	  priority = P_PROC_BMI;
 	  break;
 	case PROCESSOR_BDVER1:
 	  arg_str = "bdver1";
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index ef6ddcc..5b11622 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2014-12-29  Allan Sandfeld Jensen  sandf...@kde.org
+
+	* gcc.target/i386/funcspec-5.c: Test new multiversion targets.
+	* g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher.
+
 2014-12-28  H.J. Lu  hongjiu...@intel.com
 
 	* gcc.target/i386/pr57003.c: Skip on x32.
diff --git a/gcc/testsuite/g++.dg/ext/mv17.C b/gcc/testsuite/g++.dg/ext/mv17.C
new file mode 100644
index 000..311f217
--- /dev/null
+++ b/gcc/testsuite/g++.dg/ext/mv17.C
@@ -0,0 +1,91 @@
+/* Test case to check if Multiversioning works for BMI and BMI2.  */
+/* { dg-do run { target i?86-*-* x86_64-*-* } } */
+/* { dg-options "-O2" } */
+
+#include <assert.h>
+
+// Check BMI feature selection works
+int foo () __attribute__((target("default")));
+int foo () __attribute__((target("bmi")));
+int foo () __attribute__((target("bmi2")));
+
+// Check specialized versions for archs with BMI is chosen over generic BMI versions.
+int bar () __attribute__((target("default")));
+int bar () __attribute__((target("bmi")));
+int bar () __attribute__((target("bmi2")));
+int bar () __attribute__((target("arch=btver2")));
+int bar () __attribute__((target("arch=haswell")));
+
+int main ()
+{
+  int val = foo ();
+
+  if (__builtin_cpu_supports ("bmi2"))
+    assert (val == 2);
+  else if (__builtin_cpu_supports ("bmi"))
+    assert (val == 1);
+  else
+    assert (val == 0);
+
+  val = bar ();
+
+  if (__builtin_cpu_is ("btver2")
+    assert (val == 5);
+  else if (__builtin_cpu_is ("haswell"))
+    assert (val == 6);
+  else if (__builtin_cpu_supports ("bmi2"))
+    assert (val == 2);
+  else if (__builtin_cpu_supports ("bmi"))
+    assert (val == 1);
+  else
+    assert (val == 0);
+
+  return 0;
+}
+
+int __attribute__ ((target("default")))
+foo ()
+{
+  return 0;
+}
+
+int __attribute__ ((target("bmi")))
+foo ()
+{
+  return 1;
+}
+int __attribute__ ((target("bmi2")))
+foo ()
+{
+  return 2;
+}
+
+int __attribute__ ((target("default")))
+bar ()
+{
+  return 0;
+}
+
+int __attribute__ ((target("bmi")))
+bar ()
+{
+  return 1;
+}
+int __attribute__ ((target("bmi2")))
+bar ()
+{
+  return 2;
+}
+
+int __attribute__ ((target("arch=btver2")))
+bar ()
+{
+  return 5;
+}
+
+int __attribute__ ((target("arch=haswell")))
+bar ()
+{
+  return 6;
+}
+
diff --git a/gcc/testsuite/gcc.target/i386/funcspec-5.c b/gcc/testsuite/gcc.target/i386/funcspec-5.c
index 269e610..2e316a7

[Patch, i386] Support AES, F16C, BMI and BMI2 targets in multiversioning

2014-12-31 Thread Allan Sandfeld Jensen
I recently wanted to use multiversioning for BMI2 specific extensions 
PDEP/PEXT, and noticed it wasn't there. So I wrote this patch to add it, and 
also added AES, F16C and BMI1 for completeness.

Happy new year
`Allan
commit 062c09d45d22302ffbd4f86d88e16a1a0d49cd80
Author: Allan Sandfeld Jensen allan.jen...@digia.com
Date:   Fri Dec 26 21:14:01 2014 +0100

AES, F16C BMI and BMI2 multiversion support

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index ff8a5e6..83f16a5 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2014-12-29  Allan Sandfeld Jensen  sandf...@kde.org
+
+   * config/i386/i386.c (get_builtin_code_for_version): Add
+   support for AES, BMI, BMI2 and F16C multiversion functions.
+
 2014-12-27  H.J. Lu  hongjiu...@intel.com
 
PR target/64409
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index d693fdb..a1b74dc 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -34261,15 +34261,22 @@ get_builtin_code_for_version (tree decl, tree 
*predicate_list)
 P_PROC_SSE4_A,
 P_SSE4_1,
 P_SSE4_2,
-P_PROC_SSE4_2,
 P_POPCNT,
+P_PROC_SSE4_2,
+P_AES,
+P_PROC_AES,
 P_AVX,
 P_PROC_AVX,
+P_F16C,
+P_PROC_F16C,
+P_BMI,
+P_PROC_BMI,
 P_FMA4,
 P_XOP,
 P_PROC_XOP,
 P_FMA,
 P_PROC_FMA,
+P_BMI2,
 P_AVX2,
 P_PROC_AVX2,
 P_AVX512F,
@@ -34295,12 +34302,16 @@ get_builtin_code_for_version (tree decl, tree 
*predicate_list)
       {"sse4a", P_SSE4_A},
       {"ssse3", P_SSSE3},
       {"sse4.1", P_SSE4_1},
-      {"sse4.2", P_SSE4_2},
       {"popcnt", P_POPCNT},
+      {"sse4.2", P_SSE4_2},
+      {"aes", P_AES},
       {"avx", P_AVX},
+      {"f16c", P_F16C},
+      {"bmi", P_BMI},
       {"fma4", P_FMA4},
       {"xop", P_XOP},
       {"fma", P_FMA},
+      {"bmi2", P_BMI2},
       {"avx2", P_AVX2},
       {"avx512f", P_AVX512F}
     };
@@ -34350,21 +34361,25 @@ get_builtin_code_for_version (tree decl, tree 
*predicate_list)
 	  priority = P_PROC_SSSE3;
 	  break;
 	case PROCESSOR_NEHALEM:
-	  if (new_target->x_ix86_isa_flags & OPTION_MASK_ISA_AES)
+	  if (new_target->x_ix86_isa_flags & OPTION_MASK_ISA_AES) {
 	    arg_str = "westmere";
-	  else
+	    priority = P_PROC_AES;
+	  } else {
 	    /* We translate arch=corei7 and arch=nehalem to
 	       corei7 so that it will be mapped to M_INTEL_COREI7
 	       as cpu type to cover all M_INTEL_COREI7_XXXs.  */
 	    arg_str = "corei7";
-	  priority = P_PROC_SSE4_2;
+	    priority = P_PROC_SSE4_2;
+	  }
 	  break;
 	case PROCESSOR_SANDYBRIDGE:
-	  if (new_target->x_ix86_isa_flags & OPTION_MASK_ISA_F16C)
+	  if (new_target->x_ix86_isa_flags & OPTION_MASK_ISA_F16C) {
 	    arg_str = "ivybridge";
-	  else
+	    priority = P_PROC_F16C;
+	  } else {
 	    arg_str = "sandybridge";
-	  priority = P_PROC_AVX;
+	    priority = P_PROC_AVX;
+	  }
 	  break;
 	case PROCESSOR_HASWELL:
 	  if (new_target->x_ix86_isa_flags & OPTION_MASK_ISA_ADX)
@@ -34395,7 +34410,7 @@ get_builtin_code_for_version (tree decl, tree 
*predicate_list)
 	  break;
 	case PROCESSOR_BTVER2:
 	  arg_str = "btver2";
-	  priority = P_PROC_AVX;
+	  priority = P_PROC_BMI;
 	  break;
 	case PROCESSOR_BDVER1:
 	  arg_str = "bdver1";
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index ef6ddcc..5b11622 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2014-12-29  Allan Sandfeld Jensen  sandf...@kde.org
+
+   * gcc.target/i386/funcspec-5.c: Test new multiversion targets.
+   * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher.
+
 2014-12-28  H.J. Lu  hongjiu...@intel.com
 
* gcc.target/i386/pr57003.c: Skip on x32.
diff --git a/gcc/testsuite/g++.dg/ext/mv17.C b/gcc/testsuite/g++.dg/ext/mv17.C
new file mode 100644
index 000..311f217
--- /dev/null
+++ b/gcc/testsuite/g++.dg/ext/mv17.C
@@ -0,0 +1,91 @@
+/* Test case to check if Multiversioning works for BMI and BMI2.  */
+/* { dg-do run { target i?86-*-* x86_64-*-* } } */
+/* { dg-options "-O2" } */
+
+#include <assert.h>
+
+// Check BMI feature selection works
+int foo () __attribute__((target("default")));
+int foo () __attribute__((target("bmi")));
+int foo () __attribute__((target("bmi2")));
+
+// Check specialized versions for archs with BMI is chosen over generic BMI versions.
+int bar () __attribute__((target("default")));
+int bar () __attribute__((target("bmi")));
+int bar () __attribute__((target("bmi2")));
+int bar () __attribute__((target("arch=btver2")));
+int bar () __attribute__((target("arch=haswell")));
+
+int main ()
+{
+  int val = foo ();
+
+  if (__builtin_cpu_supports ("bmi2"))
+    assert (val == 2);
+  else

Re: [Patch, i386] Support AES, F16C, BMI and BMI2 targets in multiversioning

2014-12-31 Thread Allan Sandfeld Jensen
On Wednesday 31 December 2014, Jakub Jelinek wrote:
 On Wed, Dec 31, 2014 at 01:28:47PM +0100, Allan Sandfeld Jensen wrote:
  I recently wanted to use multiversioning for BMI2 specific extensions
  PDEP/PEXT, and noticed it wasn't there. So I wrote this patch to add it,
  and also added AES, F16C and BMI1 for completeness.
 
 Neither AES nor F16C makes any sense IMHO for multiversioning; you need
 special intrinsics for those anyway, and when you use them, the function will
 fail to compile without those features.
 Multiversioning only makes sense for ISA features the compiler uses for
 normal C/C++ code without any intrinsics.
 
I disagree. You just include the intrinsics and use them. Inside a function 
with the right target attribute they will work, even if they wouldn't outside 
it. This works even without multiversioning, e.g. in specific functions with 
specific targets.

`Allan


Optimizing bit extract

2014-02-14 Thread Allan Sandfeld Jensen
Hello gcc

I have been looking at optimizations of pixel-format conversion recently and 
have noticed that gcc does not take advantage of SSE4a extrq, BMI1 bextr, TBM 
bextri or BMI2 pext instructions when it could be useful.

As far as I can tell it should not be that hard. A bextr expression can 
typically be recognized as ((x >> s) & mask) or ((x << s1) >> s2). But I am 
unsure where to do such a matching, since the mask needs to have a specific 
form to be valid for bextr, so it seems it needs to be done before instruction 
selection.
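In software, the bextr semantics behind those patterns can be modelled in a few lines; a minimal sketch (bextr_model is a hypothetical helper for illustration, not a GCC API):

```c
#include <assert.h>
#include <stdint.h>

/* Software model of what bextr computes: extract `len` bits of x starting
   at bit `start`.  Equivalent to the patterns above: (x >> s) & mask with a
   contiguous mask, or the double-shift form ((x << s1) >> s2). */
static uint32_t bextr_model(uint32_t x, unsigned start, unsigned len)
{
    if (len == 0 || start >= 32)
        return 0;
    x >>= start;            /* drop the low `start` bits */
    if (len >= 32)
        return x;           /* whole remaining word, avoid shift-by-32 UB */
    return x & ((1u << len) - 1);
}
```

For example, bextr_model(0xABCD, 4, 8) yields the same value as the double-shift form (0xABCDu << 20) >> 24, which is the kind of equivalence a matcher would have to recognize.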

Secondly, the bextr instruction in itself only replaces two already fast 
instructions, so the gain is very minor (unless extracting variable bit-fields, 
which are harder to recognize).  The real optimization comes from being able to 
use pext (parallel bit extract), which can implement several bextr expressions 
in parallel.

So, where would be the right place to implement such instructions? Would it 
make sense to recognize bextr early, before we get to i386 code, or would it be 
better to recognize it late? And where do I put such instruction selection 
optimizations?

Motivating example:

unsigned rgb32_to_rgb16(unsigned rgb32) {
    unsigned char red = (rgb32 >> 19) & 0x1f;
    unsigned char green = (rgb32 >> 10) & 0x3f;
    unsigned char blue = rgb32 & 0x1f;
    return (red << 11) | (green << 5) | blue;
}

can be implemented as pext(rgb32, 0x001f3f1f)
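As a hedged illustration of what pext computes, here is a portable software model (pext_model is a hypothetical helper, not a real GCC or libc API). Note that the selecting mask must match the bit positions the shifts actually use; for red in bits 19-23, green in bits 10-15 and blue in bits 0-4, as the shifts above imply, the mask works out to 0x00f8fc1f.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of BMI2 pext (parallel bit extract): gather the bits of
   x selected by mask into the low bits of the result, preserving order. */
static uint32_t pext_model(uint32_t x, uint32_t mask)
{
    uint32_t result = 0;
    for (uint32_t out_bit = 1; mask != 0; mask &= mask - 1) {
        uint32_t lowest = mask & -mask;  /* isolate lowest set mask bit */
        if (x & lowest)
            result |= out_bit;
        out_bit <<= 1;
    }
    return result;
}
```

With that mask, pext_model(rgb32, 0x00f8fc1f) produces exactly (red << 11) | (green << 5) | blue for those field positions, which is what makes pext attractive for this kind of conversion.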

Best regards
`Allan Sandfeld


Re: [Patch, i386] Separate Intel processor with expanded ISA

2014-01-26 Thread Allan Sandfeld Jensen
Updated patch with test.
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index ccbea0f..e80c30b 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2013-12-29  Allan Sandfeld Jensen  sandf...@kde.org
+	* config/i386/i386.c (get_builtin_code_for_version): Separate
+	Westmere from Nehalem, Ivy Bridge from Sandy Bridge and
+	Broadwell from Haswell. 
+
 2014-01-25  Walter Lee  w...@tilera.com
 
 	* config/tilegx/sync.md (atomic_fetch_sub): Fix negation and
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index cf79486..50f3624 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -31298,18 +31298,27 @@ get_builtin_code_for_version (tree decl, tree *predicate_list)
 	  priority = P_PROC_SSSE3;
 	  break;
 	case PROCESSOR_NEHALEM:
-	  /* We translate arch=corei7 and arch=nehelam to
-	     corei7 so that it will be mapped to M_INTEL_COREI7
-	     as cpu type to cover all M_INTEL_COREI7_XXXs.  */
-	  arg_str = "corei7";
+	  if (new_target->x_ix86_isa_flags & OPTION_MASK_ISA_AES)
+	    arg_str = "westmere";
+	  else
+	    /* We translate arch=corei7 and arch=nehelam to
+	       corei7 so that it will be mapped to M_INTEL_COREI7
+	       as cpu type to cover all M_INTEL_COREI7_XXXs.  */
+	    arg_str = "corei7";
 	  priority = P_PROC_SSE4_2;
 	  break;
 	case PROCESSOR_SANDYBRIDGE:
-	  arg_str = "sandybridge";
+	  if (new_target->x_ix86_isa_flags & OPTION_MASK_ISA_F16C)
+	    arg_str = "ivybridge";
+	  else
+	    arg_str = "sandybridge";
 	  priority = P_PROC_AVX;
 	  break;
 	case PROCESSOR_HASWELL:
-	  arg_str = "haswell";
+	  if (new_target->x_ix86_isa_flags & OPTION_MASK_ISA_ADX)
+	    arg_str = "broadwell";
+	  else
+	    arg_str = "haswell";
 	  priority = P_PROC_AVX2;
 	  break;
 	case PROCESSOR_BONNELL:
diff --git a/gcc/testsuite/g++.dg/ext/mv16.C b/gcc/testsuite/g++.dg/ext/mv16.C
new file mode 100644
index 000..4737c79
--- /dev/null
+++ b/gcc/testsuite/g++.dg/ext/mv16.C
@@ -0,0 +1,69 @@
+// Test that dispatching can choose the right multiversion
+// for Intel CPUs with the same internal GCC processor id
+// but slightly different sets of x86 extensions.
+
+// { dg-do run { target i?86-*-* x86_64-*-* } }
+// { dg-require-ifunc "" }
+// { dg-options "-O2" }
+
+#include <assert.h>
+
+int __attribute__ ((target("default")))
+foo ()
+{
+  return 0;
+}
+
+int __attribute__ ((target("arch=nehalem")))
+foo ()
+{
+  return 4;
+}
+
+int __attribute__ ((target("arch=westmere")))
+foo ()
+{
+  return 5;
+}
+
+int __attribute__ ((target("arch=sandybridge")))
+foo ()
+{
+  return 8;
+}
+
+int __attribute__ ((target("arch=ivybridge")))
+foo ()
+{
+  return 9;
+}
+
+int __attribute__ ((target("arch=haswell")))
+foo ()
+{
+  return 12;
+}
+
+int main ()
+{
+  int val = foo ();
+
+  if (__builtin_cpu_is ("nehalem"))
+    assert (val == 4);
+  else if (__builtin_cpu_is ("westmere"))
+    assert (val == 5);
+  else if (__builtin_cpu_is ("sandybridge"))
+    assert (val == 8);
+  else if (__builtin_cpu_is ("ivybridge"))
+    assert (val == 9);
+  else if (__builtin_cpu_is ("haswell"))
+    assert (val == 12);
+  else
+    assert (val == 0);
+
+  return 0;
+}


Re: [Patch, i386] Separate Intel processor with expanded ISA

2014-01-07 Thread Allan Sandfeld Jensen
No comments? 

On Sunday 29 December 2013, Allan Sandfeld Jensen wrote:
 The function dispatcher might currently choose functions declared with
 target(arch=ivybridge) on a Sandy Bridge CPU. This happens because the
 function is only detected as sandybridge when generated. The attached patch
 detects Westmere, Ivybridge and Broadwell processors based on new ISAs they
 support.
 
 Regards
 `Allan



[Patch, i386] Separate Intel processor with expanded ISA

2013-12-28 Thread Allan Sandfeld Jensen
The function dispatcher might currently choose functions declared with 
target(arch=ivybridge) on a Sandy Bridge CPU. This happens because the 
function is only detected as sandybridge when generated. The attached patch 
detects Westmere, Ivybridge and Broadwell processors based on new ISAs they 
support.
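A minimal sketch of this detection idea, with illustrative flag values and names (not GCC's real OPTION_MASK_ISA_* constants): each ambiguous internal processor id is refined into a more specific arch string by testing an ISA bit that only the newer generation supports.

```c
#include <assert.h>
#include <string.h>

/* Illustrative sketch: several Intel generations share one internal
   processor id, so the arch string is refined by checking an ISA bit the
   newer generation added (AES => Westmere, F16C => Ivy Bridge,
   ADX => Broadwell).  Flag values here are made up for the example. */
enum { ISA_AES = 1 << 0, ISA_F16C = 1 << 1, ISA_ADX = 1 << 2 };
enum processor { PROC_NEHALEM, PROC_SANDYBRIDGE, PROC_HASWELL };

static const char *arch_string(enum processor p, unsigned isa_flags)
{
    switch (p) {
    case PROC_NEHALEM:
        return (isa_flags & ISA_AES) ? "westmere" : "corei7";
    case PROC_SANDYBRIDGE:
        return (isa_flags & ISA_F16C) ? "ivybridge" : "sandybridge";
    case PROC_HASWELL:
        return (isa_flags & ISA_ADX) ? "broadwell" : "haswell";
    }
    return "generic";
}
```

This is why, with the patch, a function compiled for arch=ivybridge is no longer selected on a Sandy Bridge CPU: without the F16C bit, the detection stays at "sandybridge".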

Regards
`Allan
Index: gcc/ChangeLog
===
--- gcc/ChangeLog	(revision 206233)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,9 @@
+2013-12-29  Allan Sandfeld Jensen  sandf...@kde.org
+
+	* config/i386/i386.c (get_builtin_code_for_version): Separate
+	Westmere from Nehalem, Ivy Bridge from Sandy Bridge and
+	Broadwell from Haswell. 
+
 2013-12-28  Eric Botcazou  ebotca...@adacore.com
 
 	* doc/invoke.texi (output file options): Document -fada-spec-parent.
Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c	(revision 206233)
+++ gcc/config/i386/i386.c	(working copy)
@@ -30030,18 +30005,27 @@
 	  priority = P_PROC_SSSE3;
 	  break;
 	case PROCESSOR_NEHALEM:
-	  /* We translate arch=corei7 and arch=nehelam to
-	     corei7 so that it will be mapped to M_INTEL_COREI7
-	     as cpu type to cover all M_INTEL_COREI7_XXXs.  */
-	  arg_str = "corei7";
+	  if (new_target->x_ix86_isa_flags & OPTION_MASK_ISA_AES)
+	    arg_str = "westmere";
+	  else
+	    /* We translate arch=corei7 and arch=nehelam to
+	       corei7 so that it will be mapped to M_INTEL_COREI7
+	       as cpu type to cover all M_INTEL_COREI7_XXXs.  */
+	    arg_str = "corei7";
 	  priority = P_PROC_SSE4_2;
 	  break;
 	case PROCESSOR_SANDYBRIDGE:
-	  arg_str = "sandybridge";
+	  if (new_target->x_ix86_isa_flags & OPTION_MASK_ISA_F16C)
+	    arg_str = "ivybridge";
+	  else
+	    arg_str = "sandybridge";
 	  priority = P_PROC_AVX;
 	  break;
 	case PROCESSOR_HASWELL:
-	  arg_str = "haswell";
+	  if (new_target->x_ix86_isa_flags & OPTION_MASK_ISA_ADX)
+	    arg_str = "broadwell";
+	  else
+	    arg_str = "haswell";
 	  priority = P_PROC_AVX2;
 	  break;
 	case PROCESSOR_BONNELL:


Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-26 Thread Allan Sandfeld Jensen
On Thursday 26 December 2013, Gopalasubramanian, Ganesh wrote:
 Hi,
 
  (get_amd_cpu): Handle AMD_BOBCAT, AMD_JAGUAR, AMDFAM15H_BDVER2 and
  AMDFAM15H_BDVER3.
 
 As mentioned earlier, we would like to stick with BTVER1 and BTVER2 instead
 of using BOBCAT or JAGUAR. Attached patch does the changes.
 
Sorry for missing your comment. Thanks for fixing it. Renaming the comments 
with the AMD family names might be overdoing it though. 

`Allan

