Re: GCC and Floating-Point

2005-05-25 Thread Allan Sandfeld Jensen
On Wednesday 25 May 2005 16:22, chris jefferson wrote:
>
> On the other hand, in general using != and == on floating point numbers
> is always dangerous if you do not know all the consequences. For
> example, on your above program if I use 30.1 and 90.3, the program fails
> without -ffast-math.
>
Yes. I still don't understand why gcc doesn't enable -ffast-math by default like 
all other compilers. The people who need perfect standard behavior are far 
fewer than all the packagers who don't understand which optimization flags 
gcc should _always_ be called with.

`Allan


Re: GCC and Floating-Point

2005-05-26 Thread Allan Sandfeld Jensen
On Thursday 26 May 2005 10:15, Vincent Lefevre wrote:
> On 2005-05-25 19:27:21 +0200, Allan Sandfeld Jensen wrote:
> > Yes. I still don't understand why gcc doesn't do -ffast-math by
> > default like all other compilers.
>
> No! And I really don't think that other compilers do that.

I can't speak for all compilers, only the ones I've tried. ICC enables it 
always; Sun CC, DEC CXX and HP CC enable it at certain optimization levels 
(equivalent to -O2). 

Basically, any compiler that cares about benchmarks has it enabled by default.

Many of them, however, have multiple levels of relaxed floating point. The 
lowest levels try to be as accurate as possible, while the higher ones 
loosen the accuracy and just try to be as fast as possible.

>
> > The people who needs perfect standard behavior are a lot fewer than
> > all the packagers who doesn't understand which optimization flags
> > gcc should _always_ be called with.
>
> Standard should be the default.
>
> (Is this a troll or what?)

So why isn't -ansi or -pedantic the default?



`Allan



Re: GCC and Floating-Point

2005-05-27 Thread Allan Sandfeld Jensen
On Friday 27 May 2005 13:51, Vincent Lefevre wrote:
> So, yes, -ffast-math by default would really be a bad idea and would
> make gcc much worse than other compilers.
>
Thanks for the tests. I had no idea GCC's -ffast-math was that different from 
other compilers'.

Maybe the real goal, as others have mentioned, should be to split -ffast-math 
into multiple switches.

`Allan


Re: GCC Multi-Threading Ideas

2020-01-23 Thread Allan Sandfeld Jensen
On Montag, 20. Januar 2020 20:26:46 CET Nicholas Krause wrote:
> Greetings All,
> 
> Unfortunately due to me being rather busy with school and other things I
> will not be able to post my article to the wiki for awhile. However
> there is a  rough draft here:
> https://docs.google.com/document/d/1po_RRgSCtRyYgMHjV0itW8iOzJXpTdHYIpC9gUMj
> Oxk/edit that may change a little for people to read in the meantime.
> 
This comment might not be suited for your project, but now that I think about 
it: if we want to improve gcc toolchain build speed with better multithreading, 
I think the most sensible thing would be fixing up gold's multithreading and 
enabling it by default. We already get most of the benefits of multicore 
architectures by running multiple compile jobs in parallel (yes, I know you are 
focusing on cases where that for some reason doesn't work, but it is still the 
case in most situations). The main bottleneck is linking. The code is even 
already there in gold and has been for years; it just hasn't been deemed ready 
to be enabled by default.

Is anyone even working on that?

Best regards
Allan




Re: GCC Multi-Threading Ideas

2020-01-24 Thread Allan Sandfeld Jensen
On Freitag, 24. Januar 2020 04:38:48 CET Nicholas Krause wrote:
> On 1/23/20 12:19 PM, Nicholas Krause wrote:
> > On 1/23/20 3:39 AM, Allan Sandfeld Jensen wrote:
> >> On Montag, 20. Januar 2020 20:26:46 CET Nicholas Krause wrote:
> >>> Greetings All,
> >>> 
> >>> Unfortunately due to me being rather busy with school and other
> >>> things I
> >>> will not be able to post my article to the wiki for awhile. However
> >>> there is a  rough draft here:
> >>> https://docs.google.com/document/d/1po_RRgSCtRyYgMHjV0itW8iOzJXpTdHYIpC9
> >>> gUMj
> >>> 
> >>> Oxk/edit that may change a little for people to read in the meantime.
> >> 
> >> This comment might not be suited for your project, but now that I
> >> think about
> >> it: If we want to improve gcc toolchain buildspeed with better
> >> multithreading.
> >> I think the most sensible would be fixing up gold multithreading and
> >> enabling
> >> it by default. We already get most of the benefits of multicore
> >> architectures
> >> by running multiple compile jobs in parallel (yes, I know you are
> >> focusing on
> >> cases where that for some reason doesn't work, but it is still the
> >> case in
> >> most situations). The main bottleneck is linking. The code is even
> >> already
> >> there in gold and have been for years, it just haven't been deemed
> >> ready for
> >> being enabled by default.
> >> 
> >> Is anyone even working on that?
> >> 
> >> Best regards
> >> Allan
> > 
> > Allan,
> > You would need both depending on the project, some are more compiler
> > bottle necked and others linker. I mentioned that issue at Cauldron as
> > the other side would be the linker.
> > 
> > Nick
> 
> Sorry for the second message Allan but make -j does not scale well
> beyond 4 or
> 8 threads and that's considering a 4 core or 8 machine. 

It doesn't? I generally build with -j100, but then I also use icecream to 
distribute builds to multiple machines in the office. That probably also makes 
linking times more significant in my case.

'Allan






Re: GCC Multi-Threading Ideas

2020-01-24 Thread Allan Sandfeld Jensen
On Freitag, 24. Januar 2020 17:29:06 CET Nicholas Krause wrote:
> On 1/24/20 3:18 AM, Allan Sandfeld Jensen wrote:
> > On Freitag, 24. Januar 2020 04:38:48 CET Nicholas Krause wrote:
> >> On 1/23/20 12:19 PM, Nicholas Krause wrote:
> >>> On 1/23/20 3:39 AM, Allan Sandfeld Jensen wrote:
> >>>> On Montag, 20. Januar 2020 20:26:46 CET Nicholas Krause wrote:
> >>>>> Greetings All,
> >>>>> 
> >>>>> Unfortunately due to me being rather busy with school and other
> >>>>> things I
> >>>>> will not be able to post my article to the wiki for awhile. However
> >>>>> there is a  rough draft here:
> >>>>> https://docs.google.com/document/d/1po_RRgSCtRyYgMHjV0itW8iOzJXpTdHYIp
> >>>>> C9
> >>>>> gUMj
> >>>>> 
> >>>>> Oxk/edit that may change a little for people to read in the meantime.
> >>>> 
> >>>> This comment might not be suited for your project, but now that I
> >>>> think about
> >>>> it: If we want to improve gcc toolchain buildspeed with better
> >>>> multithreading.
> >>>> I think the most sensible would be fixing up gold multithreading and
> >>>> enabling
> >>>> it by default. We already get most of the benefits of multicore
> >>>> architectures
> >>>> by running multiple compile jobs in parallel (yes, I know you are
> >>>> focusing on
> >>>> cases where that for some reason doesn't work, but it is still the
> >>>> case in
> >>>> most situations). The main bottleneck is linking. The code is even
> >>>> already
> >>>> there in gold and have been for years, it just haven't been deemed
> >>>> ready for
> >>>> being enabled by default.
> >>>> 
> >>>> Is anyone even working on that?
> >>>> 
> >>>> Best regards
> >>>> Allan
> >>> 
> >>> Allan,
> >>> You would need both depending on the project, some are more compiler
> >>> bottle necked and others linker. I mentioned that issue at Cauldron as
> >>> the other side would be the linker.
> >>> 
> >>> Nick
> >> 
> >> Sorry for the second message Allan but make -j does not scale well
> >> beyond 4 or
> >> 8 threads and that's considering a 4 core or 8 machine.
> > 
> > It doesn't? I generally build with -j100, but then also use icecream to
> > distribute builds to multiple machines in the office. That probably also
> > makes linking times more significant to my case.
> > 
> > 'Allan
> 
> Allan,
> 
> I ran a gcc build on a machine with make -j32 and -j64 that had 64 cores.
> There was literally only a 4 minute increase in build speed. Good question
> through.
> 
Right. I guess it entirely depends on what you are building. If you are 
building gcc, it is probably bound by multiple configure runs and separate 
iterations. What I usually build is Qt and Chromium, where thousands of files 
can be compiled from a single configure run (more than 2 in the case of 
Chromium), plus those configure runs are much faster. For Chromium there is an 
almost linear speed-up with the number of parallel jobs you run, up to around 
100. With -j100 I can build Chromium in 10 minutes, with 2 minutes being 
linking time (5 minutes linking if using the bfd linker). With -j8 it takes 2 
hours.

But I guess that means multithreading the compiler can make sense in your 
case, even if it doesn't in mine.

Regards
'Allan




Warning on move and dereference of unique_ptr in the same expression

2020-02-03 Thread Allan Sandfeld Jensen
Hello gcc

I have now twice hit obscure bugs in Chromium that crashed on some compilers 
but not on others, and didn't produce any warnings on any compiler. I would 
like to know if this code is as undefined as I think it is, and if it would 
make sense to have gcc warn about it.

Both cases basically have this form:

std::unique_ptr<A> a;

a->b->callMethod(something, bind(callback, std::move(a)));

This crashed with MSVC and gcc 5, but not with newer gcc or with clang.

When it crashes it is because the arguments and the move therein have been 
evaluated before a->b is resolved.

I assume this is undefined behavior? So why isn't the warning for using and 
modifying in the same expression triggered?

Best regards
'Allan




Re: Warning on move and dereference of unique_ptr in the same expression

2020-02-03 Thread Allan Sandfeld Jensen
On Montag, 3. Februar 2020 21:47:13 CET Marek Polacek wrote:
> On Mon, Feb 03, 2020 at 09:26:40PM +0100, Allan Sandfeld Jensen wrote:
> > Hello gcc
> > 
> > I have now twice hit obscure bugs in Chromium that crashed on some
> > compilers but not on others, and didn't produce any warnings on any
> > compiler. I would like to know if this code is as undefined as I think it
> > is, and if it would make sense to have gcc warn about it.
> > 
> > Both cases basically has this form:
> > 
> > std::unique_ptr<A> a;
> > 
> > a->b->callMethod(something, bind(callback, std::move(a)));
> > 
> > This crashed with MSVC and gcc 5, but not with newer gcc or with clang.
> 
> You mean the application itself, not the compiler, presumably.
Of course.

> 
> > When it crashes it is because the arguments and the move therein have been
> > evaluated before a->b is resolved.
> > 
> > I assume this is undefined behavior? So why isn't the warning for using
> > and
> > modifying in the same expression triggered?
> 
> This should be defined in C++17, with P0145 in particular:
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0145r3.pdf
> which says that the expression that names the function is sequenced before
> every argument expression and every default argument.
> 
Right, thanks. That would explain why it works consistently with the latest 
gcc versions. I guess it is more of a corner case to ask for it to be warned 
about in -std=c++14 mode?

'Allan




Re: New x86-64 micro-architecture levels

2020-07-11 Thread Allan Sandfeld Jensen
On Freitag, 10. Juli 2020 19:30:09 CEST Florian Weimer via Gcc wrote:
> glibc (or an alternative loader implementation) would search for
> libraries starting at level D, going back to level A, and finally the
> baseline implementation in the default library location.
> 
> I expect that some distributions will also use these levels to set a
> baseline for the entire distribution (i.e., everything would be built to
> level A or maybe even level C), and these libraries would then be
> installed in the default location.
> 
> I'll be glad if I can get any feedback on this proposal.  I plan to turn
> it into a merge request for the x86-64 psABI document eventually.
> 
Sounds good, though if I could dream, I would also love a partial-replacement 
option, so that you could have a generic x86-64 binary that only had some 
AVX2-optimized replacement functions in a supplementary library.

Perhaps implemented by marking the library as a partial replacement, so the 
dynamic linker would also load the base or lower libraries, except for 
functions already resolved.

You could also add a level E for the AVX512 instructions in Ice Lake and 
above. The VBMI1/2 instructions would likely be useful for autovectorization 
in GCC.

'Allan




Non-inlined functions and mixed architectures

2020-07-22 Thread Allan Sandfeld Jensen
A problem that I keep running into is functions defined in headers but used in 
source files that are compiled with different CPU feature flags (for runtime 
CPU feature selection).

We know to make sure the functions are inlinable and their address is never 
taken, but of course in debug builds they are still not inlined. Every so 
often the functions get compiled using some of the optional CPU instructions, 
and if the linker selects the optimized versions, those instructions can then 
leak into instances compiled with different CPU flags where the instructions 
aren't supposed to be used. This happens even in unoptimized debug builds, as 
the extended instruction selection doesn't count as an optimization.

So far the main workaround for gcc has been to mark the functions as 
always_inline.

I have been wondering if you couldn't use the same technique you used to fix 
similar problems with mixed arches in LTO builds, and tag shared functions with 
their arch so they don't get merged by the linker?

I know the whole thing could technically be seen as an ODR violation, but it 
would still be great if it was something GCC could just handle out of the 
box.

Alternatively, a compile-time option to mark non-inlined inline functions as 
weak, or to not generate them at all, when compiling certain files would also 
work. 

Best regards
'Allan




Re: Non-inlined functions and mixed architectures

2020-07-27 Thread Allan Sandfeld Jensen
On Montag, 27. Juli 2020 10:33:35 CEST Florian Weimer wrote:
> * Allan Sandfeld Jensen:
> > A problem that I keep running into is functions defined headers, but used
> > in sources files that are compiled with different CPU feature flags (for
> > runtime CPU feature selection).
> > 
> > We know to make sure the functions are inlinable and their address never
> > taken, but of course in debug builds they are still not inlined. Every so
> > often the functions get compiled using some of the optional CPU
> > instructions, and if the linker selects the optimized versions those
> > instructions can then leak through to instances compiled with different
> > CPU flags where the instructions aren't supposed to be used. This happens
> > even in unoptimized debug builds as the extended instruction selections
> > doesn't count as an optimization.
> 
> You need to provide source code examples.  This isn't supposed to happen
> if you declare the functions as static inline.  If a function is emitted
> for any reason, it will be local this particular object file.
> 
> Plain inline (for C++) works differently and will attempt to share
> implementations.
> 
static inline? I hadn't thought of that for a shared header file.

It is harder to do with inline methods in C++ classes though.

A recent example I hit was methods using a qfloat16 class that specializes 
for F16C when available, see 
https://codereview.qt-project.org/c/qt/qtbase/+/307772. Which I guess ought to 
be split into different classes with different constructors, so they don't 
violate ODR rules, to be really safe across compilers.

But I guess a case like https://codereview.qt-project.org/c/qt/qtbase/+/308163 
could be solved with static inline instead.

Best regards
Allan




Re: Non-inlined functions and mixed architectures

2020-08-04 Thread Allan Sandfeld Jensen
On Montag, 27. Juli 2020 10:54:02 CEST Florian Weimer wrote:
> * Allan Sandfeld Jensen:
> > On Montag, 27. Juli 2020 10:33:35 CEST Florian Weimer wrote:
> >> * Allan Sandfeld Jensen:
> >> > A problem that I keep running into is functions defined headers, but
> >> > used
> >> > in sources files that are compiled with different CPU feature flags
> >> > (for
> >> > runtime CPU feature selection).
> >> > 
> >> > We know to make sure the functions are inlinable and their address
> >> > never
> >> > taken, but of course in debug builds they are still not inlined. Every
> >> > so
> >> > often the functions get compiled using some of the optional CPU
> >> > instructions, and if the linker selects the optimized versions those
> >> > instructions can then leak through to instances compiled with different
> >> > CPU flags where the instructions aren't supposed to be used. This
> >> > happens
> >> > even in unoptimized debug builds as the extended instruction selections
> >> > doesn't count as an optimization.
> >> 
> >> You need to provide source code examples.  This isn't supposed to happen
> >> if you declare the functions as static inline.  If a function is emitted
> >> for any reason, it will be local this particular object file.
> >> 
> >> Plain inline (for C++) works differently and will attempt to share
> >> implementations.
> > 
> > static inline? Hadn't thought of that in a shared header file.
> > 
> > Is harder to do with inline methods in C++ classes though.
> 
> Ahh, and anonymous namespaces (the equivalent for that for member
> functions) do not work in such cases because the representation of the
> class still needs to be shared across API boundaries.  With an anonymous
> namspace, that would be undefined.
> 
So, would it be possible to have a gcc extension or future C++ attribute that 
worked like 'static' on global functions but could be used on member functions 
(both static and otherwise)?

Perhaps make it universal, so it did the same thing no matter where it was 
used, instead of being contextual like 'static'.

Best regards
'Allan




Re: Non-inlined functions and mixed architectures

2020-08-04 Thread Allan Sandfeld Jensen
On Dienstag, 4. August 2020 19:44:57 CEST Florian Weimer wrote:
> * Allan Sandfeld Jensen:
> > On Montag, 27. Juli 2020 10:54:02 CEST Florian Weimer wrote:
> >> * Allan Sandfeld Jensen:
> >> > On Montag, 27. Juli 2020 10:33:35 CEST Florian Weimer wrote:
> >> >> * Allan Sandfeld Jensen:
> >> >> > A problem that I keep running into is functions defined headers, but
> >> >> > used
> >> >> > in sources files that are compiled with different CPU feature flags
> >> >> > (for
> >> >> > runtime CPU feature selection).
> >> >> > 
> >> >> > We know to make sure the functions are inlinable and their address
> >> >> > never
> >> >> > taken, but of course in debug builds they are still not inlined.
> >> >> > Every
> >> >> > so
> >> >> > often the functions get compiled using some of the optional CPU
> >> >> > instructions, and if the linker selects the optimized versions those
> >> >> > instructions can then leak through to instances compiled with
> >> >> > different
> >> >> > CPU flags where the instructions aren't supposed to be used. This
> >> >> > happens
> >> >> > even in unoptimized debug builds as the extended instruction
> >> >> > selections
> >> >> > doesn't count as an optimization.
> >> >> 
> >> >> You need to provide source code examples.  This isn't supposed to
> >> >> happen
> >> >> if you declare the functions as static inline.  If a function is
> >> >> emitted
> >> >> for any reason, it will be local this particular object file.
> >> >> 
> >> >> Plain inline (for C++) works differently and will attempt to share
> >> >> implementations.
> >> > 
> >> > static inline? Hadn't thought of that in a shared header file.
> >> > 
> >> > Is harder to do with inline methods in C++ classes though.
> >> 
> >> Ahh, and anonymous namespaces (the equivalent for that for member
> >> functions) do not work in such cases because the representation of the
> >> class still needs to be shared across API boundaries.  With an anonymous
> >> namspace, that would be undefined.
> > 
> > So, would it be possible to have a gcc extension or future C++ attribute
> > that worked like static on global functions but could be used on member
> > functions (both static and otherwise)?
> > 
> > Perhaps make it universal so it did the same no matter where it was used
> > instead of being contextual like 'static'.
> 
> One caveat is that things get somewhat interesting if such a function
> returns an object of static or thread storage duration.  In your
> application, you probably want to globalize such objects because they
> are all equivalent.  But there might be other cases where this is
> different.
> 
> vtables are tricky as well, but you probably avoid them in your
> scenario.
> 
Right, vtables would be a different story completely. I guess it would only 
make sense for non-virtual inline-declared methods, which means a universal 
attribute doesn't make sense.

Application-controlled runtime CPU switching with C++ interfaces will remain 
an unreliable hack.

Thanks
Allan







RFC: -fno-share-inlines

2020-08-10 Thread Allan Sandfeld Jensen
Following the previous discussion, this is a proposal for a patch that adds 
the flag -fno-share-inlines, which can be used when compiling individual 
source files with a different set of flags than the rest of the project.

It basically turns off comdat for inline functions, as if you compiled without 
support for 'weak' symbols, turning them all into "static" functions, even if 
that wouldn't normally be possible for that type of function. I am not sure if 
it breaks anything, which is why I am not sending it to the patch list.

I also considered, alternatively, turning the comdat generation off later, 
during assembler production, to ensure all processing and optimization of 
comdat functions would occur as normal.

Best regards
Allan

diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index 2b1aca16eb4..78e1f592126 100644
--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -1803,6 +1803,10 @@ frtti
 C++ ObjC++ Optimization Var(flag_rtti) Init(1)
 Generate run time type descriptor information.

+fshare-inlines
+C C++ ObjC ObjC++ Var(flag_share_inlines) Init(1)
+Emit non-inlined inline-declared functions to be shared between object files.
+
 fshort-enums
 C ObjC C++ ObjC++ LTO Optimization Var(flag_short_enums)
 Use the narrowest integer type possible for enumeration types.
diff --git a/gcc/cp/decl2.c b/gcc/cp/decl2.c
index 33c83773d33..8de796d16fc 100644
--- a/gcc/cp/decl2.c
+++ b/gcc/cp/decl2.c
@@ -1957,10 +1957,9 @@ adjust_var_decl_tls_model (tree decl)
 void
 comdat_linkage (tree decl)
 {
-  if (flag_weak)
-make_decl_one_only (decl, cxx_comdat_group (decl));
-  else if (TREE_CODE (decl) == FUNCTION_DECL
-	   || (VAR_P (decl) && DECL_ARTIFICIAL (decl)))
+  if ((!flag_share_inlines || !flag_weak)
+  && (TREE_CODE (decl) == FUNCTION_DECL
+	  || (VAR_P (decl) && DECL_ARTIFICIAL (decl))))
 /* We can just emit function and compiler-generated variables
statically; having multiple copies is (for the most part) only
a waste of space.
@@ -1978,6 +1977,8 @@ comdat_linkage (tree decl)
should perform a string comparison, rather than an address
comparison.  */
 TREE_PUBLIC (decl) = 0;
+  else if (flag_weak)
+make_decl_one_only (decl, cxx_comdat_group (decl));
   else
 {
   /* Static data member template instantiations, however, cannot


Re: Peephole optimisation: isWhitespace()

2020-08-17 Thread Allan Sandfeld Jensen
On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:
> Hi @ll,
> 
> in his ACM queue article ,
> Matt Godbolt used the function
> 
> | bool isWhitespace(char c)
> | {
> | 
> | return c == ' '
> | 
> |   || c == '\r'
> |   || c == '\n'
> |   || c == '\t';
> | 
> | }
> 
> as an example, for which GCC 9.1 emits the following assembly for AMD64
> processors (see ):
> |xor    eax, eax          ; result = false
> |cmp    dil, 32           ; is c > 32
> |ja     .L4               ; if so, exit with false
> |movabs rax, 4294977024   ; rax = 0x100002600
> |shrx   rax, rax, rdi     ; rax >>= c
> |and    eax, 1            ; result = rax & 1
> |
> |.L4:
> |ret
> 
No it doesn't. As your example shows, if you take the time to read it, that is 
what gcc emits when generating code for a _Haswell_ architecture. If you 
remove -march=haswell from the command line you get:

xor eax, eax
cmp dil, 32
ja  .L1
movabs  rax, 4294977024
mov ecx, edi
shr rax, cl
and eax, 1

It uses one mov more, but no shrx. 

'Allan




Re: RFC: -fno-share-inlines

2020-08-24 Thread Allan Sandfeld Jensen
On Montag, 24. August 2020 08:52:04 CEST Richard Biener wrote:
> On Mon, Aug 10, 2020 at 9:36 AM Allan Sandfeld Jensen
> 
>  wrote:
> > Following the previous discussion, this is a proposal for a patch that
> > adds
> > the flag -fno-share-inlines that can be used when compiling singular
> > source
> > files with a different set of flags than the rest of the project.
> > 
> > It basically turns off comdat for inline functions, as if you compiled
> > without support for 'weak' symbols. Turning them all into "static"
> > functions, even if that wouldn't normally be possible for that type of
> > function. Not sure if it breaks anything, which is why I am not sending
> > it to the patch list.
> > 
> > I also considered alternatively to turn the comdat generation off later
> > during assembler production to ensure all processing and optimization of
> > comdat functions would occur as normal.
> 
> We already have -fvisibility-inlines-hidden so maybe call it
> -fvisibility-inlines-static?
That could make sense.

> Does this option also imply 'static' vtables?
> 
I don't think so. It affects function declarations. It probably also affects 
virtual functions, but as far as I know the place I changed only affects the 
functions, not the virtual table. So there could still be a duplicated virtual 
table with the wrong methods, and it would then depend on linking whether it is 
used globally. I am not sure I dare change that; having different virtual 
tables for the same class depending on where it is allocated sounds like a way 
to break things.

I wouldn't use virtual tables anyway when dealing with mixed architectures.

Best regards
Allan




Re: [RFC] Increase libstdc++ line length to 100(?) columns

2020-11-27 Thread Allan Sandfeld Jensen
On Freitag, 27. November 2020 00:50:57 CET Jonathan Wakely via Gcc wrote:
> I've touched on the subject a few times, e.g.
> https://gcc.gnu.org/pipermail/gcc/2019-December/230993.html
> and https://gcc.gnu.org/pipermail/gcc/2019-December/231013.html
> 
> Libstdc++ code is indented by 2 columns for the enclosing namespace,
> usually another two for being in a template, and is full of __
> prefixes for reserved names. On top of that, modern C++ declarations
> are *noisy* (template head, requires-clause, noexcept-specifier, often
> 'constexpr' or 'inline' and 'explicit', and maybe some attributes.
> 
> All that gets hard to fit in 80 columns without compromising
> readability with line breaks in unnatural places.
> 
> Does anybody object to raising the line length for libstdc++ code
> (not the rest of GCC) to 100 columns?
> 
If you _do_ change it, I would suggest changing it to 120, which is the next 
common step for a lot of C++ projects.

Often also with an allowance for overruns if that makes the code cleaner.

'Allan





Re: [RFC] Increase libstdc++ line length to 100(?) columns

2020-11-29 Thread Allan Sandfeld Jensen
On Sonntag, 29. November 2020 18:38:15 CET Florian Weimer wrote:
> * Allan Sandfeld Jensen:
> > If you _do_ change it. I would suggest changing it to 120, which is next
> > common step for a lot of C++ projects.
> 
> 120 can be problematic for a full HD screen in portrait mode.  Nine
> pixels per character is not a lot (it's what VGA used), and you can't
> have any window decoration.  With a good font and screen, it's doable.
> But if the screen isn't quite sharp, then I think you wouldn't be able
> to use portrait mode anymore.

Using a standard condensed monospace font of 9px, with a character width of 
7px, 120 chars would take up 840px, fitting two windows in horizontal mode and 
one in vertical. 9px isn't fuzzy, and 8px variants are even narrower.

Sure, square monospace fonts might not fit, but that is an unusual 
configuration, easily worked around by living with a non-square monospace 
font or accepting occasional line overflow. Remember, nobody is suggesting 
every line should be that long, just allowing it, to allow better structural 
indentation.

'Allan




Re: [RFC] Increase libstdc++ line length to 100(?) columns

2020-11-30 Thread Allan Sandfeld Jensen
On Montag, 30. November 2020 16:47:08 CET Michael Matz wrote:
> Hello,
> 
> On Sun, 29 Nov 2020, Allan Sandfeld Jensen wrote:
> > On Sonntag, 29. November 2020 18:38:15 CET Florian Weimer wrote:
> > > * Allan Sandfeld Jensen:
> > > > If you _do_ change it. I would suggest changing it to 120, which is
> > > > next
> > > > common step for a lot of C++ projects.
> > > 
> > > 120 can be problematic for a full HD screen in portrait mode.  Nine
> > > pixels per character is not a lot (it's what VGA used), and you can't
> > > have any window decoration.  With a good font and screen, it's doable.
> > > But if the screen isn't quite sharp, then I think you wouldn't be able
> > > to use portrait mode anymore.
> > 
> > Using a standard condensed monospace font of 9px, it has a width of 7px,
> > 120
> A char width of 7px implies a cell width of at least 8px (so 960px for 120
> chars), more often of 9px.  With your cell width of 7px your characters
> will be max 6px, symmetric characters will be 5px, which is really small.
> 
I was talking about the full cell width. I tested it before commenting, 
measuring the width in pixels of a line of text.

'Allan




Re: DWZ 0.14 released

2021-03-09 Thread Allan Sandfeld Jensen
Btw, a question for gcc/binutils:

Is there any reason the work done by tools like dwz couldn't be done in the 
compiler or linker? It seems a bit odd to have a post-link tool that optimizes 
the generated debug info, when optimizations should already be enabled.

Best regards
Allan

On Montag, 8. März 2021 13:43:11 CET Tom de Vries wrote:
> Hi,
> 
> DWZ 0.14 has been released.
> 
> You can download dwz from the sourceware FTP server here:
> 
> https://sourceware.org/ftp/dwz/releases/
> ftp://sourceware.org/pub/dwz/releases/
> 
> The vital stats:
> 
>   Sizemd5sumName
>   184KiB  cf60e4a65d9cc38c7cdb366e9a29ca8e  dwz-0.14.tar.gz
>   144KiB  1f1225898bd40d63041d54454fcda5b6  dwz-0.14.tar.xz
> 
> There is a web page for DWZ at:
> 
> https://sourceware.org/dwz/
> 
> DWZ 0.14 includes the following changes and enhancements:
> 
> * DWARF 5 support. The tool now handles most of DWARF version 5
>   (at least everything emitted by GCC when using -gdwarf-5).
> 
>   Not yet supported are DW_UT_type units (DWARF 4 .debug_types
>   are supported), .debug_names (.gdb_index is supported) and some
>   forms and sections that are only emitted by GCC when
>   generating Split DWARF (DW_FORM_strx and .debug_str_offsets,
>   DW_FORM_addrx and .debug_addr, DW_FORM_rnglistx and
>   DW_FORM_loclistx). https://sourceware.org/PR24726
> 
> * .debug_sup support. DWARF Supplementary Object Files
>   (DWARF 5, section 7.3.6) can now be generated when using
>   the --dwarf-5 option. To keep compatibility with existing DWARF
>   consumers this isn't the default yet.
> 
>   Without the --dwarf-5 option instead of a .debug_sup section dwz
>   will generate a .gnu_debugaltlink section and will use
>   DW_FORM_GNU_strp_alt and DW_FORM_GNU_ref_alt, instead of
>   DW_FORM_strp_sup and DW_FORM_ref_sup
> 
> * An experimental optimization has been added that exploits the
>   One-Definition-Rule of C++.  It's enabled using the --odr option, and
>   off by default.  This optimization causes struct/union/class DIEs with
>   the same name to be considered equal.  The optimization can be set to
>   a lower aggressiveness level using --odr-mode=basic, to possibly be
>   able to workaround problems without having to switch off the
>   optimization altogether.
> 
> * The clean-up of temporary files in hardlink mode has been fixed.
> 
> * The DIE limits --low-mem-die-limit  / -l  and
>   --max-die-limit  / -L  can now be disabled using respectively
>   -l none and -L none.  Note that -l none disables the limit, whereas
>   -l 0 sets the limit to zero.
> 
> * The usage message has been:
>   - updated to show that -r and -M are exclusive.
>   - updated to show that -v and -? cannot be combined with other options.
>   - extended to list all options in detail.
>   - restyled to wrap at 80 chars.
> 
> * An option --no-import-optimize was added that switches off an
>   optimization that attempts to reduce the number of
>   DW_TAG_imported_unit DIEs.  This can be used f.i. in case the
>   optimization takes too long.
> 
> * A heuristic has been added that claims more memory earlier (without
>   increasing the peak memory usage) to improve compression time.
> 
> * A heuristic has been added that estimates whether one of the two DIE
>   limits will be hit.  If so, it will do an exact DIE count to verify
>   this.  If the exact DIE count finds that the low-mem DIE limit is
>   indeed hit, processing is done in low-mem mode from the start, rather
>   than processing in regular mode first.  If the exact DIE count finds
>   that the max DIE limit is indeed hit, processing is skipped
>   altogether.
> 
> * Various other performance improvements.
> 
> * A case where previously we would either hit the assertion
>   "dwz: dwz.c:9461: write_die: Assertion `refd != NULL' failed" (in
>   regular mode) or a segmentation fault (in low-mem mode), now is
>   handled by "dwz: Couldn't find DIE at DW_FORM_ref_addr offset 0x".
> 
> * A case where a reference from a partial unit to a compile unit was
>   generated has been fixed.  This could happen if a DIE was referenced
>   using a CU-relative DWARF operator.
> 
> * A case has been fixed for low-mem mode where instead of issuing
>   "dwz: Couldn't find DIE referenced by  DW_OP_GNU_implicit_pointer" dwz
>   would run into a segfault instead.
> 
> * A multi-file case where we run into ".debug_line reference above end
>   of section" has been fixed.
> 
> * The following assertion failures were fixed:
>   - dwz: dwz.c:9310: write_die: Assertion `
>   value && refdcu->cu_kind != CU_ALT
> ' failed.
>   - dwz: dwz.c:9920: recompute_abbrevs: Assertion `
>   off == cu_size
> ' failed.
> 
> * The assert condition of this assertion has been fixed:
>   - write_types: Assertion `ref && ref->die_dup == NULL'.






Constexpr in intrinsics?

2016-03-27 Thread Allan Sandfeld Jensen
Would it be possible to add constexpr to the intrinsics headers?

For instance _mm_set_XX and _mm_setzero intrinsics.

Ideally it could also be added all intrinsics that can be evaluated at compile 
time, but it is harder to tell which those are.

Does gcc have a C extension we can use to set constexpr?

Best regards
`Allan


Re: Constexpr in intrinsics?

2016-03-27 Thread Allan Sandfeld Jensen
On Sunday 27 March 2016, Marc Glisse wrote:
> On Sun, 27 Mar 2016, Allan Sandfeld Jensen wrote:
> > Would it be possible to add constexpr to the intrinsics headers?
> > 
> > For instance _mm_set_XX and _mm_setzero intrinsics.
> 
> Already suggested here:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65197
> 
> A patch would be welcome (I started doing it at some point, I don't
> remember if it was functional, the patch is attached).
> 
That looks very similar to the patch I experimented with, and that at least 
works for using them in C++11 constexpr functions.

> > Ideally it could also be added all intrinsics that can be evaluated at
> > compile time, but it is harder to tell which those are.
> > 
> > Does gcc have a C extension we can use to set constexpr?
> 
> What for?

To have similar functionality in C. For instance, to explicitly allow those 
functions to be evaluated at compile time, and values with similar attributes 
to be optimized out completely. And of course to avoid preprocessor noise in 
shared C/C++ headers like these.

Best regards
`Allan


Re: [PING][RFC] Assertions as optimization hints

2016-11-27 Thread Allan Sandfeld Jensen
On Wednesday 23 November 2016, Richard Biener wrote:
> On Tue, Nov 22, 2016 at 6:52 PM, Yuri Gribov  wrote:
> > Hi all,
> > 
> > I've recently revisited an ancient patch from Paolo
> > (https://gcc.gnu.org/ml/gcc-patches/2004-04/msg00551.html) which uses
> > asserts as optimization hints. I've rewritten the patch to be more
> > stable under expressions with side-effects and did some basic
> > investigation of it's efficacy.
> > 
> > Optimization is hidden under !defined NDEBUG && defined
> > __ASSUME_ASSERTS__. !NDEBUG-part is necessary because assertions often
> > rely on special !NDEBUG-protected support code outside of assert
> > (dedicated fields in structures and similar stuff, collectively called
> > "ghost variables"). __ASSUME_ASSERTS__ gives user a choice whether to
> > enable optimization or not (should probably be hidden under a friendly
> > compiler switch e.g. -fassume-asserts).
> > 
> > I do not have access to a good machine for speed benchmarks so I only
> > looked at size improvements in few popular projects. There are no
> > revolutionary changes (0.1%-1%) but some functions see good reductions
> > which may result in noticeable runtime improvements in practice. One
> > good  example is MariaDB where you frequently find the following
> > 
> > pattern:
> >   struct A {
> >   
> > virtual void foo() { assert(0); }
> >   
> >   };
> >   ...
> >   A *a;
> >   a->foo();
> > 
> > Here the patch will prevent GCC from inlining A::foo (as it'll figure
> > out that it's impossible to occur at runtime) thus saving code size.
> > 
> > Does this approach make sense in general? If it does I can probably
> > come up with more measurements.
> > 
> > As a side note, at least some users may consider this a useful feature:
> > http://www.nntp.perl.org/group/perl.perl5.porters/2013/11/msg209482.html
> 
> You should CC relevant maintainers or annotate the subject -- this is
> a C/C++ frontend patch introducing __builtin_has_side_effects_p
> plus a patch adding a GCC supplied assert.h header.
> 
> Note that from a distribution point of view I wouldn't enable
> assume-asserts for a distro-build given the random behavior of
> __builtin_unreachable in case of assert failure.
> 
One option could be to provide such behaviour as new builtins, to be used for 
GSL implementations of Expects() and Ensures(). See 
https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md 
and https://github.com/Microsoft/GSL/blob/master/gsl/gsl_assert

But I think using the information from asserts should be safe and useful too, 
though there might be problems here and there with asserts that only make 
sense in #ifndef NDEBUG builds.

`Allan



Re: Change to C++11 by default?

2015-05-08 Thread Allan Sandfeld Jensen
On Thursday 07 May 2015, Jason Merrill wrote:
> I think it's time to switch to C++11 as the default C++ dialect for GCC
> 6.  Any thoughts?
> 
Would it be unrealistic to make C++14 the default? With it being a fixup of 
C++11, I would guess it could have longer staying power as the default.

`Allan


Re: Enabling LTO for target libraries (e.g., libgo, libstdc++)

2019-01-25 Thread Allan Sandfeld Jensen
On Freitag, 25. Januar 2019 07:22:36 CET Nikhil Benesch wrote:
> Does anyone have advice to offer? Has anyone tried convincing the build
> system to compile some of the other target libraries (like libstdc++ or
> libgfortran) with -flto?
> 
Make sure the static versions of the libraries are partially linked before 
being archived, so they do not have LTO code in the final libraries. Roughly 
(with illustrative file names):

gcc -r obj1.o obj2.o -o libname.o
ar rcs libname.a libname.o

Never done it with gcc libraries though.

'Allan




Re: GCC turns &~ into | due to undefined bit-shift without warning

2019-03-21 Thread Allan Sandfeld Jensen
On Montag, 11. März 2019 10:14:49 CET Jakub Jelinek wrote:
> On Mon, Mar 11, 2019 at 08:49:30AM +, Moritz Strübe wrote:
> > Considering that C11 6.5.7#3 ("If  the  value  of  the  right operand 
> > is  negative  or  is greater than or equal to the width of the promoted
> > left operand, the behavior is undefined.") is not very widely known, as
> > it "normally" just works, inverting the intent is quite unexpected.
> > 
> > Is there any option that would have helped me with this?
> 
> You could build with -fsanitize=undefined, that would tell you at runtime
> you have undefined behavior in your code (if the SingleDiff has bit ever
> 0x20 set).
> 
> The fact that negative or >= bit precision shifts are UB is widely known,
> and even if it wouldn't, for the compiler all the UBs are just UBs, the
> compiler optimizes on the assumption that UB does not happen, so when it
> sees 32-bit int << (x & 32), it can assume x must be 0 at that point,
> anything else is UB.
> 
Hmm, I am curious. How strongly would gcc assume x is 0?

What if you have an expression that is undefined if x is not zero, x really 
isn't zero, and the result is temporarily undefined, but then another 
statement or part of the expression fixes the final result to something 
defined regardless of the intermediate value? Would the compiler assume the 
intermediate value is never undefined, and possibly carry that analysed 
information over into other expressions?

From having fixed UBSAN warnings, I have seen many cases where undefined 
behavior was performed, but where the code was aware of it and the final 
result of the expression was well defined nonetheless.

'Allan




Re: GCC turns &~ into | due to undefined bit-shift without warning

2019-03-22 Thread Allan Sandfeld Jensen
On Donnerstag, 21. März 2019 23:31:48 CET Jakub Jelinek wrote:
> On Thu, Mar 21, 2019 at 11:19:54PM +0100, Allan Sandfeld Jensen wrote:
> > Hmm, I am curious. How strongly would gcc assume x is 0?
> 
> If x is not 0, then it is undefined behavior and anything can happen,
> so yes, it can assume x is 0, sometimes gcc does that, sometimes not,
> it is not required to do that.
> 
> > From having fixed UBSAN warnings, I have seen many cases where undefined
> > behavior was performed, but where the code was aware of it and the final
> 
> Any program where it printed something (talking about -fsanitize=undefined,
> not the few sanitizers that go beyond what is required by the language)
> is undefined, period.  It can happen to "work" as some users expect, it can
> crash, it can format your disk or anything else.  There is no well defined
> after a process runs into UB.
> 
That's nonsense and you know it. There are plenty of things that are undefined 
by the C standard that we rely on anyway.

But getting back to the question: will GCC carry such information further, 
and thus break code that otherwise behaves correctly on all known 
architectures, just because the C standard hasn't decided on one of two 
possible results?

'Allan





Re: GCC turns &~ into | due to undefined bit-shift without warning

2019-03-22 Thread Allan Sandfeld Jensen
On Freitag, 22. März 2019 11:02:39 CET Andrew Haley wrote:
> On 3/21/19 10:19 PM, Allan Sandfeld Jensen wrote:
> > From having fixed UBSAN warnings, I have seen many cases where undefined
> > behavior was performed, but where the code was aware of it and the final
> > result of the expression was well defined nonetheless.
> 
> Is this belief about undefined behaviour commonplace among C programmers?
> There's nothing in the standard to justify it: any expression which contains
> UB is undefined.

Yes, even GCC uses undefined behavior where it is considered defined for a 
specific architecture, whether it be the result of unaligned access, negative 
shifts, etc. Many of the things UBSAN warns about you will find both in GCC 
itself, the Linux kernel, and many other places.

'Allan





Re: GCC turns &~ into | due to undefined bit-shift without warning

2019-03-22 Thread Allan Sandfeld Jensen
On Freitag, 22. März 2019 14:38:10 CET Andrew Haley wrote:
> On 3/22/19 10:20 AM, Allan Sandfeld Jensen wrote:
> > On Freitag, 22. März 2019 11:02:39 CET Andrew Haley wrote:
> >> On 3/21/19 10:19 PM, Allan Sandfeld Jensen wrote:
> >>> From having fixed UBSAN warnings, I have seen many cases where undefined
> >>> behavior was performed, but where the code was aware of it and the final
> >>> result of the expression was well defined nonetheless.
> >> 
> >> Is this belief about undefined behaviour commonplace among C programmers?
> >> There's nothing in the standard to justify it: any expression which
> >> contains UB is undefined.
> > 
> > Yes, even GCC uses undefined behavior when it is considered defined for
> > specific architecture,
> 
> If it's defined for a specific architecture it's not undefined. Any compiler
> is entitled to do anything with UB, and "anything" includes extending the
> language to make it well defined.

True, but in the context of "things UBSAN warns about", that includes 
architecture specific details.

And isn't unaligned access real undefined behavior that just happens to work 
on x86 (and newer ARM)?

There is also stuff like type-punning unions, which is not architecture 
specific and technically undefined, but which GCC explicitly tolerates (and 
needs to, since some NEON intrinsics use it).

'Allan






On -march=x86-64

2019-07-11 Thread Allan Sandfeld Jensen
Years ago I discovered Chrome was optimizing with -march=x86-64, and knowing 
it was an undocumented arch that would optimize for K8, I laughed at it and 
just removed that piece of idiocy from our fork of Chromium, so it would be 
faster than upstream. Recently though I noticed Phoronix now also sometimes 
optimizes with -march=x86-64 instead of using, well, nothing, as they should. 
And checking recent GCC documentation I noticed that the gcc 8 and gcc 9 
documentation now documents -march=x86-64 and calls it a generic 64-bit 
processor. So it is no longer a laughing matter that people mistake it for 
that.

I would suggest that instead of fixing the documentation to say what 
-march=x86-64 actually does, we should perhaps change it to do what people 
expect and make it an alias for generic.

Best regards
'Allan





Re: On -march=x86-64

2019-07-14 Thread Allan Sandfeld Jensen
On Donnerstag, 11. Juli 2019 20:58:04 CEST Allan Sandfeld Jensen wrote:
> Years ago I discovered Chrome was optimizing with  -march=x86-64, and
> knowing was an undocumented arch that would optimize for K8 I laughed at it
> and just removed that piece of idiocy from our fork of Chromium, so it
> would be faster than upstream. Recently though I noticed phoronix is also
> using now sometimes optimize with -march=x86-64 instead of using, well..
> nothing. as they should. And checking recent GCC documentation I noticed
> that in gcc 8 and gcc 9 documentation you are now documenting -march=x86-64
> and calling it a generic 64-bit processor. So now it is no longer a
> laughing matter that people mistake it for that.
> 
> I would suggest instead of fixing the documentation to say what
> -march=x86-64 actually does, that we should perhaps change it to do what
> people expect and make it an alias for generic?
> 
Never mind, it was fixed back in 2007 when the generic architecture was 
introduced. The arch table is just misleading here.

Best regards
'Allan




Re: C2X Proposal, merge '.' and '->' C operators

2019-12-21 Thread Allan Sandfeld Jensen
On Monday, 16 December 2019 14:51:38 CET J Decker wrote:
> Here's the gist of what I would propose...
> https://gist.github.com/d3x0r/f496d0032476ed8b6f980f7ed31280da
> 
> In C, there are two operators . and -> used to access members of struct and
> union types. These operators are specified such that they are always paired
> in usage; for example, if the left hand expression is a pointer to a struct
> or union, then the operator -> MUST be used. There is no occasion where .
> and -> may be interchanged, given the existing specification.
> 
> It should be very evident to the compiler whether the token before '.' or
> '->' is a pointer to a struct/union or a struct/union, and just build the
> appropriate output.
> 
> The source modification for the compiler is very slight, even depending on
> flag_c2x(that's not it's name).  It ends up changing a lot of existing
> lines, just to change their indentation; but that shouldn't really count
> against 'changed lines'.
> 
> I'm sure, after 4 score and some years ('78-19) that it must surely have
> come up before?  Anyone able to point me to those existing proposals?
> 
What if you operate on a pointer to a pointer to a struct? Should the same 
operator just magically dereference everything until it is a struct?

I disagree with this proposal, because separating a thing from a pointer to a 
thing is fundamental to C/C++, and providing short-cuts that confuse the two 
does a disservice to anyone who needs to learn it.

Besides, isn't this the wrong mailing list for this?

'Allan




Re: C++17 by default in gcc-8

2017-03-26 Thread Allan Sandfeld Jensen
On Sunday 26 March 2017, Egor Pugin wrote:
> Hi,
> 
> It's a bit early question, but still.
> C++ releases became more predictive and regular.
> What do you think about settings -std=c++17 (or gnu++17) for gcc-8 as
> default c++ mode?
> What is your policy regarding default c++ standards in gcc?

It would make sense, in that it would be the second release with full C++17 
support, so we could expect the support to be mature in gcc 8. I am a little 
sceptical personally though, because I just recently got burned by a C++17 
change, where a number of uses of functor classes broke because pointers to 
functions with noexcept are now a separate type. So code that worked with 
C++14 is now illegal in C++17 and requires doubling the number of specialized 
functor templates, and it still isn't fully source compatible with C++14 
(though all nice clean code will be compatible).

Best regards
`Allan


Re: [RFC] GCC 8 Project proposal: Extensions supporting C Metaprogramming, pseudo-templates

2017-05-09 Thread Allan Sandfeld Jensen
On Tuesday 09 May 2017, Daniel Santos wrote:
> The primary aim is to facilitate high-performance generic C
> libraries for software where C++ is not suitable, but the cost of
> run-time abstraction is unacceptable. A good example is the Linux
> kernel, where the source tree is littered with more than 100 hand-coded
> or boiler-plate (copy, paste and edit) search cores required to use the
> red-black tree library.
That is not a good excuse, they can just use a defined subset of C++. The cost 
of C++ abstractions is zero if you don't use them. 

> 
> To further the usefulness of such techniques, I propose the addition of
> a c-family attribute to declare a parameter, variable (and possibly
> other declarations) as "constprop" or some similar word. The purpose of
> the attribute is to:
> 
> 1.) Emit a warning or error when the value is not optimized away, and
> 2.) Direct various optimization passes to prefer (or force) either
> cloning or inlining of a function with such a parameter.
> 
This I can get more behind; I have wanted a constexpr attribute for C for 
some time. If nothing else, it could be used in shared/system headers that 
are consumed by both C and C++, where in C++ it would be useful in constexpr 
expressions. If you can find a use for it in pure C as well, so much the 
better.

`Allan


Re: Quantitative analysis of -Os vs -O3

2017-08-26 Thread Allan Sandfeld Jensen
On Samstag, 26. August 2017 10:56:16 CEST Markus Trippelsdorf wrote:
> On 2017.08.26 at 01:39 -0700, Andrew Pinski wrote:
> > First let me put into some perspective on -Os usage and some history:
> > 1) -Os is not useful for non-embedded users
> > 2) the embedded folks really need the smallest code possible and
> > usually will be willing to afford the performance hit
> > 3) -Os was a mistake for Apple to use in the first place; they used it
> > and then GCC got better for PowerPC to use the string instructions
> > which is why -Oz was added :)
> > 4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.
> > 
> > Comparing -O3 to -Os is not totally fair on x86 due to the many
> > different instructions and encodings.
> > Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
> > big issue.
> > I soon have a need to keep overall (bare-metal) application size down
> > to just 256k.
> > Micro-controllers are places where -Os matters the most.
> > 
> > This comment does not help my application usage.  It rather hurts it
> > and goes against what -Os is really about.  It is not about reducing
> > icache pressure but overall application code size.  I really need the
> > code to fit into a specific size.
> 
> For many applications using -flto does reduce code size more than just
> going from -O2 to -Os.

I added the option to optimize with -Os in Qt, and it gives an average 15% 
reduction in binary size, sometimes as high as 25%. Using LTO gives almost the 
same (slightly less), but the two options combine perfectly, and using both 
can reduce binary size by 20 to 40%. And that is on a shared library, not even 
a statically linked binary.

The only real minus is that some of the libraries, especially QtGui, would 
benefit from auto-vectorization, so it would be nice if there existed an -O3s 
version which vectorized the most obvious vectorizable functions; a few 
hundred bytes for an additional version here and there would do good. 
Fortunately it doesn't do too much damage, as we have manually vectorized 
routines in order to have good performance also on MSVC; if we relied more on 
auto-vectorization it would be worse.

`Allan



Re: Quantitative analysis of -Os vs -O3

2017-08-26 Thread Allan Sandfeld Jensen
On Samstag, 26. August 2017 12:59:06 CEST Markus Trippelsdorf wrote:
> On 2017.08.26 at 12:40 +0200, Allan Sandfeld Jensen wrote:
> > On Samstag, 26. August 2017 10:56:16 CEST Markus Trippelsdorf wrote:
> > > On 2017.08.26 at 01:39 -0700, Andrew Pinski wrote:
> > > > First let me put into some perspective on -Os usage and some history:
> > > > 1) -Os is not useful for non-embedded users
> > > > 2) the embedded folks really need the smallest code possible and
> > > > usually will be willing to afford the performance hit
> > > > 3) -Os was a mistake for Apple to use in the first place; they used it
> > > > and then GCC got better for PowerPC to use the string instructions
> > > > which is why -Oz was added :)
> > > > 4) -Os is used heavily by the arm/thumb2 folks in bare metal
> > > > applications.
> > > > 
> > > > Comparing -O3 to -Os is not totally fair on x86 due to the many
> > > > different instructions and encodings.
> > > > Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
> > > > big issue.
> > > > I soon have a need to keep overall (bare-metal) application size down
> > > > to just 256k.
> > > > Micro-controllers are places where -Os matters the most.
> > > > 
> > > > This comment does not help my application usage.  It rather hurts it
> > > > and goes against what -Os is really about.  It is not about reducing
> > > > icache pressure but overall application code size.  I really need the
> > > > code to fit into a specific size.
> > > 
> > > For many applications using -flto does reduce code size more than just
> > > going from -O2 to -Os.
> > 
> > I added the option to optimize with -Os in Qt, and it gives an average 15%
> > reduction in binary size, somtimes as high as 25%. Using lto gives almost
> > the same (slightly less), but the two options combine perfectly and using
> > both can reduce binary size from 20 to 40%. And that is on a shared
> > library, not even a statically linked binary.
> > 
> > Only real minus is that some of the libraries especially QtGui would
> > benefit from a auto-vectorization, so it would be nice if there existed
> > an -O3s version which vectorized the most obvious vectorizable functions,
> > a few hundred bytes for an additional version here and there would do
> > good. Fortunately it doesn't too much damage as we have manually
> > vectorized routines for to have good performance also on MSVC, if we
> > relied more on auto- vectorization it would be worse.
> 
> In that case using profile guided optimizations will help. It will
> optimize cold functions with -Os and hot functions with -O3 (when using
> e.g.: "-flto -O3 -fprofile-use"). Of course you will have to compile
> twice and also collect training data from your library in between.

Yeah. That is just more problematic in practice, though I do believe we have 
support for it. It is good to know it will automatically upgrade optimizations 
like that. I just wish there was a way to distribute pre-generated 
arch-independent training data.

`Allan 



Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Allan Sandfeld Jensen
On Dienstag, 12. September 2017 23:27:22 CEST Michael Clark wrote:
> > On 13 Sep 2017, at 1:57 AM, Wilco Dijkstra  wrote:
> > 
> > Hi all,
> > 
> > At the GNU Cauldron I was inspired by several interesting talks about
> > improving GCC in various ways. While GCC has many great optimizations, a
> > common theme is that its default settings are rather conservative. As a
> > result users are required to enable several additional optimizations by
> > hand to get good code. Other compilers enable more optimizations at -O2
> > (loop unrolling in LLVM was mentioned repeatedly) which GCC could/should
> > do as well.
> 
> There are some nuances to -O2. Please consider -O2 users who wish use it
> like Clang/LLVM’s -Os (-O2 without loop vectorisation IIRC).
> 
> Clang/LLVM has an -Os that is like -O2 so adding optimisations that increase
> code size can be skipped from -Os without drastically effecting
> performance.
> 
> This is not the case with GCC where -Os is a size at all costs optimisation
> mode. GCC users option for size not at the expense of speed is to use -O2.
> 
> Clang GCC
> -Oz   ~=  -Os
> -Os   ~=  -O2
> 
No. Clang's -Os is somewhat limited compared to gcc's, just like the clang -Og 
is just -O1. AFAIK -Oz is a proprietary Apple clang parameter, and not in 
clang proper.

'Allan


Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Allan Sandfeld Jensen
On Mittwoch, 13. September 2017 15:46:09 CEST Jakub Jelinek wrote:
> On Wed, Sep 13, 2017 at 03:41:19PM +0200, Richard Biener wrote:
> > On its own -O3 doesn't add much (some loop opts and slightly more
> > aggressive inlining/unrolling), so whatever it does we
> > should consider doing at -O2 eventually.
> 
> Well, -O3 adds vectorization, which we don't enable at -O2 by default.
> 
Would it be possible to enable basic block vectorization on -O2? I assume that 
doesn't increase binary size since it doesn't unroll loops.

'Allan



Resolving LTO symbols for static library boundary

2018-02-05 Thread Allan Sandfeld Jensen
Hello GCC

In trying to make it possible to use LTO for distro builds of Qt, I have 
again hit the problem of static libraries. In Qt we generally rely on a 
library boundary for LTO, where LTO gets resolved when generating the library, 
but no LTO symbols are exported in the shared library. This ensures the 
library has a well-defined binary compatible interface and gets LTO 
optimizations within each library. For some private libraries we use static 
libraries, however, because we don't need binary compatibility. But though we 
don't need BC between Qt versions, the libraries should still be linkable with 
different gcc versions (and with different compilers). However, when LTO is 
enabled, the static libraries will contain definitions that depend on a single 
gcc version, making them unsuitable for distribution.

One solution is to enable fat-lto object files for static libraries but that 
is both a waste of space and compile time, and disables any LTO optimization 
within the library. Ideally I would like to have the static library do LTO 
optimizations internally just like a shared library, but then exported as 
static library.

I suspect this is more of a gcc task than an ar/ld task, as it basically 
boils down to gcc doing for a static library what it does for a shared 
library, but maybe exporting the result as a single combined .o file that can 
then be ar'ed into a compatible static library.

Is this possible?

Best regards
'Allan Jensen


Re: Resolving LTO symbols for static library boundary

2018-02-07 Thread Allan Sandfeld Jensen
On Dienstag, 6. Februar 2018 01:01:29 CET Jan Hubicka wrote:
> Dne 2018-02-05 18:44, Richard Biener napsal:
> > On February 5, 2018 12:26:58 PM GMT+01:00, Allan Sandfeld Jensen
> > 
> >  wrote:
> >> Hello GCC
> >> 
> >> In trying to make it possible to use LTO for distro-builds of Qt, I
> >> have again
> >> hit the problem of static libraries. In Qt in general we for LTO rely
> >> on a
> >> library boundary, where LTO gets resolved when generating the library
> >> but no
> >> LTO-symbols are exported in the shared library. This ensure the
> >> library
> >> has a
> >> well defined binary compatible interface and gets LTO optimizations
> >> within
> >> each library. For some private libraries we use static libraries
> >> however,
> >> because we don't need binary compatibility, but though we don't need
> >> BC
> >> 
> >> between Qt versions, the libraries should still be linkable with
> >> different gcc
> >> versions (and with different compilers). However when LTO is enabled
> >> the
> >> static libraries will contain definitions that depend on a single gcc
> >> version
> >> making it unsuitable for distribution.
> >> 
> >> One solution is to enable fat-lto object files for static libraries
> >> but
> >> that
> >> is both a waste of space and compile time, and disables any LTO
> >> optimization
> >> within the library. Ideally I would like to have the static library do
> >> LTO
> >> optimizations internally just like a shared library, but then exported
> >> as
> >> static library.
> >> 
> >> I suspect this is more of gcc task than ar/ld task, as it basically
> >> boils down
> >> to gcc doing for a static library what it does for shared library, but
> >> maybe
> >> export the result as a single combined .o file, that can then be ar'ed
> >> into a
> >> compatible static library.
> >> 
> >> Is this possible?
> > 
> > Hmm. I think you could partially link the static archive contents into
> > a single relocatable object. Or we could add a mode where you do a
> > 1to1 LTO link of the objects and stop at the ltrans object files. You
> > could stuff those into an archive again.
> > 
> > I'm not sure how far Honza got partial LTO linking to work?
> 
> Parital linking of lto .o files into single non-lto .o file should work
> and it will get you cross-module optimization done. The problem is that
> without resolution info compiler needs to assume that all symbols
> exported by object files are possibly referneced by the later
> incremental link and thus the code quality will definitly not be
> comparable with what you get for LTO on final binary or DSO. Still
> should be better than non-lto build.
> I would be curious if it is useful for you in practice.
> 
How would I do that partial link, and what are the requirements?

Best regards
'Allan




Enabling -ftree-slp-vectorize on -O2/Os

2018-05-26 Thread Allan Sandfeld Jensen
I brought this subject up earlier, and was told to suggest it again for gcc 9, 
so I have attached the preliminary changes.

My studies have shown that with generic x86-64 optimization it reduces binary 
size with around 0.5%, and when optimizing for x64 targets with SSE4 or 
better, it reduces binary size by 2-3% on average. The performance changes are 
negligible however*, and I haven't been able to detect changes in compile time 
big enough to penetrate general noise on my platform, but perhaps someone has 
a better setup for that?

* I believe that is because it currently works best on non-optimized code: it 
is better at big basic blocks doing all kinds of things than at tightly 
written inner loops.

Anything else I should test or report?

Best regards
'Allan


diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index beba295bef5..05851229354 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -7612,6 +7612,7 @@ also turns on the following optimization flags:
 -fstore-merging @gol
 -fstrict-aliasing @gol
 -ftree-builtin-call-dce @gol
+-ftree-slp-vectorize @gol
 -ftree-switch-conversion -ftree-tail-merge @gol
 -fcode-hoisting @gol
 -ftree-pre @gol
@@ -7635,7 +7636,6 @@ by @option{-O2} and also turns on the following optimization flags:
 -floop-interchange @gol
 -floop-unroll-and-jam @gol
 -fsplit-paths @gol
--ftree-slp-vectorize @gol
 -fvect-cost-model @gol
 -ftree-partial-pre @gol
 -fpeel-loops @gol
@@ -8932,7 +8932,7 @@ Perform loop vectorization on trees. This flag is enabled by default at
 @item -ftree-slp-vectorize
 @opindex ftree-slp-vectorize
 Perform basic block vectorization on trees. This flag is enabled by default at
-@option{-O3} and when @option{-ftree-vectorize} is enabled.
+@option{-O2} or higher, and when @option{-ftree-vectorize} is enabled.
 
 @item -fvect-cost-model=@var{model}
 @opindex fvect-cost-model
diff --git a/gcc/opts.c b/gcc/opts.c
index 33efcc0d6e7..11027b847e8 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -523,6 +523,7 @@ static const struct default_options default_options_table[] =
 { OPT_LEVELS_2_PLUS, OPT_fipa_ra, NULL, 1 },
 { OPT_LEVELS_2_PLUS, OPT_flra_remat, NULL, 1 },
 { OPT_LEVELS_2_PLUS, OPT_fstore_merging, NULL, 1 },
+{ OPT_LEVELS_2_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },
 
 /* -O3 optimizations.  */
 { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
@@ -539,7 +540,6 @@ static const struct default_options default_options_table[] =
 { OPT_LEVELS_3_PLUS, OPT_floop_unroll_and_jam, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_fgcse_after_reload, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_ftree_loop_vectorize, NULL, 1 },
-{ OPT_LEVELS_3_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_fvect_cost_model_, NULL, VECT_COST_MODEL_DYNAMIC },
 { OPT_LEVELS_3_PLUS, OPT_fipa_cp_clone, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_ftree_partial_pre, NULL, 1 },





Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-26 Thread Allan Sandfeld Jensen
On Sunday, 27 May 2018 00:05:32 CEST Segher Boessenkool wrote:
> On Sat, May 26, 2018 at 11:32:29AM +0200, Allan Sandfeld Jensen wrote:
> > I brought this subject up earlier, and was told to suggest it again for
> > gcc 9, so I have attached the preliminary changes.
> > 
> > My studies have shown that with generic x86-64 optimization it reduces
> > binary size with around 0.5%, and when optimizing for x64 targets with
> > SSE4 or better, it reduces binary size by 2-3% on average. The
> > performance changes are negligible however*, and I haven't been able to
> > detect changes in compile time big enough to penetrate general noise on
> > my platform, but perhaps someone has a better setup for that?
> > 
> > * I believe that is because it currently works best on non-optimized code,
> > it is better at big basic blocks doing all kinds of things than tightly
> > written inner loops.
> > 
> > Anything else I should test or report?
> 
> What does it do on other architectures?
> 
> 
I believe NEON would do the same as SSE4, but I can do a check. For 
architectures without SIMD it essentially does nothing.

'Allan




Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-27 Thread Allan Sandfeld Jensen
On Sunday, 27 May 2018 03:23:36 CEST Segher Boessenkool wrote:
> On Sun, May 27, 2018 at 01:25:25AM +0200, Allan Sandfeld Jensen wrote:
> > On Sunday, 27 May 2018 00:05:32 CEST Segher Boessenkool wrote:
> > > On Sat, May 26, 2018 at 11:32:29AM +0200, Allan Sandfeld Jensen wrote:
> > > > I brought this subject up earlier, and was told to suggest it again
> > > > for
> > > > gcc 9, so I have attached the preliminary changes.
> > > > 
> > > > My studies have shown that with generic x86-64 optimization it reduces
> > > > binary size with around 0.5%, and when optimizing for x64 targets with
> > > > SSE4 or better, it reduces binary size by 2-3% on average. The
> > > > performance changes are negligible however*, and I haven't been able
> > > > to
> > > > detect changes in compile time big enough to penetrate general noise
> > > > on
> > > > my platform, but perhaps someone has a better setup for that?
> > > > 
> > > > * I believe that is because it currently works best on non-optimized
> > > > code,
> > > > it is better at big basic blocks doing all kinds of things than
> > > > tightly
> > > > written inner loops.
> > > > 
> > > > Anything else I should test or report?
> > > 
> > > What does it do on other architectures?
> > 
> > I believe NEON would do the same as SSE4, but I can do a check. For
> > architectures without SIMD it essentially does nothing.
> 
> Sorry, I wasn't clear.  What does it do to performance on other
> architectures?  Is it (almost) always a win (or neutral)?  If not, it
> doesn't belong in -O2, not for the generic options at least.
> 
It shouldn't have any way of making code slower, so it is neutral or a win in 
performance, and similarly in code size: merged instructions mean fewer 
instructions.

I never found a benchmark where it really made a measurable difference in 
performance, but I found many large binaries such as Qt or Chromium, where it 
made the binaries a few percent smaller.

Allan




Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-28 Thread Allan Sandfeld Jensen
On Monday, 28 May 2018 12:58:20 CEST Richard Biener wrote:
> compile-time effects of the patch on that. Embedded folks may want to run
> their favorite benchmark and report results as well.
> 
> So I did a -O2 -march=haswell [-ftree-slp-vectorize] SPEC CPU 2006 compile
> and run and the compile-time
> effect where measurable (SPEC records on a second granularity) is within
> one second per benchmark
> apart from 410.bwaves (from 3s to 5s)  and 481.wrf (76s to 78s).
> Performance-wise I notice significant
> slowdowns for SPEC FP and some for SPEC INT (I only did a train run
> sofar).  I'll re-run with ref input now
> and will post those numbers.
> 
If you continue to see slowdowns, could you check with either no AVX or with 
-mprefer-avx128? The occasional AVX256 instructions might be downclocking the 
CPU. But yes, that would be a problem for this change on its own.

'Allan




Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-29 Thread Allan Sandfeld Jensen
On Tuesday, 29 May 2018 16:57:56 CEST Richard Biener wrote:
>
> so the situation improves but isn't fully fixed (STLF issues maybe?)
> 

That raises the question of whether it helps in these cases even at -O3.

Anyway, it doesn't look good for it. Did the binary size at least improve with 
-mprefer-avx128, or was that also worse or insignificant?


'Allan




Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-31 Thread Allan Sandfeld Jensen
Okay, I think I can withdraw the suggestion. It apparently does not provide 
stable performance in the end.

I would like to end with sharing the measurements I made that motivated me to 
suggest the change. Hopefully they can be useful if tree-slp-vectorize gets 
improved and the suggestion comes up again.

As I said previously, the benchmarks I ran were not affected, probably because 
most things we benchmark in Qt are hand-optimized already, but the binary 
size with tree-slp-vectorize was on average one or two percent smaller, though 
this was not universal, and many smaller libraries were unaffected.



gcc-8 version 8.1.0 (Debian 8.1.0-4)
gcc-7 version 7.3.0 (Debian 7.3.0-19)

Qt version 5.11.0 (edited to override selective use of -O3)

  size  library

g++-7 -march=corei7 -O2
 8015632 libQt5Widgets.so.5.11.0
 6194288 libQt5Gui.so.5.11.0
  760016 libQt5DBus.so.5.11.0
 5603160 libQt5Core.so.5.11.0

g++-7 -march=corei7 -O2 -ftree-slp-vectorize
 8007440 libQt5Widgets.so.5.11.0
 6182000 libQt5Gui.so.5.11.0
  760016 libQt5DBus.so.5.11.0
 5603224 libQt5Core.so.5.11.0

g++-8 -O2
 8062520 libQt5Widgets.so.5.11.0
 6232160 libQt5Gui.so.5.11.0
  765584 libQt5DBus.so.5.11.0
 5848528 libQt5Core.so.5.11.0

g++-8 -O2 -ftree-slp-vectorize
 8058424 libQt5Widgets.so.5.11.0
 6219872 libQt5Gui.so.5.11.0
  769680 libQt5DBus.so.5.11.0
 5844560 libQt5Core.so.5.11.0

g++-8 -march=corei7 -O2
 8062520 libQt5Widgets.so.5.11.0
 6215584 libQt5Gui.so.5.11.0
  765584 libQt5DBus.so.5.11.0
 580 libQt5Core.so.5.11.0

g++-8 -march=corei7 -O2 -ftree-slp-vectorize
 8046136 libQt5Widgets.so.5.11.0
 6191008 libQt5Gui.so.5.11.0
  765584 libQt5DBus.so.5.11.0
 5840472 libQt5Core.so.5.11.0

g++-8 -march=haswell -O2
 8046136 libQt5Widgets.so.5.11.0
 6170408 libQt5Gui.so.5.11.0
  765584 libQt5DBus.so.5.11.0
 5852448 libQt5Core.so.5.11.0

g++-8 -march=haswell -O2 -ftree-slp-vectorize
 8046136 libQt5Widgets.so.5.11.0
 6158120 libQt5Gui.so.5.11.0
  765584 libQt5DBus.so.5.11.0
 5848480 libQt5Core.so.5.11.0

g++-8 -march=haswell -Os
 6990368 libQt5Widgets.so.5.11.0
 5030616 libQt5Gui.so.5.11.0
  624160 libQt5DBus.so.5.11.0
 4847056 libQt5Core.so.5.11.0

g++-8 -march=haswell -Os -ftree-slp-vectorize
 6986272 libQt5Widgets.so.5.11.0
 5018328 libQt5Gui.so.5.11.0
  624160 libQt5DBus.so.5.11.0
 4847120 libQt5Core.so.5.11.0

g++-8 -march=haswell -Os -flto
 6785760 libQt5Widgets.so.5.11.0
 4844464 libQt5Gui.so.5.11.0
  593488 libQt5DBus.so.5.11.0
 4688432 libQt5Core.so.5.11.0

g++-8 -march=haswell -Os -flto -ftree-slp-vectorize
 6777568 libQt5Widgets.so.5.11.0
 4836272 libQt5Gui.so.5.11.0
  593488 libQt5DBus.so.5.11.0
 4688472 libQt5Core.so.5.11.0





Re: O2 Agressive Optimisation by GCC

2018-07-20 Thread Allan Sandfeld Jensen
On Friday, 20 July 2018 14:19:12 CEST Umesh Kalappa wrote:
> Hi All ,
> 
> We are looking at the C sample i.e
> 
> extern int i,j;
> 
> int test()
> {
> while(1)
> {   i++;
> j=20;
> }
> return 0;
> }
> 
> command used :(gcc 8.1.0)
> gcc -S test.c -O2
> 
> the generated asm for x86
> 
> .L2:
> jmp .L2
> 
> we understand that,the infinite loop is not  deterministic ,compiler
> is free to treat as that as UB and do aggressive optimization ,but we
> need keep the side effects like j=20 untouched by optimization .
> 
> Please note that using the volatile qualifier for i and j  or empty
> asm("") in the while loop,will stop the optimizer ,but we don't want
> do  that.
> 
But you need to do that! If you want changes to a variable to be observable in 
another thread, you need to use volatile, atomics, or some kind of memory 
barrier, implicit or explicit. The same would apply even if the loop weren't 
infinite: the compiler would keep the value in a register during the loop and 
only write it to memory on exiting the test() function.
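To sketch what the atomics variant looks like (C11 <stdatomic.h>), here is the example rewritten; the loop is bounded here so it can actually run, but the point carries over to the infinite version:

```c
#include <stdatomic.h>

/* i and j from the example, now atomic: these accesses are
   observable side effects, so the optimizer may not keep them in
   registers across iterations or delete them the way it can with
   plain int stores. */
atomic_int i = 0;
atomic_int j = 0;

int test_atomic(int iterations)
{
    for (int n = 0; n < iterations; ++n) {
        atomic_fetch_add_explicit(&i, 1, memory_order_relaxed);
        atomic_store_explicit(&j, 20, memory_order_relaxed);
    }
    return atomic_load_explicit(&j, memory_order_relaxed);
}
```

memory_order_relaxed is the cheapest ordering; if a reader thread needs to see i and j updated in a consistent order, release/acquire pairs are the usual choice instead.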

'Allan




Re: O2 Agressive Optimisation by GCC

2018-07-20 Thread Allan Sandfeld Jensen
On Saturday, 21 July 2018 00:21:48 CEST Jonathan Wakely wrote:
> On Fri, 20 Jul 2018 at 23:06, Allan Sandfeld Jensen wrote:
> On Friday, 20 July 2018 14:19:12 CEST Umesh Kalappa wrote:
> > > Hi All ,
> > > 
> > > We are looking at the C sample i.e
> > > 
> > > extern int i,j;
> > > 
> > > int test()
> > > {
> > > while(1)
> > > {   i++;
> > > 
> > > j=20;
> > > 
> > > }
> > > return 0;
> > > }
> > > 
> > > command used :(gcc 8.1.0)
> > > gcc -S test.c -O2
> > > 
> > > the generated asm for x86
> > > 
> > > .L2:
> > > jmp .L2
> > > 
> > > we understand that,the infinite loop is not  deterministic ,compiler
> > > is free to treat as that as UB and do aggressive optimization ,but we
> > > need keep the side effects like j=20 untouched by optimization .
> > > 
> > > Please note that using the volatile qualifier for i and j  or empty
> > > asm("") in the while loop,will stop the optimizer ,but we don't want
> > > do  that.
> > 
> > But you need to do that! If you want changes to a variable to be
> > observable in another thread, you need to use either volatile,
> 
> No, volatile doesn't work for that.
> 
It does, but you shouldn't use it for that due to many other reasons (though 
the Linux kernel still does). But if the guy wants to code a primitive without 
using system calls or atomics, he might as well go traditional.

'Allan




Re: O2 Agressive Optimisation by GCC

2018-07-22 Thread Allan Sandfeld Jensen
On Sunday, 22 July 2018 17:01:29 CEST Umesh Kalappa wrote:
> Allan ,
> 
> >>he might as well go traditional
> 
> you mean using the locks ?
> 

No, I meant relying on undefined behavior. In your case I would recommend 
using modern atomics, which are defined behavior, modern, and fast. I was just 
reminded of all the nasty and theoretically wrong ways we used to do stuff 
like that 20 years ago to implement fallback locks. For instance using -O0, 
asm declarations, relying on non-inlined function calls as memory barriers, 
etc. All stuff that "worked", but relied on various degrees of undefined 
behavior.

Still, if you are curious, it might be fun playing with stuff like that and 
trying to figure out for yourself why it works; just remember it is undefined 
behavior and therefore not recommended.

'Allan




Vectorizing 16bit signed integers

2009-12-11 Thread Allan Sandfeld Jensen
Hi

I hope someone can help me. I've been trying to write some tight integer loops 
in a way that could be auto-vectorized, saving me from writing assembler or 
using specific vectorization extensions. Unfortunately I've not yet managed to 
make gcc vectorize any of them.

I've simplified the case to just perform the very first operation in the loop; 
converting from two's complement to sign-and-magnitude.

I've then used -ftree-vectorizer-verbose to examine whether, and if not why, 
the loops were vectorized, but I am afraid I don't understand the output.

The simplest version of the loop is here (it appears the branch is not a 
problem, but I have another version without it).

inline uint16_t transsign(int16_t v) {
if (v<0) {
return 0x8000U | (1-v);
} else {
return v;
}
}

It very simply converts in a fashion that maintains the full effective bit-
width.

The error from the vectorizer is:
vectorizesign.cpp:42: note: not vectorized: relevant stmt not supported: 
v.1_16 = (uint16_t) D.2157_11;

It appears the unsupported operation in vectorization is the typecast from 
int16_t to uint16_t; can this really be the case, or is the output misleading?

If it is the case, then is there good reason for it, or can I fix it myself by 
adding additional vectorizable operations?

I've attached both the test case and the full output of -ftree-vectorizer-verbose=9

Best regards
`Allan

#include <stdint.h>

inline uint16_t transsign1(int16_t v) {
// written with no control-flow to facilitate auto-vectorization
uint16_t sv = v >> 15; // signed left-shift gives a classic sign selector -1 or 0
sv = sv & 0x7FFFU; // never invert the sign-bit
return v ^ sv; // conditional invertion by xor
}

inline uint16_t transsign2(int16_t v) {
if (v<0) {
return 0x8000U | ~v;
} else {
return v;
}
}

inline uint16_t transsign3(int16_t v) {
if (v<0) {
return 0x8000U | (1-v);
} else {
return v;
}
}

// candidate for vectorizaton
void convertts1(uint16_t* out, int16_t* in, uint32_t len) {
for(unsigned int i=0;i<len;++i)
out[i] = transsign1(in[i]);
}

vectorizesign.cpp:28: note: = analyze_loop_nest =
vectorizesign.cpp:28: note: === vect_analyze_loop_form ===
vectorizesign.cpp:28: note: split exit edge.
vectorizesign.cpp:28: note: === get_loop_niters ===
vectorizesign.cpp:28: note: ==> get_loop_niters:len_3(D)
vectorizesign.cpp:28: note: Symbolic number of iterations is len_3(D)
vectorizesign.cpp:28: note: === vect_analyze_data_refs ===

vectorizesign.cpp:28: note: get vectype with 8 units of type short int
vectorizesign.cpp:28: note: vectype: vector short int
vectorizesign.cpp:28: note: get vectype with 8 units of type short unsigned int
vectorizesign.cpp:28: note: vectype: vector short unsigned int
vectorizesign.cpp:28: note: === vect_analyze_scalar_cycles ===
vectorizesign.cpp:28: note: Analyze phi: i_16 = PHI 

vectorizesign.cpp:28: note: Access function of PHI: {0, +, 1}_1
vectorizesign.cpp:28: note: step: 1,  init: 0
vectorizesign.cpp:28: note: Detected induction.
vectorizesign.cpp:28: note: Analyze phi: SMT.12_27 = PHI 

vectorizesign.cpp:28: note: === vect_pattern_recog ===
vectorizesign.cpp:28: note: vect_is_simple_use: operand i_16
vectorizesign.cpp:28: note: def_stmt: i_16 = PHI 

vectorizesign.cpp:28: note: type of def: 4.
vectorizesign.cpp:28: note: === vect_mark_stmts_to_be_vectorized ===
vectorizesign.cpp:28: note: init: phi relevant? i_16 = PHI 

vectorizesign.cpp:28: note: init: phi relevant? SMT.12_27 = PHI 

vectorizesign.cpp:28: note: init: stmt relevant? D.2120_5 = i_16 * 2;

vectorizesign.cpp:28: note: init: stmt relevant? D.2121_7 = out_6(D) + D.2120_5;

vectorizesign.cpp:28: note: init: stmt relevant? D.2122_10 = in_9(D) + D.2120_5;

vectorizesign.cpp:28: note: init: stmt relevant? D.2123_11 = *D.2122_10;

vectorizesign.cpp:28: note: init: stmt relevant? D.2124_12 = (int) D.2123_11;

vectorizesign.cpp:28: note: init: stmt relevant? D.2170_17 = D.2124_12 >> 15;

vectorizesign.cpp:28: note: init: stmt relevant? sv_18 = (uint16_t) D.2170_17;

vectorizesign.cpp:28: note: init: stmt relevant? sv_19 = sv_18 & 32767;

vectorizesign.cpp:28: note: init: stmt relevant? sv.0_20 = (short int) sv_19;

vectorizesign.cpp:28: note: init: stmt relevant? D.2167_21 = sv.0_20 ^ 
D.2123_11;

vectorizesign.cpp:28: note: init: stmt relevant? D.2166_22 = (uint16_t) 
D.2167_21;

vectorizesign.cpp:28: note: init: stmt relevant? *D.2121_7 = D.2166_22;

vectorizesign.cpp:28: note: vec_stmt_relevant_p: stmt has vdefs.
vectorizesign.cpp:28: note: mark relevant 4, live 0.
vectorizesign.cpp:28: note: init: stmt relevant? i_14 = i_16 + 1;

vectorizesign.cpp:28: note: init: stmt relevant? if (len_3(D) > i_14)

vectorizesign.cpp:28: note: worklist: examine stmt: *D.2121_7 = D.2166_22;

vectorizesign.cpp:28: note: vect_is_simple_use: operand D.2166_22
vectorizesign.cpp:28: note: def_stmt: D.2166_22 = (uint16_t) D.2167_21;

vectorizesign.cpp:28: note: type of def: 3.
vectorizes

Re: Summer of Code 2006

2006-04-18 Thread Allan Sandfeld Jensen
On Monday 17 April 2006 12:21, Jan-Benedict Glaw wrote:
> On Sun, 2006-04-16 21:30:08 -0700, Ian Lance Taylor  wrote:
>   * Trailing whitespace patrol.
find . -name "*\.[ch]" | xargs perl -pi -e's/\s*$/\n/'

`Allan


Non-consistent ICE in 14.1 and 14.2

2024-08-29 Thread Allan Sandfeld Jensen
Hi GCC

I wanted to report one or more bugs; unfortunately they are not consistently 
reproducible, which is odd. It happens when compiling the chromium part of 
qtwebengine after the update to gcc 14, and during development for updating 
Chromium to 126. On almost every run over a few thousand files, one or more 
files will crash with an ICE that goes away if you just build again. I was 
under the impression gcc was doing everything reproducibly, so this is really 
confusing.

The errors claim different random spots in the sources with either "internal 
compiler error: Segmentation fault" or "internal compiler error: in 
tree_node_structure_for_code, at tree:527"

How should I approach reporting this to get it fixed?

Best regards
Allan




Re: Non-consistent ICE in 14.1 and 14.2

2024-08-29 Thread Allan Sandfeld Jensen
On Thursday 29 August 2024 14:38:04 Central European Summer Time Alexander 
Monakov wrote:
> On Thu, 29 Aug 2024, Richard Biener via Gcc wrote:
> > On Thu, Aug 29, 2024 at 1:49 PM Allan Sandfeld Jensen
> > 
> >  wrote:
> > > Hi GCC
> > > 
> > > I wanted to report one or more bugs, unfortunately they are not
> > > consistently reproducible, which is odd. It happens when compiling the
> > > chromium part of qtwebengine after the update to gcc 14 and during
> > > development for updating Chromium to 126. On almost every run over a
> > > few thousand files one or more files will crash with an ICE, that goes
> > > away if you just build again. I was under the impression gcc was doing
> > > everything reproducibly, so this is really confusing.
> > > 
> > > The errors claim different random spots in the sources with either
> > > "internal compiler error: Segmentation fault" or "internal compiler
> > > error: in tree_node_structure_for_code, at tree:527"
> > > 
> > > How should I approach reporting this to get it fixed?
> > 
> > It sounds you might got faulty memory in your system.
> 
> Allan, what CPU is that on? If Intel 13th or 14th gen, that's not entirely
> unexpected, unfortunately, due to voltage management issues (or
> manufacturing, on some earlier samples).
> 
It is an AMD Ryzen 9 5950X, so not one that is built to fail. I will check 
for hardware flaws anyway, though, and report back if this happens on 
different hardware.

Best regards
Allan





Re: Macro for C++14 support

2013-04-23 Thread Allan Sandfeld Jensen
On Sunday 21 April 2013, Jonathan Wakely wrote:
> I'm starting to implement some new library features voted into C++14
> at the Bristol meeting and am wondering what feature check to use.
> 
> Will there be a macro like _GXX_EXPERIMENTAL_CXX1Y__ to correspond to
> -std=c++1y?
> 
> Alternatively we could set the value of __cplusplus to 201400L but I'm
> not sure that's strictly allowed.

Isn't C++14 only an update of the standard library, not the language? And 
should that affect how GCC treats it?

If that is the case (I could have missed something), would it be possible to 
include it under C++11 support instead of having users update their GCC 
switches just to link to a libstdc++ with slightly more features?
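With hindsight, the scheme eventually adopted answers both questions: the language level is reported via __cplusplus (201402L for C++14), and individual library features got dedicated SD-6 feature-test macros such as __cpp_lib_make_unique. A sketch of how a portable check looks under that scheme (not what was available at the time of this thread):

```cpp
#include <memory>

// Portable helper: use std::make_unique when the library provides it
// (signalled by the C++14 language level or by the SD-6 macro
// __cpp_lib_make_unique), otherwise fall back to plain new.
// static_cast<Args&&> is the forwarding idiom, equivalent to
// std::forward without needing <utility>.
template <class T, class... Args>
std::unique_ptr<T> make_unique_compat(Args&&... args)
{
#if defined(__cpp_lib_make_unique) || __cplusplus >= 201402L
    return std::make_unique<T>(static_cast<Args&&>(args)...);
#else
    return std::unique_ptr<T>(new T(static_cast<Args&&>(args)...));
#endif
}
```

Either branch compiles from C++11 up, so callers never have to care which standard mode the translation unit was built in.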

Best regards
`Allan


Optimizing bit extract

2014-02-14 Thread Allan Sandfeld Jensen
Hello gcc

I have been looking at optimizations of pixel-format conversion recently and 
have noticed that gcc does not take advantage of the SSE4a extrq, BMI1 bextr, 
TBM bextri, or BMI2 pext instructions when they could be useful.

As far as I can tell it should not be that hard. A bextr expression can 
typically be recognized as ((x >> s) & mask) or ((x << s1) >> s2). But I am 
unsure where to do such a matching, since the mask needs to have a specific 
form to be valid for bextr, so it seems it needs to be done before instruction 
selection.

Secondly, the bextr instruction in itself only replaces two already fast 
instructions, so the gain is very minor (unless extracting variable bit-fields, 
which is harder to recognize). The real optimization comes from being able to 
use pext (parallel bit extract), which can implement several bextr expressions 
in parallel.

So, where would be the right place to implement such instructions? Would it 
make sense to recognize bextr early, before we get to i386 code, or would it 
be better to recognize it late? And where do I put such instruction-selection 
optimizations?

Motivating example:

unsigned rgb32_to_rgb16(unsigned rgb32) {
    unsigned char red = (rgb32 >> 19) & 0x1f;
    unsigned char green = (rgb32 >> 10) & 0x3f;
    unsigned char blue = rgb32 & 0x1f;
    return (red << 11) | (green << 5) | blue;
}

can be implemented as pext(rgb32, 0x00f8fc1f)
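Since pext may be unfamiliar: below is a plain-C model of what BMI2 pext computes (the hardware version is _pext_u32 from <immintrin.h>). The helper is mine, for illustration. With the shift/mask layout of the example above (red in bits 19-23, green in bits 10-15, blue in bits 0-4), the matching extraction mask is 0x00f8fc1f.

```c
#include <stdint.h>

/* Software model of BMI2 pext: gather the bits of src selected by
   mask into the low bits of the result, lowest mask bit first. */
static uint32_t pext32(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    uint32_t out_bit = 1;
    while (mask != 0) {
        uint32_t lowest = mask & -mask;  /* isolate lowest set bit */
        if (src & lowest)
            result |= out_bit;
        out_bit <<= 1;
        mask &= mask - 1;                /* clear lowest set bit */
    }
    return result;
}
```

For any rgb32 value, pext32(rgb32, 0x00f8fc1fu) then yields the same 16-bit result as the shift-and-or sequence in rgb32_to_rgb16 above: the three shift/mask/or groups collapse into the one parallel extract.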

Best regards
`Allan Sandfeld


Problems with pragma and attribute optimize.

2012-07-25 Thread Allan Sandfeld Jensen
Hi,

I have been experimenting with marking specific functions to be auto-
vectorized in GCC, but have had problems getting it to work.

It seems the optimize attribute works sometimes, but only if the function it 
is used on is not static, while pragma optimize never seems to work.

See the attached test case. If you compile it with -ftree-vectorizer-verbose, 
you will see that only the first function is vectorized; the last two are not.

Anyone know what is wrong here?

Best regards
`Allan
#include <stdint.h>

void  __attribute__((optimize("tree-vectorize"))) innerloop_1(int16_t* destination, const int16_t* source1, const int16_t* source2, int length)
{
while (length--) {
*(destination++) = *(source1++) + *(source2++);
}
}

static void  __attribute__((optimize("tree-vectorize"))) innerloop_2(int16_t* destination, const int16_t* source1, const int16_t* source2, int length)
{
while (length--) {
*(destination++) = *(source1++) + *(source2++);
}
}

void caller(int16_t* destination, const int16_t* source1, const int16_t* source2, int length)
{
innerloop_2(destination, source1, source2, length);
}

#pragma GCC optimize("tree-vectorize")

void innerloop_3(int16_t* destination, const int16_t* source1, const int16_t* source2, int length)
{
while (length--) {
*(destination++) = *(source1++) + *(source2++);
}
}

Re: Problems with pragma and attribute optimize.

2012-07-25 Thread Allan Sandfeld Jensen
On Wednesday 25 July 2012, Richard Guenther wrote:
> On Wed, Jul 25, 2012 at 2:23 PM, Allan Sandfeld Jensen
> 
>  wrote:
> > Hi,
> > 
> > I have been experimenting with marking specific functions to be auto-
> > vectorized in GCC, but have had problems getting it to work.
> > 
> > It seems the optimize attribute works sometimes, but only if the function
> > it is used on is not static, but pragma optimize never seems to work.
> > 
> > See the attached test-case. If you compile it with
> > -ftree-vectorizer-verbose, you will see that only the first function is
> > vectorized, but the two last are not.
> > 
> > Anyone know what is wrong here?
> 
> The attribute doesn't work in the face of inlining and generally was
> designed for debugging, not for controlling things like you do.
> 

In that case the GCC manual should probably be updated to reflect that. If 
what you say is true, it seems it has been developed for one purpose but then 
documented for another. (The documentation needs updating anyway, since the 
attribute is no longer allowed after the function declaration as all the 
examples do, but only before.)

I found the problem with the pragma, though: it is apparently a long-standing 
bug, PR 48026, with a related version in PR 41201. The latter bug actually has 
a patch for the problem; it is apparently caused by an incorrect short-cut.

`Allan


Re: Problems with pragma and attribute optimize.

2012-07-30 Thread Allan Sandfeld Jensen
On Wednesday 25 July 2012, Richard Guenther wrote:
> On Wed, Jul 25, 2012 at 4:25 PM, Allan Sandfeld Jensen
> 
>  wrote:
> > On Wednesday 25 July 2012, Richard Guenther wrote:
> >> On Wed, Jul 25, 2012 at 2:23 PM, Allan Sandfeld Jensen
> >> 
> >>  wrote:
> >> > Hi,
> >> > 
> >> > I have been experimenting with marking specific functions to be auto-
> >> > vectorized in GCC, but have had problems getting it to work.
> >> > 
> >> > It seems the optimize attribute works sometimes, but only if the
> >> > function it is used on is not static, but pragma optimize never seems
> >> > to work.
> >> > 
> >> > See the attached test-case. If you compile it with
> >> > -ftree-vectorizer-verbose, you will see that only the first function
> >> > is vectorized, but the two last are not.
> >> > 
> >> > Anyone know what is wrong here?
> >> 
> >> The attribute doesn't work in the face of inlining and generally was
> >> designed for debugging, not for controlling things like you do.
> > 
> > In that case the GCC manual should probably be updated to reflect that.
> > If what you say it true, it seems it has been developed for one purpose
> > but then documented for another. (The documentation needs updating
> > anyway, since the attribute is no longer allowed after the function
> > declaration like all the examples does, but only before).
> > 
> > I found the problem with pragma though, it is apparently a long standing
> > bug in PR 48026 and a related version in PR 41201. The last bug actually
> > has a patch for the problem, it is apparently caused by an incorrect
> > short-cut.
> 
> CCing the original author.
> 
Cool.

Anyway, for my use case of these flags, you can see: 
https://bugs.webkit.org/show_bug.cgi?id=92249

It uses the optimize attribute currently, because pragma didn't work, and 
forces noinline so that the attribute is not lost on inlining.

`Allan