[Bug fortran/78611] -march=native makes code 3x slower

2016-11-30 Thread pepalogik at seznam dot cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78611

--- Comment #11 from Jan Lachnitt  ---
Thank you all for a rapid investigation of the problem.

Here is a confirmation with the large test case:

jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1 $ gfortran-6 phsh1.f -std=legacy
-I. -march=core-avx-i -o core-avx-i/phsh1
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1 $ cd core-avx-i/
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1/core-avx-i $ time ./phsh1 <
../bmtz
 Slab or Bulk calculation?
 input 1 for Slab or 0 for Bulk
 Input the MTZ value from the substrate calculation

real    221m0.225s
user    220m52.488s
sys     0m4.488s
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1/core-avx-i $ rm check.o mufftin.d 
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1/core-avx-i $ LD_BIND_NOW=1 time
./phsh1 < ../bmtz
 Slab or Bulk calculation?
 input 1 for Slab or 0 for Bulk
 Input the MTZ value from the substrate calculation
4512.06user 1.50system 1:15:16elapsed 99%CPU (0avgtext+0avgdata
7296maxresident)k
23408inputs+34424outputs (7major+1219minor)pagefaults 0swaps


Really, LD_BIND_NOW=1 does wonders :-).

https://sourceware.org/bugzilla/show_bug.cgi?id=20495#c8 suggests building with
"-Wl,-z,now" (I suppose this does the same thing as LD_BIND_NOW=1). Can it be
used as a general workaround until glibc 2.25 is available?

[Bug fortran/78611] -march=native makes code 3x slower

2016-11-30 Thread pepalogik at seznam dot cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78611

--- Comment #4 from Jan Lachnitt  ---
Small test case with -march=core-avx-i:
real    0m1.300s
user    0m1.296s
sys     0m0.000s

I.e., reproduced.

[Bug fortran/78611] -march=native makes code 3x slower

2016-11-30 Thread pepalogik at seznam dot cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78611

--- Comment #1 from Jan Lachnitt  ---
Created attachment 40200
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40200&action=edit
Smaller test case

Here is a smaller test case, which runs for only a second, not hours.

without -march=native:
real    0m0.610s
user    0m0.560s
sys     0m0.000s

with -march=native:
real    0m1.271s
user    0m1.268s
sys     0m0.000s

[Bug fortran/78611] New: -march=native makes code 3x slower

2016-11-30 Thread pepalogik at seznam dot cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78611

Bug ID: 78611
   Summary: -march=native makes code 3x slower
   Product: gcc
   Version: 6.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: fortran
  Assignee: unassigned at gcc dot gnu.org
  Reporter: pepalogik at seznam dot cz
  Target Milestone: ---

Created attachment 40199
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40199&action=edit
Source code, include files, and inputs

Hi,

I encountered the problem in version 5.4.0, then installed 6.2.0, and it's
still the same. Details below and test case attached.

jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1 $ gfortran-6 -v
Using built-in specs.
COLLECT_GCC=gfortran-6
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/6/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
6.2.0-3ubuntu11~16.04' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-6 --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib
--disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-amd64/jre --enable-java-home
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-amd64
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-amd64
--with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar
--enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686
--with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib
--with-tune=generic --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 6.2.0 20160901 (Ubuntu 6.2.0-3ubuntu11~16.04)
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1 $ gfortran-6 phsh1.f -std=legacy
-I. -o default/phsh1
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1 $ cd default/
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1/default $ time ./phsh1 < ../bmtz
 Slab or Bulk calculation?
 input 1 for Slab or 0 for Bulk
 Input the MTZ value from the substrate calculation

real    72m51.345s
user    72m48.584s
sys     0m0.968s
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1/default $ cd ..
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1 $ gfortran-6 phsh1.f -std=legacy
-I. -march=native -o march/phsh1
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1 $ cd march/
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1/march $ time ./phsh1 < ../bmtz
 Slab or Bulk calculation?
 input 1 for Slab or 0 for Bulk
 Input the MTZ value from the substrate calculation

real    217m56.080s
user    217m52.092s
sys     0m1.096s


As shown, code compiled with -march=native is 3x slower. All outputs are
identical, so it is solely a performance issue. Adding -O3 isn't very helpful.
My CPU is Intel(R) Core(TM) i3-3217U CPU @ 1.80GHz with these flags:
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc
arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu
pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid
sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer xsave avx f16c lahf_lm epb
tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida
arat pln pts

The code is an old, single-threaded F77 program that calculates crystal
potentials. A profiler shows that almost all the time is spent in the
subroutine MTZ.

[Bug fortran/52621] ICE when compiling Fortran77 code with optimization

2012-03-21 Thread pepalogik at seznam dot cz
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52621

--- Comment #3 from Jan Lachnitt pepalogik at seznam dot cz 2012-03-21 12:18:59 UTC ---
Thanks for testing and for the link to GFortranBinaries. I have just installed
the very recent unofficial build of gfortran:
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=c:/program files/gfortran/bin/../libexec/gcc/i586-pc-mingw32/4.8.0/lto-wrapper.exe
Target: i586-pc-mingw32
Configured with: ../gcc-trunk/configure --prefix=/mingw
--enable-languages=c,fortran --with-gmp=/home/brad/gfortran/dependencies
--disable-werror --enable-threads --disable-nls --build=i586-pc-mingw32
--enable-libgomp --enable-shared --disable-win32-registry --with-dwarf2
--disable-sjlj-exceptions --enable-lto
Thread model: win32
gcc version 4.8.0 20120319 (experimental) [trunk revision 185521] (GCC)

The result is that the ICE is still there. There are just two changes: first,
there are some more warnings, and second, the ICE is reported at a different
line within the GCC source: tree-data-ref.c:1964.


[Bug fortran/52621] New: ICE when compiling Fortran77 code with optimization

2012-03-19 Thread pepalogik at seznam dot cz
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52621

 Bug #: 52621
   Summary: ICE when compiling Fortran77 code with optimization
Classification: Unclassified
   Product: gcc
   Version: 4.6.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: fortran
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: pepalo...@seznam.cz


Created attachment 26920
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26920
Library source producing the ICE

I am compiling an old Fortran 77 code on Windows XP. I have fixed this code so
that it basically works with both the FTN95 (Silverfrost) and gfortran
compilers. But when I try to make a highly optimized build with gfortran, I get
an ICE.

Compiler version:
C:\MinGW\bin>gfortran.exe -v
Using built-in specs.
COLLECT_GCC=gfortran.exe
COLLECT_LTO_WRAPPER=c:/mingw/bin/../libexec/gcc/mingw32/4.6.1/lto-wrapper.exe
Target: mingw32
Configured with: ../gcc-4.6.1/configure
--enable-languages=c,c++,fortran,objc,obj-c++ --disable-sjlj-exceptions
--with-dwarf2 --enable-shared --enable-libgomp --disable-win32-registry
--enable-libstdcxx-debug --enable-version-specific-runtime-libs
--build=mingw32 --prefix=/mingw
Thread model: win32
gcc version 4.6.1 (GCC)

Command:
gfortran.exe -std=legacy -march=native -mfpmath=sse -m3dnow -mmmx -msse -msse2
-msse3 -O3 -Wall -c D:\Jenda\cbp\SATLEED\LEEDSATL_SB\leedsatl_sb.f -o
obj\Release\leedsatl_sb.o

CPU: AMD Athlon x2, see
http://www.cpu-world.com/CPUs/K8/AMD-Athlon%20X2%204850e%20-%20ADH4850IAA5DO%20(ADH4850DOBOX).html

Important: The ICE is gone if I decrease the optimization level to -O2 or
exclude the machine-specific options (from -march to -msse3).

The code and output are attached.

Copyright note: The code comes from
http://www.ap.cityu.edu.hk/personal-website/Van-Hove_files/leed/leedpack.html
and I am actually not allowed to distribute it.


[Bug fortran/52621] ICE when compiling Fortran77 code with optimization

2012-03-19 Thread pepalogik at seznam dot cz
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52621

--- Comment #1 from Jan Lachnitt pepalogik at seznam dot cz 2012-03-19 
16:13:22 UTC ---
Created attachment 26921
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26921
Compiler output


[Bug c++/35159] g++ and gfortran inoperable with no error message

2008-09-21 Thread pepalogik at seznam dot cz


--- Comment #22 from pepalogik at seznam dot cz  2008-09-21 15:02 ---
I'm probably not the one who'll find the core of the bug but I'd like to
mention two simple facts:
1: mingw-w64-bin_i686-mingw_20080707   WORKS
2: mingw-w64-bin_x86_64-mingw_20080724 DOESN'T WORK
(Vista64 SP1)

I don't use it currently so I haven't tried new versions.

Btw. I think it's GCC v. 4.4.0 (experimental) instead of 4.3.0.


-- 

pepalogik at seznam dot cz changed:

   What|Removed |Added

 CC||pepalogik at seznam dot cz


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35159



[Bug rtl-optimization/323] optimized code gives strange floating point results

2008-06-24 Thread pepalogik at seznam dot cz


--- Comment #117 from pepalogik at seznam dot cz  2008-06-24 20:12 ---
(In reply to comment #116)
> > Yes, but this requires quite a complicated workaround (solution (4) in my
> > comment #109).
>
> The problem is on the compiler side, which could store every result of a cast
> or an assignment to memory (this is inefficient, but that's what you get with
> the x87, and the ISO C language could be blamed too for *requiring* something
> like that instead of being more flexible).
>
> > So you could say that the IEEE754 double precision type is available even on
> > a processor without any FPU because this can be emulated using integers.
>
> Yes, but a conforming implementation would be the processor + a library, not
> just the processor with its instruction set.
>
> > Moreover, if we assess things pedantically, the workaround (4) still doesn't
> > fully obey the IEEE single/double precision type(s), because there remains the
> > problem of double rounding of denormals.
>
> As I said, in this particular case (underflow/overflow), double rounding is
> allowed by the IEEE standard. It may not be allowed by some languages (e.g.
> XPath, and Java in some mode) for good or bad reasons, but this is another
> problem.

OK, thanks for the explanation. I think it's clear now.

> > I quote, too:
> > Applies To
> >    Microsoft#174; Visual C++#174;
>
> Now I assume that it follows the MS-Windows API (though nothing is certain with
> Microsoft). And the other compilers under MS-Windows could (or should) do the
> same thing.

By a lucky hit, I have found this in the GCC documentation:

-mpc32
-mpc64
-mpc80
Set 80387 floating-point precision to 32, 64 or 80 bits. When '-mpc32' is
specified, the significands of results of floating-point operations are rounded
to 24 bits (single precision); '-mpc64' rounds the significands of results of
floating-point operations to 53 bits (double precision) and '-mpc80' rounds the
significands of results of floating-point operations to 64 bits (extended
double precision), which is the default. When this option is used,
floating-point operations in higher precisions are not available to the
programmer without setting the FPU control word explicitly.
[...]

So GCC sets extended precision by default. And it's easy to change it.
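
For illustration, here is a minimal C sketch (my own, not from the GCC
documentation; it assumes an x86 target with glibc, whose <fpu_control.h>
provides the _FPU_* macros) that reads the x87 precision-control bits at run
time, so you can see which precision a program is actually running with:

/* pc_check.c - a minimal sketch, assuming an x86 target with glibc
 * (<fpu_control.h> and the _FPU_* macros are glibc-specific).
 * Build e.g. with: gcc -m32 pc_check.c -o pc_check */
#include <stdio.h>
#include <fpu_control.h>

int main(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);                    /* read the x87 control word */

    switch (cw & _FPU_EXTENDED) {      /* _FPU_EXTENDED masks both PC bits */
    case _FPU_EXTENDED:
        puts("x87 precision control: 64-bit significand (extended, the default)");
        break;
    case _FPU_DOUBLE:
        puts("x87 precision control: 53-bit significand (double)");
        break;
    case _FPU_SINGLE:
        puts("x87 precision control: 24-bit significand (single)");
        break;
    default:
        puts("x87 precision control: reserved setting");
    }
    return 0;
}

If -mpc64 indeed works by lowering this setting at program startup (my
assumption from the quoted text), the same program rebuilt with -mpc64 should
report the 53-bit case.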


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323



[Bug rtl-optimization/323] optimized code gives strange floating point results

2008-06-22 Thread pepalogik at seznam dot cz


--- Comment #114 from pepalogik at seznam dot cz  2008-06-22 16:59 ---
(In reply to comment #113)
> It is available when storing a result to memory.

Yes, but this requires quite a complicated workaround (solution (4) in my
comment #109). So you could say that the IEEE754 double precision type is
available even on a processor without any FPU because this can be emulated
using integers.
Moreover, if we assess things pedantically, the workaround (4) still doesn't
fully obey the IEEE single/double precision type(s), because there remains the
problem of double rounding of denormals.

> The IEEE754-1985 allows this. Section 4.3: Normally, a result is rounded to
> the precision of its destination. However, some systems deliver results only to
> double or extended destinations. On such a system the user, which may be a
> high-level language compiler, shall be able to specify that a result be rounded
> instead to single precision, though it may be stored in the double or extended
> format with its wider exponent range. [...]
> [...]
> AFAIK, the IEEE754-1985 standard was designed from the x87
> implementation, so it would have been very surprising that x87 didn't conform
> to IEEE754-1985.

So it seems I was wrong but the IEEE754-1985 standard is also quite wrong.

> > Do you mean that on Windows, long double has (by default) no more precision
> > than double? I don't think so (it's confirmed by my experience).
> I don't remember my original reference, but here's a new one:
>   http://msdn.microsoft.com/en-us/library/aa289157(vs.71).aspx
> In fact, this depends on the architecture. I quote: x86. Intermediate
> expressions are computed at the default 53-bit precision with an extended range
> [...]

I quote, too:
Applies To
   Microsoft#174; Visual C++#174;


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323



[Bug rtl-optimization/323] optimized code gives strange floating point results

2008-06-22 Thread pepalogik at seznam dot cz


--- Comment #115 from pepalogik at seznam dot cz  2008-06-22 17:28 ---
That #174; should be (R).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323



[Bug rtl-optimization/323] optimized code gives strange floating point results

2008-06-21 Thread pepalogik at seznam dot cz


--- Comment #112 from pepalogik at seznam dot cz  2008-06-21 22:38 ---
(In reply to comment #111)
> Concerning the standards: The x87 FPU does obey the IEEE754-1985 standard,
> which *allows* extended precision, and double precision is *available*.

It's true that double *precision* is available on the x87. But the *IEEE-754
double precision type* is not: besides the precision of the mantissa, it also
includes the range of the exponent. On the x87, it is possible to set the
precision of the mantissa but not the range of the exponent. That's why I
believe it doesn't obey the IEEE standard. (I have never seen the IEEE-754
standard itself, but I base this on the work of David Monniaux.)

> Note: the solution chosen by some OS'es (*BSD, MS-Windows...) is to configure
> the processor to the IEEE double precision by default (thus long double is
> also in double precision, but this is OK as far as the C language is concerned,
> there's still a problem with float, but in practice, nobody cares AFAIK).

Do you mean that on Windows, long double has (by default) no more precision
than double? I don't think so (it's confirmed by my experience). According to
the paper of David Monniaux, only FreeBSD 4 sets double precision by default
(but I know almost nothing about BSD).
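
To check this kind of claim on a given system, here is a minimal C sketch (my
own illustration, not part of the discussion above); it probes at run time
whether long double arithmetic actually carries more significand bits than
double, which is what the default x87 precision-control setting determines:

/* ld_probe.c - a minimal sketch for probing the effective precision of
 * long double arithmetic at run time (illustration only). */
#include <stdio.h>

int main(void)
{
    volatile long double one  = 1.0L;
    /* 2^-60 fits in a 64-bit significand next to 1.0, but is lost when the
     * addition is rounded to 53 bits. */
    volatile long double tiny = 0x1p-60L;

    if (one + tiny != one)
        puts("long double arithmetic keeps more than 53 significand bits");
    else
        puts("long double arithmetic is limited to double precision here");
    return 0;
}

On a Linux/x86 system with the default extended setting this should take the
first branch; on a system that configures the FPU to double precision (or where
long double simply is double) it should take the second.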

> > (1) A very simple solution: Use long double everywhere.
> This avoids the bug, but this is not possible for software that requires double
> precision exactly, e.g. XML tools that use XPath.

Yes, of course. I don't say this can be used everywhere.

> > (But be careful when transferring binary data in long double format between
> > computers because this format is not standardized and so the concrete bit
> > representations vary between different CPU architectures.)
> Well, this is not specific to long double anyway: there exist 3 possible
> endianess for the double format (x86, PowerPC, ARM).

OK, but David Monniaux mentions portability issues only in the case of long
double, so the differences are probably more frequent in that case (maybe even
within the x86 architecture).

> Yes, but note that this is not the only problem with compilers. See e.g.
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36578

Thanks for info.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323



[Bug rtl-optimization/323] optimized code gives strange floating point results

2008-06-12 Thread pepalogik at seznam dot cz


--- Comment #110 from pepalogik at seznam dot cz  2008-06-12 14:14 ---
I used an old version of GCC documentation so I omitted some new processors
with SSE: core2, k8-sse3, opteron-sse3, athlon64-sse3, amdfam10 and barcelona.
I think you can use -march=pentium3 for all Intel's CPUs (of course, starting
with P3). I'm unsure about AMD. (Maybe you know it better.)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323



[Bug rtl-optimization/323] optimized code gives strange floating point results

2008-05-20 Thread pepalogik at seznam dot cz


--- Comment #109 from pepalogik at seznam dot cz  2008-05-20 16:59 ---
I also encountered such problems and was going to report them as a bug in
GCC... But the GCC bug (not-)reporting guide fortunately links to this page,
and here (comment #96) there is a link to David Monniaux's paper about
floating-point computations. The paper explains the issue thoroughly, but it is
perhaps too long. I have read most of it and hope I have understood it
properly. So I'll give a brief explanation (for those who don't know it yet) of
the reasons for this strange behaviour, then assess where the bug actually is
(in GCC or the CPU), then give the solution (!), and finally a few
recommendations to the GCC team.

EXPLANATION
The x87 FPU was originally designed in (or before) 1980. I think that's why it
is quite simple: it has only one unit for all FP data types. Of course, its
precision must be that of the widest type, which is the 80-bit long double.
Suppose you have a program where all the FP variables are of type double.
You perform some FP operations, and one of them is e.g. 1e-300/1e300, which
results in 1e-600. Although this value cannot be held by a double, it is
stored in an 80-bit FPU register as the result. Suppose the variable x holds
that result. If the program has been compiled with optimization, the value need
not be stored in RAM; say it is still in the register. You need x to be
nonzero, so you perform the test x != 0. Since 1e-600 is not zero, the test
yields true. While you perform some other computations, the value is moved to
RAM and converted to 0, because x is of type double. Now you want to use your
certainly nonzero x... hard luck :-(
Note that if the result doesn't have a corresponding variable and you perform
the test directly on an expression, the problem can show up even without
optimization.
It might seem that performing all FP operations in extended precision brings
only benefits. But it introduces a serious pitfall: moving a value may change
the value!
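
Here is a minimal C sketch of this scenario (my own illustration; whether the
two tests really disagree depends on the compiler, optimization level and
target, because it requires the first test to see the 80-bit register value and
the later read to see the value narrowed in memory):

/* x87_surprise.c - sketch of the scenario described above; try e.g.
 * gcc -m32 -O2 x87_surprise.c (x87 code, no -mfpmath=sse, no -ffloat-store). */
#include <stdio.h>

volatile double a = 1e-300;   /* volatile so the division is not folded away */
volatile double b = 1e+300;   /* at compile time                             */

int main(void)
{
    double x = a / b;          /* mathematically 1e-600: too small for a double */

    if (x != 0.0)                              /* may test the 80-bit register */
        puts("first test:  x is nonzero");

    volatile double stored = x;                /* force x through a 64-bit slot */
    if (stored == 0.0)
        puts("second test: x is zero");        /* the 'same' value, now 0.0 */

    return 0;
}

If both lines are printed, the value changed merely by being moved from the
register to memory, which is exactly the pitfall described above.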

WHERE'S THE BUG
This is really not a GCC bug. The bug is actually in the x87 FPU because it
doesn't obey the IEEE standard.

SOLUTION
The x87 FPU is still present in contemporary processors (including AMD) for
compatibility, and I think most PC software still uses it. But newer processors
also have another FPU, called SSE, and this one does obey the IEEE standard.
GCC in 32-bit mode compiles for the x87 by default, but it is able to compile
for SSE, too. So the solution is to add these options to the compilation
command:
-march=* -msse -mfpmath=sse
Yes, this definitely resolves the problem - but not for all processors. The *
can be one of the following: pentium3, pentium3m, pentium-m, pentium4,
pentium4m, prescott, nocona, athlon-4, athlon-xp, athlon-mp, k8, opteron,
athlon64, athlon-fx and c3-2 (I'm unsure about athlon and athlon-tbird).
Besides -msse, you can also add some of -mmmx, -msse2, -msse3 and -m3dnow, if
the CPU supports them (see the GCC doc or the CPU doc).
If you wish to compile for processors which don't have SSE, you have a few
possibilities:
(1) A very simple solution: Use long double everywhere. (But be careful when
transferring binary data in long double format between computers, because this
format is not standardized and so the concrete bit representations vary between
different CPU architectures.)
(2) A partial but simple solution: Do comparisons on volatile variables only.
(3) A similar solution: Try to implement a discard_extended_precision
function as suggested by Egon in comment #88.
(4) A complex solution: Before doing any mathematical operation or comparison,
put the operands into variables and also put the result into a variable (i.e.
don't use compound expressions). For example, instead of { c = 2*(a+b); } ,
write { double s = a+b; c = 2*s; } . I'm unsure about arrays, but I think they
should be OK. When you have modified your code in this manner, compile it
either without optimization or, when using optimization, with -ffloat-store. In
order to avoid double rounding (i.e. rounding twice), it is also good to
decrease the FPU precision by changing its control word at the beginning of
your program (see comment #60 and the sketch after this list). Then you should
also apply -frounding-math.
(5) A radical solution: Find a job/hobby where computers are not used at all.
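
Here is a minimal C sketch of workaround (4) combined with lowering the FPU
precision through the control word (my own illustration; it assumes an x86
target with glibc, whose <fpu_control.h> provides the _FPU_* macros, and it
does not remove the remaining exponent-range / denormal issue mentioned
earlier):

/* workaround4.c - sketch of workaround (4): no compound expressions, plus
 * the x87 control word lowered to 53-bit precision at program start.
 * Compile with optimization and -ffloat-store, e.g.:
 *   gcc -m32 -O2 -ffloat-store -frounding-math workaround4.c */
#include <stdio.h>
#include <fpu_control.h>

static void set_double_precision(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;   /* clear PC bits, set 53-bit */
    _FPU_SETCW(cw);
}

int main(void)
{
    set_double_precision();

    double a = 0.1, b = 0.2, c;

    /* Instead of c = 2*(a+b): every intermediate result goes through a named
     * double, which -ffloat-store forces out to memory. */
    double s = a + b;
    c = 2 * s;

    printf("c = %.17g\n", c);
    return 0;
}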

RECOMMENDATIONS
I think this problem is really serious and general. Therefore, programmers
should be warned soon enough.
This recommendation should be addressed especially to authors of programming
coursebooks. But I think there could also be a paragraph about it in the GCC
documentation (I haven't read it wholly but it doesn't seem there's any warning
against x87). And, of course, there should be a warning in the bug reporting
guide (http://gcc.gnu.org/bugs.html). It's fine there's a link to this page
(Bug 323) but the example with (int)(a/b) is insufficient. It only demonstrates
that real numbers are often not represented exactly in the computer. It doesn't
demonstrate