[Bug fortran/78611] -march=native makes code 3x slower
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78611 --- Comment #11 from Jan Lachnitt --- Thank you all for a rapid investigation of the problem. Here is a confirmation with the large test case:

jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1 $ gfortran-6 phsh1.f -std=legacy -I. -march=core-avx-i -o core-avx-i/phsh1
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1 $ cd core-avx-i/
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1/core-avx-i $ time ./phsh1 < ../bmtz
Slab or Bulk calculation? input 1 for Slab or 0 for Bulk
Input the MTZ value from the substrate calculation

real    221m0.225s
user    220m52.488s
sys     0m4.488s

jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1/core-avx-i $ rm check.o mufftin.d
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1/core-avx-i $ LD_BIND_NOW=1 time ./phsh1 < ../bmtz
Slab or Bulk calculation? input 1 for Slab or 0 for Bulk
Input the MTZ value from the substrate calculation
4512.06user 1.50system 1:15:16elapsed 99%CPU (0avgtext+0avgdata 7296maxresident)k
23408inputs+34424outputs (7major+1219minor)pagefaults 0swaps

Really, LD_BIND_NOW=1 does wonders :-) . https://sourceware.org/bugzilla/show_bug.cgi?id=20495#c8 suggests building with "-Wl,-z,now" (I suppose this does the same as LD_BIND_NOW=1). Can it be used as a general workaround before glibc 2.25 is available?
[Bug fortran/78611] -march=native makes code 3x slower
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78611 --- Comment #4 from Jan Lachnitt --- Small test case with -march=core-avx-i:

real    0m1.300s
user    0m1.296s
sys     0m0.000s

I.e., reproduced.
[Bug fortran/78611] -march=native makes code 3x slower
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78611 --- Comment #1 from Jan Lachnitt --- Created attachment 40200 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40200&action=edit Smaller test case

Here is a smaller test case, which runs for a second only, not hours.

Without -march=native:

real    0m0.610s
user    0m0.560s
sys     0m0.000s

With -march=native:

real    0m1.271s
user    0m1.268s
sys     0m0.000s
[Bug fortran/78611] New: -march=native makes code 3x slower
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78611

Bug ID: 78611
Summary: -march=native makes code 3x slower
Product: gcc
Version: 6.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: fortran
Assignee: unassigned at gcc dot gnu.org
Reporter: pepalogik at seznam dot cz
Target Milestone: ---

Created attachment 40199 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40199&action=edit Source code, include files, and inputs

Hi, I encountered the problem in version 5.4.0, then installed 6.2.0, and it's still the same. Details below and test case attached.

jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1 $ gfortran-6 -v
Using built-in specs.
COLLECT_GCC=gfortran-6
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/6/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 6.2.0-3ubuntu11~16.04' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 6.2.0 20160901 (Ubuntu 6.2.0-3ubuntu11~16.04)

jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1 $ gfortran-6 phsh1.f -std=legacy -I. -o default/phsh1
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1 $ cd default/
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1/default $ time ./phsh1 < ../bmtz
Slab or Bulk calculation? input 1 for Slab or 0 for Bulk
Input the MTZ value from the substrate calculation

real    72m51.345s
user    72m48.584s
sys     0m0.968s

jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1/default $ cd ..
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1 $ gfortran-6 phsh1.f -std=legacy -I. -march=native -o march/phsh1
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1 $ cd march/
jenda@VivoBook ~/Bug reports/gfortran/6/PhSh1/march $ time ./phsh1 < ../bmtz
Slab or Bulk calculation? input 1 for Slab or 0 for Bulk
Input the MTZ value from the substrate calculation

real    217m56.080s
user    217m52.092s
sys     0m1.096s

As shown, code compiled with -march=native is 3x slower. All outputs are identical, so it is solely a performance issue. Adding -O3 isn't very helpful. My CPU is an Intel(R) Core(TM) i3-3217U CPU @ 1.80GHz with these flags:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer xsave avx f16c lahf_lm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts

The code is an old, single-threaded F77 program calculating crystal potentials. A profiler shows that almost all the time is spent in subroutine MTZ.
[Bug fortran/52621] ICE when compiling Fortran77 code with optimization
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52621 --- Comment #3 from Jan Lachnitt pepalogik at seznam dot cz 2012-03-21 12:18:59 UTC --- Thanks for testing and for the link to GFortranBinaries. I have just installed the very recent unofficial build of gfortran:

Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=c:/program files/gfortran/bin/../libexec/gcc/i586-pc-mingw32/4.8.0/lto-wrapper.exe
Target: i586-pc-mingw32
Configured with: ../gcc-trunk/configure --prefix=/mingw --enable-languages=c,fortran --with-gmp=/home/brad/gfortran/dependencies --disable-werror --enable-threads --disable-nls --build=i586-pc-mingw32 --enable-libgomp --enable-shared --disable-win32-registry --with-dwarf2 --disable-sjlj-exceptions --enable-lto
Thread model: win32
gcc version 4.8.0 20120319 (experimental) [trunk revision 185521] (GCC)

The result is that the ICE is still there. There are just two changes. First, there are some more warnings, and second, the ICE is reported at a different line number within the GCC source: tree-data-ref.c:1964.
[Bug fortran/52621] New: ICE when compiling Fortran77 code with optimization
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52621

Bug #: 52621
Summary: ICE when compiling Fortran77 code with optimization
Classification: Unclassified
Product: gcc
Version: 4.6.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: fortran
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: pepalo...@seznam.cz

Created attachment 26920 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26920 Library source producing the ICE

I am compiling an old Fortran 77 code on Windows XP. I have fixed this code to make it basically work in both the FTN95 (Silverfrost) and gfortran compilers. But when I try to make a highly optimized build with gfortran, I get an ICE.

Compiler version:

C:\MinGW\bin\gfortran.exe -v
Using built-in specs.
COLLECT_GCC=gfortran.exe
COLLECT_LTO_WRAPPER=c:/mingw/bin/../libexec/gcc/mingw32/4.6.1/lto-wrapper.exe
Target: mingw32
Configured with: ../gcc-4.6.1/configure --enable-languages=c,c++,fortran,objc,obj-c++ --disable-sjlj-exceptions --with-dwarf2 --enable-shared --enable-libgomp --disable-win32-registry --enable-libstdcxx-debug --enable-version-specific-runtime-libs --build=mingw32 --prefix=/mingw
Thread model: win32
gcc version 4.6.1 (GCC)

Command:

gfortran.exe -std=legacy -march=native -mfpmath=sse -m3dnow -mmmx -msse -msse2 -msse3 -O3 -Wall -c D:\Jenda\cbp\SATLEED\LEEDSATL_SB\leedsatl_sb.f -o obj\Release\leedsatl_sb.o

CPU: AMD Athlon X2, see http://www.cpu-world.com/CPUs/K8/AMD-Athlon%20X2%204850e%20-%20ADH4850IAA5DO%20(ADH4850DOBOX).html

Important: The ICE is gone if I decrease the optimization level to -O2 or exclude the machine-specific options (from -march to -msse3). The code and output are attached.

Copyright note: The code comes from http://www.ap.cityu.edu.hk/personal-website/Van-Hove_files/leed/leedpack.html and I am actually not allowed to distribute it.
[Bug fortran/52621] ICE when compiling Fortran77 code with optimization
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52621 --- Comment #1 from Jan Lachnitt pepalogik at seznam dot cz 2012-03-19 16:13:22 UTC --- Created attachment 26921 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26921 Compiler output
[Bug c++/35159] g++ and gfortran inoperable with no error message
--- Comment #22 from pepalogik at seznam dot cz 2008-09-21 15:02 --- I'm probably not the one who'll find the core of the bug, but I'd like to mention two simple facts:

1. mingw-w64-bin_i686-mingw_20080707 WORKS
2. mingw-w64-bin_x86_64-mingw_20080724 DOESN'T WORK (Vista64 SP1)

I don't use it currently, so I haven't tried new versions. Btw, I think it's GCC v. 4.4.0 (experimental) instead of 4.3.0.

pepalogik at seznam dot cz changed:

           What    |Removed |Added
           CC      |        |pepalogik at seznam dot cz

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35159
[Bug rtl-optimization/323] optimized code gives strange floating point results
--- Comment #117 from pepalogik at seznam dot cz 2008-06-24 20:12 --- (In reply to comment #116)

> > Yes, but this requires quite a complicated workaround (solution (4) in my comment #109).
>
> The problem is on the compiler side, which could store every result of a cast or an assignment to memory (this is inefficient, but that's what you get with the x87, and the ISO C language could be blamed too for *requiring* something like that instead of being more flexible).

> > So you could say that the IEEE754 double precision type is available even on a processor without any FPU because this can be emulated using integers.
>
> Yes, but a conforming implementation would be the processor + a library, not just the processor with its instruction set.

> > Moreover, if we assess things pedantically, the workaround (4) still doesn't fully obey the IEEE single/double precision type(s), because there remains the problem of double rounding of denormals.
>
> As I said, in this particular case (underflow/overflow), double rounding is allowed by the IEEE standard. It may not be allowed by some languages (e.g. XPath, and Java in some mode) for good or bad reasons, but this is another problem.

OK, thanks for the explanation. I think now it's clear.

> > I quote, too: Applies To Microsoft#174; Visual C++#174;

Now I assume that it follows the MS-Windows API (though nothing is certain with Microsoft). And the other compilers under MS-Windows could (or should) do the same thing. By a lucky hit, I have found this in the GCC documentation:

-mpc32
-mpc64
-mpc80
    Set 80387 floating-point precision to 32, 64 or 80 bits. When '-mpc32' is specified, the significands of results of floating-point operations are rounded to 24 bits (single precision); '-mpc64' rounds the significands of results of floating-point operations to 53 bits (double precision) and '-mpc80' rounds the significands of results of floating-point operations to 64 bits (extended double precision), which is the default.
    When this option is used, floating-point operations in higher precisions are not available to the programmer without setting the FPU control word explicitly. [...]

So GCC sets extended precision by default. And it's easy to change it.

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323
[Bug rtl-optimization/323] optimized code gives strange floating point results
--- Comment #114 from pepalogik at seznam dot cz 2008-06-22 16:59 --- (In reply to comment #113)

> It is available when storing a result to memory.

Yes, but this requires quite a complicated workaround (solution (4) in my comment #109). So you could say that the IEEE754 double precision type is available even on a processor without any FPU because this can be emulated using integers. Moreover, if we assess things pedantically, the workaround (4) still doesn't fully obey the IEEE single/double precision type(s), because there remains the problem of double rounding of denormals.

> The IEEE754-1985 allows this. Section 4.3: "Normally, a result is rounded to the precision of its destination. However, some systems deliver results only to double or extended destinations. On such a system the user, which may be a high-level language compiler, shall be able to specify that a result be rounded instead to single precision, though it may be stored in the double or extended format with its wider exponent range. [...]" [...] AFAIK, the IEEE754-1985 standard was designed from the x87 implementation, so it would have been very surprising that x87 didn't conform to IEEE754-1985.

So it seems I was wrong, but then the IEEE754-1985 standard is also quite wrong.

> > Do you mean that on Windows, long double has (by default) no more precision than double? I don't think so (it's confirmed by my experience).
>
> I don't remember my original reference, but here's a new one: http://msdn.microsoft.com/en-us/library/aa289157(vs.71).aspx In fact, this depends on the architecture. I quote: "x86. Intermediate expressions are computed at the default 53-bit precision with an extended range [...]"

I quote, too: "Applies To Microsoft#174; Visual C++#174;"

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323
[Bug rtl-optimization/323] optimized code gives strange floating point results
--- Comment #115 from pepalogik at seznam dot cz 2008-06-22 17:28 --- That #174; should be (R). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323
[Bug rtl-optimization/323] optimized code gives strange floating point results
--- Comment #112 from pepalogik at seznam dot cz 2008-06-21 22:38 --- (In reply to comment #111)

> Concerning the standards: The x87 FPU does obey the IEEE754-1985 standard, which *allows* extended precision, and double precision is *available*.

It's true that double *precision* is available on the x87. But not the *IEEE-754 double precision type*. Besides the precision of the mantissa, this also includes the range of the exponent. On the x87, it is possible to set the precision of the mantissa but not the range of the exponent. That's why I believe it doesn't obey the IEEE standard. (I haven't ever seen the IEEE-754 standard itself, but I base this on the work of David Monniaux.)

> Note: the solution chosen by some OS'es (*BSD, MS-Windows...) is to configure the processor to the IEEE double precision by default (thus long double is also in double precision, but this is OK as far as the C language is concerned; there's still a problem with float, but in practice, nobody cares AFAIK).

Do you mean that on Windows, long double has (by default) no more precision than double? I don't think so (it's confirmed by my experience). According to the paper of David Monniaux, only FreeBSD 4 sets double precision by default (but I know almost nothing about BSD).

> > (1) A very simple solution: Use long double everywhere.
>
> This avoids the bug, but this is not possible for software that requires double precision exactly, e.g. XML tools that use XPath.

Yes, of course. I don't say this can be used everywhere.

> > (But be careful when transferring binary data in long double format between computers because this format is not standardized and so the concrete bit representations vary between different CPU architectures.)
>
> Well, this is not specific to long double anyway: there exist 3 possible endiannesses for the double format (x86, PowerPC, ARM).

OK, but David Monniaux mentions portability issues just in the case of long double, so the differences are probably more frequent in this case (maybe even within the x86 architecture).
> Yes, but note that this is not the only problem with compilers. See e.g. http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36578

Thanks for the info.

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323
[Bug rtl-optimization/323] optimized code gives strange floating point results
--- Comment #110 from pepalogik at seznam dot cz 2008-06-12 14:14 --- I used an old version of GCC documentation so I omitted some new processors with SSE: core2, k8-sse3, opteron-sse3, athlon64-sse3, amdfam10 and barcelona. I think you can use -march=pentium3 for all Intel's CPUs (of course, starting with P3). I'm unsure about AMD. (Maybe you know it better.) -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323
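Whether a given machine can take the SSE options from the previous comments can be checked directly. This is a sketch for Linux, assuming the feature-flag names as they appear in /proc/cpuinfo:

```shell
# Decide whether -msse -mfpmath=sse is safe on this machine by checking
# the CPU feature flags the Linux kernel exposes in /proc/cpuinfo.
if grep -qw sse /proc/cpuinfo; then
    echo "SSE available: -msse -mfpmath=sse should work"
else
    echo "no SSE: stick to the x87 workarounds"
fi
```

The same check with `sse2`, `sse3`, etc. tells you which of the additional -msse2/-msse3 options apply.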
[Bug rtl-optimization/323] optimized code gives strange floating point results
--- Comment #109 from pepalogik at seznam dot cz 2008-05-20 16:59 --- I also encountered such problems and was going to report them as a bug in GCC... But in the GCC bug (not) reporting guide, there is fortunately a link to this page, and here (comment #96) is a link to David Monniaux's paper about floating-point computations. This explains it closely, but it is maybe too long. I have read almost all of it and hope I have understood it properly. So I'll give a brief explanation (for those who don't know it yet) of the reasons for such strange behaviour. Then I'll assess where the bug actually is (in GCC or the CPU). Then I'll write the solution (!) and finally a few recommendations to the GCC team.

EXPLANATION

The x87 FPU was originally designed in (or before) 1980. I think that's why it is quite simple: it has only one unit for all FP data types. Of course, the precision must be that of the widest type, which is the 80-bit long double. Suppose you have a program where all the FP variables are of type double. You perform some FP operations, and one of them is e.g. 1e-300/1e300, which results in 1e-600. Although this value cannot be held by a double, it is stored in an 80-bit FPU register as the result. Suppose you use the variable x to hold that result. If the program has been compiled with optimization, the value need not be stored in RAM. So, say, it is still in the register. Suppose you need x to be nonzero, so you perform the test x != 0. Since 1e-600 is not zero, the test yields true. While you perform some other computations, the value is moved to RAM and converted to 0 because x is of type double. Now you want to use your certainly nonzero x... Hard luck :-( Note that if the result doesn't have its own variable and you perform the test directly on an expression, the problem can come to light even without optimization. It could seem that performing all FP operations in extended precision can bring benefits only.
But it introduces a serious pitfall: moving a value may change the value!!!

WHERE'S THE BUG

This is really not a GCC bug. The bug is actually in the x87 FPU, because it doesn't obey the IEEE standard.

SOLUTION

The x87 FPU is still present in contemporary processors (including AMD) due to compatibility. I think most PC software still uses it. But new processors also have another FPU, called SSE, and this one does obey the IEEE standard. GCC in 32-bit mode compiles for the x87 by default, but it is able to compile for SSE, too. So the solution is to add these options to the compilation command: -march=* -msse -mfpmath=sse Yes, this definitely resolves the problem - but not for all processors. The * can be one of the following: pentium3, pentium3m, pentium-m, pentium4, pentium4m, prescott, nocona, athlon-4, athlon-xp, athlon-mp, k8, opteron, athlon64, athlon-fx and c3-2 (I'm unsure about athlon and athlon-tbird). Besides -msse, you can also add some of -mmmx, -msse2, -msse3 and -m3dnow, if the CPU supports them (see the GCC doc or CPU doc). If you wish to compile for processors which don't have SSE, you have a few possibilities:

(1) A very simple solution: Use long double everywhere. (But be careful when transferring binary data in long double format between computers, because this format is not standardized and so the concrete bit representations vary between different CPU architectures.)

(2) A partial but simple solution: Do comparisons on volatile variables only.

(3) A similar solution: Try to implement a discard_extended_precision function as suggested by Egon in comment #88.

(4) A complex solution: Before doing any mathematical operation or comparison, put the operands into variables, and put the result into a variable as well (i.e. don't use complex expressions). For example, instead of { c = 2*(a+b); } , write { double s = a+b; c = 2*s; } . I'm unsure about arrays, but I think they should be OK.
When you have modified your code in this manner, then compile it either without optimization or, when using optimization, with -ffloat-store. In order to avoid double rounding (i.e. rounding twice), it is also good to decrease the FPU precision by changing its control word at the beginning of your program (see comment #60). Then you should also apply -frounding-math.

(5) A radical solution: Find a job/hobby where computers are not used at all.

RECOMMENDATIONS

I think this problem is really serious and general. Therefore, programmers should be warned soon enough. This recommendation should be addressed especially to authors of programming coursebooks. But I think there could also be a paragraph about it in the GCC documentation (I haven't read it in full, but there doesn't seem to be any warning against the x87). And, of course, there should be a warning in the bug reporting guide (http://gcc.gnu.org/bugs.html). It's good that there's a link to this page (Bug 323), but the example with (int)(a/b) is insufficient. It only demonstrates that real numbers are often not represented exactly in the computer. It doesn't demonstrate