Thanks Richard for your input, much appreciated.
I followed up on your suggestions; unfortunately, the -Wdisabled-optimization option you suggested did not produce any warnings. I am still trying the --param options one by one, without success. I did get a new hint, though: running the same examples on a MacBook, I don't see the issue at all. There, the
times for 64-bit and 32-bit differ only slightly in both the optimize and debug builds, with 64-bit always about 10% faster in each class. I guess the compiler flags are somehow different; perhaps you or someone else knows which flags are set differently by default between the two. It is hard to compare
the actual speeds, though, because the hardware is different. Here are the specs on the Mac:
gcc: Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn). I don't
know what that means; I expected a number like 4.2.1 or something like that.
CPU: 2.53 GHz Intel Core 2 Duo
Does anything come to mind?
Thanks again for your help,
Ricardo
On 1/20/15 1:21 AM, Richard Biener wrote:
On Tue, Jan 20, 2015 at 4:57 AM, Ricardo Telichevesky <rica...@teli.org> wrote:
Hi,
I have a strange problem with extremely large procedures when generating
64-bit code.
I am using gcc 4.9.2 on RHEL6.3 on a 64-thread 4-socket Xeon E7 4820 with
256GB of memory. No AVX extensions, using the sse option when building the
compiler. This particular code is serial. I made measurements with 32- and
64-bit, both debug -g and optimize -O3, for two different examples (this is a
circuit simulator, and each example is a different circuit that uses
different transistors).
Example A is the one where the effect is more acute. At the bottom of
the e-mail I listed the 3 procedures that consume 90% of the execution time:
a) As a counter-example, the factor code listed is 300 lines of heavily
optimized, hand-written C++ code that behaves as expected: 64-bit optimize
is way faster than any other, up to 15x faster than 32-bit debug (btw great
job in the compiler, it is really shining here).
b) evalTran has 18000 lines of auto-generated code and behaves very
counter-intuitively: 64-bit optimize code is 3x slower than 32-bit optimize
code.
c) evalTranRhs has 5000 lines and is even worse: 64-bit is 4x slower than 32-bit.
Notice that all the data structures in 32-bit code and 64-bit code are
identical and most variables are identical - in fact all integers used are
64-bit, and most operations are floating-point ops. Initially I thought the
64-bit code was a lot bigger than the 32-bit code and the cache was overwhelmed.
In fact the difference in code sizes is not even 10% (at least for debug -
notice I calculated the size of each procedure in bytes), so my
thrash-the-I-cache conjecture seems to be wrong. The overall execution time is
causing us a lot of problems: 64-bit optimize takes 16 seconds, even more than
32-bit debug (10 seconds) and 32-bit optimize (4.8 seconds). Considering we only
care about 64-bit optimize, we have a big problem here.
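To make the ratios above concrete, here is a small Python sketch that recomputes them from the example A tables at the end of this message. The dictionary layout and function names are only illustrative; the numbers are the per-call times (optimized builds) and the code sizes (debug builds) reported there.

```python
# Per-call times in microseconds (optimized builds) and code sizes in bytes
# (debug builds), copied from the example A tables in this message.
per_call_us = {
    "evalTran":    {"32-bit -O3": 250, "64-bit -O3": 763},
    "evalTranRhs": {"32-bit -O3": 151, "64-bit -O3": 666},
}
debug_bytes = {
    "evalTran":    {"32-bit": 231791, "64-bit": 252741},
    "evalTranRhs": {"32-bit": 57501,  "64-bit": 58442},
}

def slowdown(name):
    """How many times slower the 64-bit -O3 build is per call."""
    t = per_call_us[name]
    return t["64-bit -O3"] / t["32-bit -O3"]

def size_growth_percent(name):
    """Growth of the 64-bit debug code size relative to 32-bit, in percent."""
    s = debug_bytes[name]
    return 100.0 * (s["64-bit"] - s["32-bit"]) / s["32-bit"]

for name in per_call_us:
    print(f"{name}: {slowdown(name):.2f}x slower per call, "
          f"code {size_growth_percent(name):.1f}% larger")
```

With these inputs the per-call slowdown comes out at roughly 3x for evalTran and a bit over 4x for evalTranRhs, while the debug code grows by about 9% and under 2% respectively, which is why code size alone does not look like the explanation.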
Example B is not so bad, and in fact 64-bit code is slightly faster than
32-bit code. It would be nice if it went even faster, but if I got A to behave
like that I'd be pretty happy already.
I tried to look at the wide array of optimization options for the code; it
is a dizzying task and I could not get any kind of intuition beyond
-O3 ... Would you have any suggestions for the proper flags for such
ridiculously large auto-generated code that might alleviate this
32-bit vs 64-bit problem? Do you think the fact that this code is in a
dynamically linked library (-fPIC) plays a role?
It's hard to tell without a testcase, but GCC has various limits on
the code sizes its passes deal with, so you might trip one of these, which
would effectively disable optimizations. For example, loop dependence
analysis has a limit on the number of memory references it considers
(--param loop-max-datarefs-for-datadeps, default 1000). Note that not
all such limits are controlled by --params. We have
-Wdisabled-optimization, which should warn if you run into any such
case (but the warning is unfortunately not correctly implemented by
all passes having such limits).
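One way to check whether the 32-bit and 64-bit builds even start from the same optimizer settings is to compare what GCC reports for `-Q --help=optimizers` under each set of flags. The Python sketch below is only an illustration: the helper names are made up, it assumes the GCC driver is reachable as `g++`, and it relies on the report format of one option per line followed by its state. It is GCC-specific, so it is meant for the GCC install rather than the Apple LLVM compiler.

```python
import subprocess

def parse_flag_report(text):
    """Map each option name to its reported state, e.g. '[enabled]'."""
    flags = {}
    for line in text.splitlines():
        parts = line.split()
        # Option lines look like "  -fpeel-loops    [disabled]";
        # header lines do not start with "-" and are skipped.
        if len(parts) >= 2 and parts[0].startswith("-"):
            flags[parts[0]] = parts[-1]
    return flags

def differing_flags(report_a, report_b):
    """Return {option: (state_a, state_b)} for options whose state differs."""
    a, b = parse_flag_report(report_a), parse_flag_report(report_b)
    return {name: (a.get(name), b.get(name))
            for name in sorted(set(a) | set(b))
            if a.get(name) != b.get(name)}

def query_gcc(extra_args, compiler="g++"):
    """Run '<compiler> -Q --help=optimizers <extra_args>' and return its output.

    The compiler name is an assumption; adjust it to the local install.
    """
    cmd = [compiler, "-Q", "--help=optimizers", *extra_args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

A possible use, with the flags taken from the tables in this thread, would be `differing_flags(query_gcc(["-O3", "-m32", "-fPIC", "-msse2"]), query_gcc(["-O3", "-fPIC", "-msse2"]))`. An empty result would suggest the difference comes from target-specific code generation or from an internal limit rather than from a differently enabled optimization flag.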
Thanks,
Richard.
Thanks very much for your help,
Ricardo
All times are wall clock in microseconds - the time for main was checked against the
reported UNIX time and is exact.
example A
==========
evalTran has 18000 lines of C code (two for loops around 99% of the code)
evalTranRhs has 5000 lines of C code (two for loops around 99% of the code)
32 bit debug -g -m32 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
2.503       254536    8335            30  numerical TRAN factor
56.01      5695065    8335           683  evalTran bytes=231791
35.41      3600646   13924           258  evalTranRhs bytes=57501
100       10168242       1      10168242  main @DT@
32 bit optimize -O3 -m32 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
0.710        34442    8335             4  numerical TRAN factor
43.06      2087757    8335           250  evalTran
43.49      2108786   13925           151  evalTranRhs
100        4848520       1       4848520  main @DT@
64 bit debug -g -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
0.973       205144    8335            24  numerical TRAN factor
46.43      9785920    8335          1174  evalTran bytes=252741
49.72     10478888   13924           752  evalTranRhs bytes=58442
100       21077659       1      21077659  main @DT@
64 bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
0.147        23819    8335             2  numerical TRAN factor
39.26      6360254    8335           763  evalTran
57.28      9279087   13924           666  evalTranRhs
100       16198762       1      16198762  main @DT@
example B
=========
evalTran has 10000 lines of C code (two for loops around 99% of the code)
evalTranRhs has 2500 lines of C code (two for loops around 99% of the code)
32-bit debug -g -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
6.55        989826   46612            21  numerical TRAN factor
63.17      9546694   46612           204  evalTran bytes=141478
22.36      3379311   47626            70  evalTranRhs bytes=35871
100       15112540       1      15112540  main @DT@
32-bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
3.012       157060   46612             3  numerical TRAN factor
50.42      2629251   46612            56  evalTran
34.18      1782641   47626            37  evalTranRhs
100        5214827       1       5214827  main @DT@
64-bit debug -g -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
6.439       837743   46612            17  numerical TRAN factor
63.02      8199007   46612           175  evalTran bytes=154542
22.21      2889893   47626            60  evalTranRhs bytes=36487
100       13011058       1      13011058  main @DT@
64-bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
2.389       103855   46612             2  numerical TRAN factor
53.52      2326715   46612            49  evalTran
33.1       1438995   47626            30  evalTranRhs
100        4347691       1       4347691  main @DT@