[Bug bootstrap/84402] [meta] GCC build system: parallelism bottleneck
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402 --- Comment #32 from Giuliano Belinassi --- (In reply to Eric Gallager from comment #31) > I think this came up at Cauldron, but I forget what exactly people said > about it... Actually this PR comes before Cauldron 2019. One way to fix this issue is to make the match.pd parser output several smaller gimple-match.c, and add these to the Makefile. Also repeat this procedure to other big files. Another solution is to parallelize GCC internals and make GCC communicate with Make somehow so that when a CPU is idle, it starts compiling some files in parallel.
[Bug tree-optimization/90292] GCC Fails to hoist loop invariant in nested loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90292 --- Comment #2 from Giuliano Belinassi --- Just for the sake of completeness, this issue is not addressed by just changing the iterators to 'int'. However, it is in fact solved by changing the iterators to 'unsigned long', 'long', or doing a cast in each macro. GCC even generates faster code if one of these two last changes are made. Also, the statement in matrix_dgemm_2 should be *c += a * (*b); for correctness.
[Bug tree-optimization/90292] New: GCC Fails to hoist loop invariant in nested loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90292 Bug ID: 90292 Summary: GCC Fails to hoist loop invariant in nested loops Product: gcc Version: tree-ssa Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: giuliano.belinassi at usp dot br Target Milestone: --- Both GCC 8.3.0 and 9.0.1 fails to hoist the index calculation for 'i' and 'k' variables of the following function (matrix multiplication with blocking): void matrix_dgemm_2(unsigned n, double *restrict _C, double *restrict _A, double *restrict _B) { #define BLOCK_I 128 #define BLOCK_K 8 #define A(i, j) _A[n*(i) + (j)] #define B(i, j) _B[n*(i) + (j)] #define C(i, j) _C[n*(i) + (j)] unsigned ii, kk, i, j, k; for (ii = 0; ii < n; ii += BLOCK_I) for (kk = 0; kk < n; kk += BLOCK_K) for (i = ii; i < ii + BLOCK_I; ++i) for (k = kk; k < kk + BLOCK_K; ++k) for (j = 0; j < n; ++j) C(i, j) += A(i, k) * B(k, j); } which then generate the following GIMPLE code on the deepest nested block (-O2) [local count: 955630225]: # ivtmp.3_88 = PHI <_1(6), ivtmp.3_87(4)> _3 = (long unsigned int) ivtmp.3_88; _4 = _3 * 8; _5 = _C_34(D) + _4; _6 = *_5; _13 = ivtmp.14_81 + ivtmp.3_88; _14 = (long unsigned int) _13; _15 = _14 * 8; _16 = _B_36(D) + _15; _17 = *_16; _18 = _11 * _17; _19 = _6 + _18; *_5 = _19; ivtmp.3_87 = ivtmp.3_88 + 1; if (ivtmp.25_71 != ivtmp.3_87) goto ; [89.00%] else goto ; [11.00%] If I modify the function code to do the following: void matrix_dgemm_2(unsigned n, double *restrict _C, double *restrict _A, double *restrict _B) { #define BLOCK_I 128 #define BLOCK_K 8 #define A(i, j) _A[n*(i) + (j)] #define B(i, j) _B[n*(i) + (j)] #define C(i, j) _C[n*(i) + (j)] unsigned ii, kk, i, j, k; for (ii = 0; ii < n; ii += BLOCK_I) for (kk = 0; kk < n; kk += BLOCK_K) for (i = ii; i < ii + BLOCK_I; ++i) for (k = kk; k < kk + BLOCK_K; ++k) { double a = A(i, k); double* b = (k, 0); double* c = (i, 0); for (j = 0; j < n; ++j) { *c = a * (*b); c++; b++; } } } Then GCC generates the following deepest block: (-O2) [local count: 955630224]: # ivtmp.1_47 = PHI <0(5), ivtmp.1_46(6)> _11 = MEM[base: b_32, index: ivtmp.1_47, step: 8, offset: 0B]; _12 = _11 * a_30; MEM[base: c_34, index: ivtmp.1_47, step: 8, offset: 0B] = _12; ivtmp.1_46 = ivtmp.1_47 + 1; if (_44 != ivtmp.1_47) goto ; [89.00%] else goto ; [11.00%] and therefore it will generate faster final code. With a 2048x2048 matrix at -O2, this modification only gives about 0.3s of speedup, however at -O3 -march=native in a Core(TM) i5-8250U, this modification provides a 2x speedup, arguably due to vectorization.
[Bug libstdc++/86164] std::regex crashes when matching long lines
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86164 Giuliano Belinassi changed: What|Removed |Added CC||giuliano.belinassi at usp dot br --- Comment #7 from Giuliano Belinassi --- It seems that the issue is the backtracking required by the NFA, as it enters in a deep recursion when calling _M_dfs in _M_main_dispatch (regex_executor.tcc). Maybe moving the DFS stack from the recursion stack to the heap and use an iterative DFS could fix this, but converting the NFA to DFA may be a better choice, as it removes the backtracking requirement when iterating with the string.
[Bug bootstrap/84402] [meta] GCC build system: parallelism bottleneck
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402 --- Comment #29 from Giuliano Belinassi --- > No, the proper fix would be to split the generated files and compile them in > parallel. Similarly for all the insn-*.c generated files. That would the > proper fix. Indeed. However, I am working on parallelizing the compilation with threads. This may lead to a solution, but may not be the best for this scenario. > Anyway, I like the graph you made :) Thank you. > But what version of GCC is this graph, with what exact configuration? * This is the gcc that I used to build: * Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/8/lto-wrapper OFFLOAD_TARGET_NAMES=nvptx-none OFFLOAD_TARGET_DEFAULT=1 Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Debian 8.2.0-14' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-8 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 8.2.0 (Debian 8.2.0-14) * The gcc that I built: * Using built-in specs. COLLECT_GCC=./xgcc Target: x86_64-pc-linux-gnu Configured with: /home/giulianob/gcc_svn/trunk//configure --disable-checking --disable-bootstrap Thread model: posix gcc version 9.0.1 20190205 (experimental) (GCC)
[Bug bootstrap/84402] [meta] GCC build system: parallelism bottleneck
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402 Giuliano Belinassi changed: What|Removed |Added CC||giuliano.belinassi at usp dot br --- Comment #26 from Giuliano Belinassi --- Created attachment 45630 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45630=edit make -j 64 all-gcc, with --disable-bootstrap, on 64-cores. Blue means dependency to gimple-match. Since gimple-match.c takes so long to compile, I was wondering if it might be possible to reorder the compilation so we can push its compilation early in the dependency graph. I did the following steps: 1) 'configure --disable-bootstrap' 2) 'make -j 64 all-gcc' 3) 'make clean'. 4) 'make gimple-match.o' using a wrapper[1] that I created to log all files required by gimple-match, and plotted the attached graphic. Here, blue means dependency and the largest bar is the 'gimple-match.c' itself. I used a 64 cores AMD Opteron 6376 in the process. Any ideas? [1] https://github.com/giulianobelinassi/gcc-timer-analysis
[Bug tree-optimization/88186] New: GCC Fails to optimize arithmetic progression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88186 Bug ID: 88186 Summary: GCC Fails to optimize arithmetic progression Product: gcc Version: tree-ssa Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: giuliano.belinassi at usp dot br Target Milestone: --- Created attachment 45087 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45087=edit Somewhat a test case. GCC fails to optimize simple arithmetic progressions. See attached code. gcc -Ofast -o induction induction_test.c ./induction 1 Sum is 0, elapsed time: 0 Sum is 0, elapsed time: 0 ./induction 6 Sum is 179997, elapsed time: 203883 Sum is 179997, elapsed time: 0 It takes O(n), but the expected time would be O(1) after the optimization.
[Bug tree-optimization/86829] Missing sin(atan(x)) and cos(atan(x)) optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86829 --- Comment #5 from Giuliano Belinassi --- I filed the copyright assignment request and sent it to the assign at gnu dot org. The patch itself I sent to the patch e-mail listing with the subject "patch to bug #86829".
[Bug tree-optimization/86829] Missing sin(atan(x)) and cos(atan(x)) optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86829 --- Comment #3 from Giuliano Belinassi --- (In reply to Marc Glisse from comment #1) > > Do you have a copyright assignment (https://gcc.gnu.org/contribute.html) ? No. Sorry, but I think I may need help getting this right. Are there any tips? (In reply to Richard Biener from comment #2) > likely happens because you use float_type_node instead of 'type' for > build_one_cst. I think that if the atan intermediate result has additional > uses then the substitution may not be beneficial? Thus you should > write (atans:s @0) in both places. > > Otherwise this looks OK. Can you please post the patch to gcc-patches > and add testcases? I will fix these issues and add some test cases. I appreciate your help!
[Bug tree-optimization/86829] New: Missing sin(atan(x)) and cos(atan(x)) optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86829 Bug ID: 86829 Summary: Missing sin(atan(x)) and cos(atan(x)) optimizations Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: giuliano.belinassi at usp dot br Target Milestone: --- Created attachment 44486 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44486=edit add sin(atan(x)) and cos(atan(x)) substitutions rules. The file named 'match.pd' does not contain the following simplifications rules: sin(atan(x)) -> x / sqrt(x*x + 1) and cos(atan(x)) -> 1 / sqrt(x*x + 1). According to the simple brenchmark I made, these substitutions can provide a 10x speedup in the code. I wrote a patch to add these optimizations. link to the perf test: https://pastebin.com/5ujSRmhq assembly dump of the perftest: https://pastebin.com/gLJeWHY8 The code I wrote add an instruction 'CVTSS2SD'. I don't know why it happens.