[Bug bootstrap/84402] [meta] GCC build system: parallelism bottleneck

2019-11-07 Thread giuliano.belinassi at usp dot br
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402

--- Comment #32 from Giuliano Belinassi  ---
(In reply to Eric Gallager from comment #31)
> I think this came up at Cauldron, but I forget what exactly people said
> about it...

Actually this PR comes before Cauldron 2019. One way to fix this issue is to
make the match.pd parser output several smaller gimple-match.c, and add these
to the Makefile. Also repeat this procedure to other big files.

Another solution is to parallelize GCC internals and make GCC communicate with
Make somehow so that when a CPU is idle, it starts compiling some files in
parallel.

[Bug tree-optimization/90292] GCC Fails to hoist loop invariant in nested loops

2019-04-30 Thread giuliano.belinassi at usp dot br
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90292

--- Comment #2 from Giuliano Belinassi  ---
Just for the sake of completeness, this issue is not addressed by just changing
the iterators to 'int'. However, it is in fact solved by changing the iterators
to 'unsigned long', 'long', or doing a cast in each macro. GCC even generates
faster code if one of these two last changes are made.

Also, the statement in matrix_dgemm_2 should be  *c += a * (*b); for
correctness.

[Bug tree-optimization/90292] New: GCC Fails to hoist loop invariant in nested loops

2019-04-30 Thread giuliano.belinassi at usp dot br
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90292

Bug ID: 90292
   Summary: GCC Fails to hoist loop invariant in nested loops
   Product: gcc
   Version: tree-ssa
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: giuliano.belinassi at usp dot br
  Target Milestone: ---

Both GCC 8.3.0 and 9.0.1 fails to hoist the index calculation for 'i' and 'k'
variables of the following function (matrix multiplication with blocking):

void matrix_dgemm_2(unsigned n, double *restrict _C, double *restrict _A,
double *restrict _B)
{
#define BLOCK_I 128
#define BLOCK_K 8

#define A(i, j) _A[n*(i) + (j)]
#define B(i, j) _B[n*(i) + (j)]
#define C(i, j) _C[n*(i) + (j)]
unsigned ii, kk, i, j, k;

for (ii = 0; ii < n; ii += BLOCK_I)
for (kk = 0; kk < n; kk += BLOCK_K)
for (i = ii; i < ii + BLOCK_I; ++i)
for (k = kk; k < kk + BLOCK_K; ++k)
for (j = 0; j < n; ++j)
   C(i, j) += A(i, k) * B(k, j);
}

which then generate the following GIMPLE code on the deepest nested block (-O2)

   [local count: 955630225]:
  # ivtmp.3_88 = PHI <_1(6), ivtmp.3_87(4)>
  _3 = (long unsigned int) ivtmp.3_88;
  _4 = _3 * 8;
  _5 = _C_34(D) + _4; 
  _6 = *_5;
  _13 = ivtmp.14_81 + ivtmp.3_88;
  _14 = (long unsigned int) _13;
  _15 = _14 * 8;
  _16 = _B_36(D) + _15;
  _17 = *_16;
  _18 = _11 * _17;
  _19 = _6 + _18;
  *_5 = _19;
  ivtmp.3_87 = ivtmp.3_88 + 1;
  if (ivtmp.25_71 != ivtmp.3_87)
goto ; [89.00%]
  else
goto ; [11.00%]

If I modify the function code to do the following:

void matrix_dgemm_2(unsigned n, double *restrict _C, double *restrict _A,
double *restrict _B) 
{
#define BLOCK_I 128
#define BLOCK_K 8

#define A(i, j) _A[n*(i) + (j)]
#define B(i, j) _B[n*(i) + (j)]
#define C(i, j) _C[n*(i) + (j)]
unsigned ii, kk, i, j, k;

for (ii = 0; ii < n; ii += BLOCK_I)
for (kk = 0; kk < n; kk += BLOCK_K)
for (i = ii; i < ii + BLOCK_I; ++i)
for (k = kk; k < kk + BLOCK_K; ++k)
{
double  a =  A(i, k); 
double* b = (k, 0); 
double* c = (i, 0); 
for (j = 0; j < n; ++j)
{
*c = a * (*b);
c++; b++;
}
}
}

Then GCC generates the following deepest block: (-O2)

   [local count: 955630224]:
  # ivtmp.1_47 = PHI <0(5), ivtmp.1_46(6)>
  _11 = MEM[base: b_32, index: ivtmp.1_47, step: 8, offset: 0B];
  _12 = _11 * a_30;
  MEM[base: c_34, index: ivtmp.1_47, step: 8, offset: 0B] = _12;
  ivtmp.1_46 = ivtmp.1_47 + 1;
  if (_44 != ivtmp.1_47)
goto ; [89.00%]
  else
goto ; [11.00%]

and therefore it will generate faster final code. With a 2048x2048 matrix at
-O2, this modification only gives about 0.3s of speedup, however at -O3
-march=native in a  Core(TM) i5-8250U, this modification provides a 2x speedup,
arguably due to vectorization.

[Bug libstdc++/86164] std::regex crashes when matching long lines

2019-03-29 Thread giuliano.belinassi at usp dot br
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86164

Giuliano Belinassi  changed:

   What|Removed |Added

 CC||giuliano.belinassi at usp dot 
br

--- Comment #7 from Giuliano Belinassi  ---
It seems that the issue is the backtracking required by the NFA, as it enters
in a deep recursion when calling _M_dfs in _M_main_dispatch
(regex_executor.tcc).

Maybe moving the DFS stack from the recursion stack to the heap and use an
iterative DFS could fix this, but converting the NFA to DFA may be a better
choice, as it removes the backtracking requirement when iterating with the
string.

[Bug bootstrap/84402] [meta] GCC build system: parallelism bottleneck

2019-02-07 Thread giuliano.belinassi at usp dot br
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402

--- Comment #29 from Giuliano Belinassi  ---
> No, the proper fix would be to split the generated files and compile them in 
> parallel. Similarly for all the insn-*.c generated files. That would the 
> proper fix.

Indeed. However, I am working on parallelizing the compilation with threads.
This may lead to a solution, but may not be the best for this scenario.

> Anyway, I like the graph you made :)

Thank you.

> But what version of GCC is this graph, with what exact configuration?

* This is the gcc that I used to build: *

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/8/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 8.2.0-14'
--with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr
--with-gcc-major-version-only --program-suffix=-8
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-vtable-verify --enable-libmpx
--enable-plugin --enable-default-pie --with-system-zlib
--with-target-system-zlib --enable-objc-gc=auto --enable-multiarch
--disable-werror --with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none --without-cuda-driver
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 8.2.0 (Debian 8.2.0-14) 

* The gcc that I built: *

Using built-in specs.
COLLECT_GCC=./xgcc
Target: x86_64-pc-linux-gnu
Configured with: /home/giulianob/gcc_svn/trunk//configure --disable-checking
--disable-bootstrap
Thread model: posix
gcc version 9.0.1 20190205 (experimental) (GCC)

[Bug bootstrap/84402] [meta] GCC build system: parallelism bottleneck

2019-02-07 Thread giuliano.belinassi at usp dot br
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402

Giuliano Belinassi  changed:

   What|Removed |Added

 CC||giuliano.belinassi at usp dot 
br

--- Comment #26 from Giuliano Belinassi  ---
Created attachment 45630
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45630=edit
make -j 64 all-gcc, with --disable-bootstrap, on 64-cores. Blue means
dependency to gimple-match.

Since gimple-match.c takes so long to compile, I was wondering if it might be
possible to reorder the compilation so we can push its compilation early in the
dependency graph.

I did the following steps: 
 1) 'configure --disable-bootstrap'
 2) 'make -j 64 all-gcc'
 3) 'make clean'. 
 4) 'make gimple-match.o' using a wrapper[1] that I created to log all files
required by gimple-match, and plotted the attached graphic. Here, blue means
dependency and the largest bar is the 'gimple-match.c' itself.

I used a 64 cores AMD Opteron 6376 in the process.

Any ideas?

[1] https://github.com/giulianobelinassi/gcc-timer-analysis

[Bug tree-optimization/88186] New: GCC Fails to optimize arithmetic progression

2018-11-24 Thread giuliano.belinassi at usp dot br
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88186

Bug ID: 88186
   Summary: GCC Fails to optimize arithmetic progression
   Product: gcc
   Version: tree-ssa
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: giuliano.belinassi at usp dot br
  Target Milestone: ---

Created attachment 45087
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45087=edit
Somewhat a test case.

GCC fails to optimize simple arithmetic progressions. See attached code.

gcc -Ofast -o induction induction_test.c

./induction 1
Sum is 0, elapsed time: 0
Sum is 0, elapsed time: 0

./induction 6
Sum is 179997, elapsed time: 203883
Sum is 179997, elapsed time: 0

It takes O(n), but the expected time would be O(1) after the optimization.

[Bug tree-optimization/86829] Missing sin(atan(x)) and cos(atan(x)) optimizations

2018-08-13 Thread giuliano.belinassi at usp dot br
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86829

--- Comment #5 from Giuliano Belinassi  ---
I filed the copyright assignment request and sent it to the assign at gnu dot
org. The patch itself I sent to the patch e-mail listing with the subject
"patch to bug #86829".

[Bug tree-optimization/86829] Missing sin(atan(x)) and cos(atan(x)) optimizations

2018-08-02 Thread giuliano.belinassi at usp dot br
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86829

--- Comment #3 from Giuliano Belinassi  ---
(In reply to Marc Glisse from comment #1)
> 
> Do you have a copyright assignment (https://gcc.gnu.org/contribute.html) ?

No. Sorry, but I think I may need help getting this right. Are there any tips?

(In reply to Richard Biener from comment #2)
> likely happens because you use float_type_node instead of 'type' for
> build_one_cst.  I think that if the atan intermediate result has additional
> uses then the substitution may not be beneficial?  Thus you should
> write (atans:s @0) in both places.
> 
> Otherwise this looks OK.  Can you please post the patch to gcc-patches
> and add testcases?

I will fix these issues and add some test cases. I appreciate your help!

[Bug tree-optimization/86829] New: Missing sin(atan(x)) and cos(atan(x)) optimizations

2018-08-02 Thread giuliano.belinassi at usp dot br
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86829

Bug ID: 86829
   Summary: Missing sin(atan(x)) and cos(atan(x)) optimizations
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: giuliano.belinassi at usp dot br
  Target Milestone: ---

Created attachment 44486
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44486=edit
add sin(atan(x)) and cos(atan(x)) substitutions rules.

The file named 'match.pd' does not contain the following simplifications rules:
sin(atan(x)) -> x / sqrt(x*x + 1)  and  cos(atan(x)) -> 1 / sqrt(x*x + 1).
According to the simple brenchmark I made, these substitutions can provide  a
10x speedup in the code. I wrote a patch to add these optimizations.

link to the perf test: https://pastebin.com/5ujSRmhq

assembly dump of the perftest: https://pastebin.com/gLJeWHY8

The code I wrote add an instruction 'CVTSS2SD'. I don't know why it happens.