Re: How to get GCC on par with ICC?

2018-06-22 Thread Szabolcs Nagy

On 11/06/18 11:05, Martin Jambor wrote:

>> The int rate numbers (running 1 copy only) were not too bad, GCC was
>> only about 2% slower and only 525.x264_r seemed way slower with GCC.
>> The fp rate numbers (again only 1 copy) showed a larger difference,
>> around 20%.  521.wrf_r was more than twice as slow when compiled with
>> GCC instead of ICC and 503.bwaves_r and 510.parest_r also showed
>> significant slowdowns when compiled with GCC vs. ICC.



> Keep in mind that when discussing FP benchmarks, the used math library
> can be (almost) as important as the compiler.  In the case of 481.wrf,
> we found that the GCC 8 + glibc 2.26 (so the "out-of-the box" GNU)
> performance is about 70% of ICC's.  When we just linked against AMD's
> libm, we got to 83%. When we instructed GCC to generate calls to Intel's
> SVML library and linked against it, we got to 91%.  Using both SVML and
> AMD's libm, we achieved 93%.



I think glibc 2.27 should outperform AMD's libm on wrf
(since I upstreamed the single-precision code from
https://github.com/ARM-software/optimized-routines/ ).

The 83% -> 93% diff is because GCC fails to vectorize
math calls in Fortran into libmvec calls.


> That means that there likely still is 7% to be gained from more clever
> optimizations in GCC but the real problem is in GNU libm.  And 481.wrf
> is perhaps the most extreme example but definitely not the only one.


There is no longer a problem in GNU libm for the most
common single-precision calls, and if things go well
then glibc 2.28 will get double-precision improvements
too.

But GCC has to learn how to use libmvec in Fortran.
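
For contrast, here is an editor's sketch of the C/C++ case that already works; the flags and mangled symbol name are illustrative assumptions, not taken from this thread:

```cpp
#include <cmath>

// Hedged sketch (editor's example): with glibc's libmvec on x86-64,
// GCC can vectorize this loop into SIMD math calls such as
// _ZGVbN4v_expf when built with something like
//   g++ -O2 -ffast-math
// because glibc's math.h declares SIMD variants of expf via
// `#pragma omp declare simd`.  The missing piece discussed above is
// the same transformation for Fortran math intrinsics.
void vexp(float *y, const float *x, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = std::exp(x[i]);   // float overload, i.e. expf
}
```

Scalar semantics are unchanged; only the generated calls differ.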


Re: How to get GCC on par with ICC?

2018-06-21 Thread Steve Ellcey
On Wed, 2018-06-20 at 17:11 -0400, NightStrike wrote:
> 
> If I could perhaps jump in here for a moment...  Just today I hit upon
> a series of small (in lines of code) loops that gcc can't vectorize,
> and intel vectorizes like a madman.  They all involve a lot of heavy
> use of std::vector<std::vector<...>>.  Comparisons were with gcc
> 8.1, intel 2018.u1, an AMD Opteron 6386 SE, with the program running
> as sched_FIFO, mlockall, affinity set to its own core, and all
> interrupts vectored off that core.  So, as close to not-noisy as
> possible.

There are quite a number of bugzilla reports with examples where GCC
does not vectorize a loop.  I wonder if this example is related to
PR 61247.

Steve Ellcey


Re: How to get GCC on par with ICC?

2018-06-21 Thread Richard Biener
On Wed, Jun 20, 2018 at 11:12 PM NightStrike  wrote:
>
> On Wed, Jun 6, 2018 at 11:57 AM, Joel Sherrill  wrote:
> >
> > On Wed, Jun 6, 2018 at 10:51 AM, Paul Menzel <
> > pmenzel+gcc.gnu@molgen.mpg.de> wrote:
> >
> > > Dear GCC folks,
> > >
> > >
> > > Some scientists in our organization still want to use the Intel compiler,
> > > as they say, it produces faster code, which is then executed on clusters.
> > > Some resources on the Web [1][2] confirm this. (I am aware, that it’s
> > > heavily dependent on the actual program.)
> > >
> >
> > Do they have specific examples where icc is better for them? Or can point
> > to specific GCC PRs which impact them?
> >
> >
> > GCC versions?
> >
> > Are there specific CPU model variants of concern?
> >
> > What flags are used to compile? Sometimes a bit of advice can produce
> > improvements.
> >
> > Without specific examples, it is hard to set goals.
>
> If I could perhaps jump in here for a moment...  Just today I hit upon
> a series of small (in lines of code) loops that gcc can't vectorize,
> and intel vectorizes like a madman.  They all involve a lot of heavy
> use of std::vector<std::vector<...>>.  Comparisons were with gcc

Ick - C++ ;)

> 8.1, intel 2018.u1, an AMD Opteron 6386 SE, with the program running
> as sched_FIFO, mlockall, affinity set to its own core, and all
> interrupts vectored off that core.  So, as close to not-noisy as
> possible.
>
> I was surprised at the results, but using each compiler's method of
> dumping vectorization info, intel wins on two points:
>
> 1) It actually vectorizes
> 2) Its vectorization output is much more easily readable
>
> Options were:
>
> gcc -Wall -ggdb3 -std=gnu++17 -flto -Ofast -march=native
>
> vs:
>
> icc -Ofast -std=gnu++14
>
>
> So, not exactly exact, but pretty close.
>
>
> So here's an example of a chunk of code (not very readable, sorry
> about that) that intel can vectorize, and subsequently make about 50%
> faster:
>
> std::size_t nLayers { input.nn.size() };
> // std::size_t ySize = std::max_element(input.nn.cbegin(), input.nn.cend(),
> //     [](auto a, auto b){ return a.size() < b.size(); })->size();
> std::size_t ySize = 0;
> for (auto const & nn: input.nn)
>     ySize = std::max(ySize, nn.size());
>
> float yNorm[ySize];
> for (auto & y: yNorm)
>     y = 0.0f;
> for (std::size_t i = 0; i < xSize; ++i)
>     yNorm[i] = xNorm[i];
> for (std::size_t layer = 0; layer < nLayers; ++layer) {
>     auto & nn = input.nn[layer];
>     auto & b = nn.back();
>     float y[ySize];
>     for (std::size_t i = 0; i < nn[0].size(); ++i) {
>         y[i] = b[i];
>         for (std::size_t j = 0; j < nn.size() - 1; ++j)
>             y[i] += nn.at(j).at(i) * yNorm[j];
>     }
>     for (std::size_t i = 0; i < ySize; ++i) {
>         if (layer < nLayers - 1)
>             y[i] = std::max(y[i], 0.0f);
>         yNorm[i] = y[i];
>     }
> }
>
>
> If I was better at godbolt, I could show the asm, but I'm not.  I'm
> willing to learn, though.

A compilable testcase would be more useful - just file a bugzilla.

Richard.


Re: How to get GCC on par with ICC?

2018-06-20 Thread NightStrike
On Wed, Jun 6, 2018 at 11:57 AM, Joel Sherrill  wrote:
>
> On Wed, Jun 6, 2018 at 10:51 AM, Paul Menzel <
> pmenzel+gcc.gnu@molgen.mpg.de> wrote:
>
> > Dear GCC folks,
> >
> >
> > Some scientists in our organization still want to use the Intel compiler,
> > as they say, it produces faster code, which is then executed on clusters.
> > Some resources on the Web [1][2] confirm this. (I am aware, that it’s
> > heavily dependent on the actual program.)
> >
>
> Do they have specific examples where icc is better for them? Or can point
> to specific GCC PRs which impact them?
>
>
> GCC versions?
>
> Are there specific CPU model variants of concern?
>
> What flags are used to compile? Sometimes a bit of advice can produce
> improvements.
>
> Without specific examples, it is hard to set goals.

If I could perhaps jump in here for a moment...  Just today I hit upon
a series of small (in lines of code) loops that gcc can't vectorize,
and intel vectorizes like a madman.  They all involve a lot of heavy
use of std::vector<std::vector<...>>.  Comparisons were with gcc
8.1, intel 2018.u1, an AMD Opteron 6386 SE, with the program running
as sched_FIFO, mlockall, affinity set to its own core, and all
interrupts vectored off that core.  So, as close to not-noisy as
possible.

I was surprised at the results, but using each compiler's method of
dumping vectorization info, intel wins on two points:

1) It actually vectorizes
2) Its vectorization output is much more easily readable

Options were:

gcc -Wall -ggdb3 -std=gnu++17 -flto -Ofast -march=native

vs:

icc -Ofast -std=gnu++14


So, not exactly exact, but pretty close.


So here's an example of a chunk of code (not very readable, sorry
about that) that intel can vectorize, and subsequently make about 50%
faster:

std::size_t nLayers { input.nn.size() };
// std::size_t ySize = std::max_element(input.nn.cbegin(), input.nn.cend(),
//     [](auto a, auto b){ return a.size() < b.size(); })->size();
std::size_t ySize = 0;
for (auto const & nn: input.nn)
    ySize = std::max(ySize, nn.size());

float yNorm[ySize];
for (auto & y: yNorm)
    y = 0.0f;
for (std::size_t i = 0; i < xSize; ++i)
    yNorm[i] = xNorm[i];
for (std::size_t layer = 0; layer < nLayers; ++layer) {
    auto & nn = input.nn[layer];
    auto & b = nn.back();
    float y[ySize];
    for (std::size_t i = 0; i < nn[0].size(); ++i) {
        y[i] = b[i];
        for (std::size_t j = 0; j < nn.size() - 1; ++j)
            y[i] += nn.at(j).at(i) * yNorm[j];
    }
    for (std::size_t i = 0; i < ySize; ++i) {
        if (layer < nLayers - 1)
            y[i] = std::max(y[i], 0.0f);
        yNorm[i] = y[i];
    }
}


If I was better at godbolt, I could show the asm, but I'm not.  I'm
willing to learn, though.


Re: How to get GCC on par with ICC?

2018-06-15 Thread Joseph Myers
On Fri, 15 Jun 2018, Jeff Law wrote:

> And resolution on -fno-math-errno as the default.  Setting errno can be
> more expensive than people realize.

I don't think I saw any version of the -fno-math-errno patch proposal that 
included the testsuite updates I'd expect.  Certainly 
gcc.dg/torture/pr68264.c tests libm functions setting errno and would need 
to use -fmath-errno explicitly, but it seems likely there are other tests 
involving built-in functions that in fact only test what they're intended 
to test given -fmath-errno; tests using libm functions without explicit 
-ffast-math / -fmath-errno / -fno-math-errno would need review (and there 
should be new tests for optimizations that are only valid given 
-fno-math-errno).
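
To make concrete what such tests exercise, here is an editor's illustration (not from the thread) of the behaviour -fmath-errno preserves:

```cpp
#include <cerrno>
#include <cmath>

// Editor's illustration: on glibc (where math_errhandling includes
// MATH_ERRNO), a domain error such as sqrt(-1.0) sets errno to EDOM,
// so with -fmath-errno GCC must keep the libm call (or a fallback
// branch to it) instead of emitting a lone sqrtsd instruction.  With
// -fno-math-errno that bookkeeping can be dropped, enabling inlining
// and vectorization of the call.
bool sqrt_domain_error_sets_errno(double v) {
    volatile double arg = v;   // keep the call from being constant-folded
    errno = 0;
    double r = std::sqrt(arg);
    return std::isnan(r) && errno == EDOM;
}
```

Whether errno is actually set is a property of the C library, which is why tests of this behaviour need explicit -fmath-errno.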

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: How to get GCC on par with ICC?

2018-06-15 Thread Jeff Law
On 06/15/2018 05:39 AM, Wilco Dijkstra wrote:
> Martin wrote:
> 
>> Keep in mind that when discussing FP benchmarks, the used math library
>> can be (almost) as important as the compiler.  In the case of 481.wrf,
>> we found that the GCC 8 + glibc 2.26 (so the "out-of-the box" GNU)
>> performance is about 70% of ICC's.  When we just linked against AMD's
>> libm, we got to 83%. When we instructed GCC to generate calls to Intel's
>> SVML library and linked against it, we got to 91%.  Using both SVML and
>> AMD's libm, we achieved 93%.
>>
>> That means that there likely still is 7% to be gained from more clever
>> optimizations in GCC but the real problem is in GNU libm.  And 481.wrf
>> is perhaps the most extreme example but definitely not the only one.
> 
> You really should retry with GLIBC 2.27 since several key math functions were
> rewritten from scratch by Szabolcs Nagy (all in generic C code), resulting in
> huge performance gains on all targets (eg. wrf improved over 50%).
> 
> I fixed several double precision functions in current GLIBC to avoid
> extremely bad performance which had been complained about for years. There
> are more math functions on the way, so the GNU libm will not only catch up,
> but become the fastest math library available.
And resolution on -fno-math-errno as the default.  Setting errno can be
more expensive than people realize.

Jeff


Re: How to get GCC on par with ICC?

2018-06-15 Thread Wilco Dijkstra
Martin wrote:

> Keep in mind that when discussing FP benchmarks, the used math library
> can be (almost) as important as the compiler.  In the case of 481.wrf,
> we found that the GCC 8 + glibc 2.26 (so the "out-of-the box" GNU)
> performance is about 70% of ICC's.  When we just linked against AMD's
> libm, we got to 83%. When we instructed GCC to generate calls to Intel's
> SVML library and linked against it, we got to 91%.  Using both SVML and
> AMD's libm, we achieved 93%.
>
> That means that there likely still is 7% to be gained from more clever
> optimizations in GCC but the real problem is in GNU libm.  And 481.wrf
> is perhaps the most extreme example but definitely not the only one.

You really should retry with GLIBC 2.27 since several key math functions were
rewritten from scratch by Szabolcs Nagy (all in generic C code), resulting in
huge performance gains on all targets (eg. wrf improved over 50%).

I fixed several double precision functions in current GLIBC to avoid
extremely bad performance which had been complained about for years. There
are more math functions on the way, so the GNU libm will not only catch up,
but become the fastest math library available.

Wilco

Re: How to get GCC on par with ICC?

2018-06-11 Thread Martin Jambor
Hi Steve,

On Fri, Jun 08 2018, Steve Ellcey wrote:
> On Thu, 2018-06-07 at 12:01 +0200, Richard Biener wrote:
>> 
>> When we do our own comparisons of GCC vs. ICC on benchmarks
>> like SPEC CPU 2006/2017 ICC doesn't have a big lead over GCC
>> (in fact it even trails in some benchmarks) unless you get to
>> "SPEC tricks" like data structure re-organization optimizations that
>> probably never apply in practice on real-world code (and people
>> should fix such things at the source level being pointed at them
>> via actually profiling their codes).
>
> Richard,
>
> I was wondering if you have any more details about these comparisons
> you have done that you can share?  Compiler versions, options used,
> hardware, etc.  Also, were there any tests that stood out in terms of
> icc outperforming GCC?

Mostly AMD Ryzen, GCC 8 vs ICC 18.  We were comparing a few combinations
of options.  When we compared ICC's and our -Ofast (with or without
native GCC march/mtune, and a set of ICC options that hopefully generate
the best code for Ryzen), we found out that without LTO/IPO, GCC is
actually slightly ahead of ICC on integer benchmarks (both SPEC 2006 and
2017).

Floating-point results were a more mixed bag (mostly because ICC
performed surprisingly poorly without IPO on a few benchmarks), but at
least on SPEC 2017 ICC's results were clearly better... with a caveat,
see below my comment about wrf.

With LTO/IPO, ICC can perform a few memory-reorg tricks that push them
quite a bit ahead of us but I'm not convinced they can perform these
transformations on much source code that happens not to be a well known
benchmark.  So I'd recommend always looking at non-IPO numbers too.

>
> I did a compare of SPEC 2017 rate using GCC 8.* (pre release) and
> a recent ICC (2018.0.128?) on my desktop (Xeon CPU E5-1650 v4).
> I used '-xHost -O3' for icc and '-march=native -mtune=native -O3'
> for gcc.

Please try with -Ofast too.  The main reason is that -O3 does not imply
-ffast-math, and the performance gain from it is often very big (I
suspect the 525.x264_r difference is because of that).  Alternatively,
if your own workloads require high-precision floating-point math, you
have to force ICC to use it to get a fair comparison.  -Ofast also turns
on -fno-protect-parens and -fstack-arrays, which help a few benchmarks a
lot, but note that you may need to set a large stack ulimit for them not
to crash (ICC does the same thing, as far as we know).

>
> The int rate numbers (running 1 copy only) were not too bad, GCC was
> only about 2% slower and only 525.x264_r seemed way slower with GCC.
> The fp rate numbers (again only 1 copy) showed a larger difference, 
> around 20%.  521.wrf_r was more than twice as slow when compiled with
> GCC instead of ICC and 503.bwaves_r and 510.parest_r also showed
> significant slowdowns when compiled with GCC vs. ICC.
>

Keep in mind that when discussing FP benchmarks, the math library used
can be (almost) as important as the compiler.  In the case of 481.wrf,
we found that the GCC 8 + glibc 2.26 (so the "out-of-the box" GNU)
performance is about 70% of ICC's.  When we just linked against AMD's
libm, we got to 83%. When we instructed GCC to generate calls to Intel's
SVML library and linked against it, we got to 91%.  Using both SVML and
AMD's libm, we achieved 93%.

That means that there likely still is 7% to be gained from more clever
optimizations in GCC but the real problem is in GNU libm.  And 481.wrf
is perhaps the most extreme example but definitely not the only one.

Martin


Re: How to get GCC on par with ICC?

2018-06-08 Thread Marc Glisse

On Fri, 8 Jun 2018, Steve Ellcey wrote:


> On Thu, 2018-06-07 at 12:01 +0200, Richard Biener wrote:
>
>> When we do our own comparisons of GCC vs. ICC on benchmarks
>> like SPEC CPU 2006/2017 ICC doesn't have a big lead over GCC
>> (in fact it even trails in some benchmarks) unless you get to
>> "SPEC tricks" like data structure re-organization optimizations that
>> probably never apply in practice on real-world code (and people
>> should fix such things at the source level being pointed at them
>> via actually profiling their codes).


> Richard,
>
> I was wondering if you have any more details about these comparisons
> you have done that you can share?  Compiler versions, options used,
> hardware, etc.  Also, were there any tests that stood out in terms of
> icc outperforming GCC?
>
> I did a compare of SPEC 2017 rate using GCC 8.* (pre release) and
> a recent ICC (2018.0.128?) on my desktop (Xeon CPU E5-1650 v4).
> I used '-xHost -O3' for icc and '-march=native -mtune=native -O3'
> for gcc.


You should use -Ofast for gcc. As mentioned earlier in the discussion,
ICC has some equivalent of -ffast-math by default.



> The int rate numbers (running 1 copy only) were not too bad, GCC was
> only about 2% slower and only 525.x264_r seemed way slower with GCC.
> The fp rate numbers (again only 1 copy) showed a larger difference,
> around 20%.  521.wrf_r was more than twice as slow when compiled with
> GCC instead of ICC and 503.bwaves_r and 510.parest_r also showed
> significant slowdowns when compiled with GCC vs. ICC.


--
Marc Glisse


Re: How to get GCC on par with ICC?

2018-06-08 Thread Steve Ellcey
On Thu, 2018-06-07 at 12:01 +0200, Richard Biener wrote:
> 
> When we do our own comparisons of GCC vs. ICC on benchmarks
> like SPEC CPU 2006/2017 ICC doesn't have a big lead over GCC
> (in fact it even trails in some benchmarks) unless you get to
> "SPEC tricks" like data structure re-organization optimizations that
> probably never apply in practice on real-world code (and people
> should fix such things at the source level being pointed at them
> via actually profiling their codes).

Richard,

I was wondering if you have any more details about these comparisons
you have done that you can share?  Compiler versions, options used,
hardware, etc.  Also, were there any tests that stood out in terms of
icc outperforming GCC?

I did a compare of SPEC 2017 rate using GCC 8.* (pre release) and
a recent ICC (2018.0.128?) on my desktop (Xeon CPU E5-1650 v4).
I used '-xHost -O3' for icc and '-march=native -mtune=native -O3'
for gcc.

The int rate numbers (running 1 copy only) were not too bad, GCC was
only about 2% slower and only 525.x264_r seemed way slower with GCC.
The fp rate numbers (again only 1 copy) showed a larger difference, 
around 20%.  521.wrf_r was more than twice as slow when compiled with
GCC instead of ICC and 503.bwaves_r and 510.parest_r also showed
significant slowdowns when compiled with GCC vs. ICC.

Steve Ellcey
sell...@cavium.com


Re: How to get GCC on par with ICC?

2018-06-07 Thread Richard Biener
On Wed, Jun 6, 2018 at 5:52 PM Paul Menzel
 wrote:
>
> Dear GCC folks,
>
>
> Some scientists in our organization still want to use the Intel
> compiler, as they say, it produces faster code, which is then executed
> on clusters. Some resources on the Web [1][2] confirm this. (I am aware,
> that it’s heavily dependent on the actual program.)
>
> My question is, is it realistic, that GCC could catch up and that the
> scientists will start to use it over Intel’s compiler? Or will Intel
> developers always have the lead, because they have secret documentation
> and direct contact with the processor designers?

They will of course have an edge in timing when supporting a new
architecture because they have access to NDA material and hardware.  For
example, the OSS community doesn't yet have access to any AVX512-capable
machine (speaking of the GNU compile-farm), and those are prohibitively
expensive for a private contributor.

Similar stories apply to the access to proprietary benchmarks or simply
having resources to continuously work with folks in HPC to make sure ICC
works great for their codes.

> If it is realistic, how can we get there? Would first the program be
> written, and then the compiler be optimized for that? Or are just more
> GCC developers needed?

I think a big part of the story is perception and training.  This means that
for example a coherent and up-to-date source for information on how
to use GCC in a HPC environment (optimizing your code, recommended
compiler options, pitfalls to avoid, etc.) is desperately missing.

When we do our own comparisons of GCC vs. ICC on benchmarks
like SPEC CPU 2006/2017 ICC doesn't have a big lead over GCC
(in fact it even trails in some benchmarks) unless you get to
"SPEC tricks" like data structure re-organization optimizations that
probably never apply in practice on real-world code (and people
should fix such things at the source level being pointed at them
via actually profiling their codes).

In my own experience, which dates back nearly 15 years now, ICC is
buggy (generates wrong code and hence wrong simulation results) and
cannot compile a "simple" C++ program ;)  This made me start working on GCC.

Note that the very best strength of GCC is the first-class high-quality
(insert more buzzwords here) support infrastructure if you actually
run into issues with the compiler!  Even when using paid ICC I never
got timely fixes (if at all) for wrong-code issues I reported to them!

I've separately replied to specific points in other posts where ICC has
an edge over GCC.

Richard.

>
> Kind regards,
>
> Paul
>
>
> [1]: https://colfaxresearch.com/compiler-comparison/
> [2]:
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679.1280=rep1=pdf
>


Re: How to get GCC on par with ICC?

2018-06-07 Thread Richard Biener
On Wed, Jun 6, 2018 at 8:31 PM Ryan Burn  wrote:
>
> One case where ICC can generate much faster code sometimes is by using
> the nontemporal pragma [https://software.intel.com/en-us/node/524559]
> with loops.
>
> AFAIK, there's no such equivalent pragma in gcc
> [https://gcc.gnu.org/ml/gcc/2012-01/msg00028.html].
>
> When I tried this simple example
> https://github.com/rnburn/square_timing/blob/master/bench.cpp that
> measures times for this loop:
>
> void compute(const double* x, index_t N, double* y) {
>   #pragma vector nontemporal
>   for(index_t i=0; i<N; ++i)
>     y[i] = x[i]*x[i];
> }
>
>  with and without nontemporal I got these times (N = 1,000,000)
>
> Temporal 1,042,080
> Non-Temporal 538,842
>
> So running with the non-temporal pragma was nearly twice as fast.
>
> An equivalent non-temporal pragma for GCC would, IMO, certainly be a
> very good feature to add.

GCC has robust infrastructure for loop pragmas now; it's just that the set
of pragmas available isn't very big.  It would be interesting to know which
ICC ones people use regularly so we can support those in GCC as well.

Note that using #pragmas is very much hand-optimizing the code for the
compiler you use - something that is possible for GCC as well.
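
As an editor's illustration of one loop pragma GCC already provides (the nontemporal one discussed above is not among them):

```cpp
#include <cstddef>

// Editor's example: `#pragma GCC ivdep` asserts that the marked loop
// has no loop-carried dependences, letting the vectorizer skip its
// runtime alias checks.  (GCC also provides `#pragma GCC unroll N`.)
// Semantics are unchanged; only the optimizer's assumptions differ.
void saxpy(float *y, const float *x, std::size_t n, float a) {
#pragma GCC ivdep
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```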

Richard.

> On Wed, Jun 6, 2018 at 12:22 PM, Dmitry Mikushin  wrote:
> > Dear Paul,
> >
> > The opinion you've mentioned is common in scientific community. However, in
> > more detail it often surfaces that the used set of GCC compiler options
> > simply does not correspond to that "fast" version of Intel. For instance,
> > when you do "-O3" for Intel it actually corresponds to (at least) "-O3
> > -ffast-math -march=native" of GCC. Omitting "-ffast-math" obviously
> > introduces significant performance gap.
> >
> > Kind regards,
> > - Dmitry Mikushin | Applied Parallel Computing LLC |
> > https://parallel-computing.pro
> >
> >
> > 2018-06-06 18:51 GMT+03:00 Paul Menzel :
> >
> >> Dear GCC folks,
> >>
> >>
> >> Some scientists in our organization still want to use the Intel compiler,
> >> as they say, it produces faster code, which is then executed on clusters.
> >> Some resources on the Web [1][2] confirm this. (I am aware, that it’s
> >> heavily dependent on the actual program.)
> >>
> >> My question is, is it realistic, that GCC could catch up and that the
> >> scientists will start to use it over Intel’s compiler? Or will Intel
> >> developers always have the lead, because they have secret documentation and
> >> direct contact with the processor designers?
> >>
> >> If it is realistic, how can we get there? Would first the program be
> >> written, and then the compiler be optimized for that? Or are just more GCC
> >> developers needed?
> >>
> >>
> >> Kind regards,
> >>
> >> Paul
> >>
> >>
> >> [1]: https://colfaxresearch.com/compiler-comparison/
> >> [2]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679
> >> .1280=rep1=pdf
> >>
> >>


Re: How to get GCC on par with ICC?

2018-06-07 Thread Richard Biener
On Wed, Jun 6, 2018 at 11:10 PM Zan Lynx  wrote:
>
> On 06/06/2018 10:22 AM, Dmitry Mikushin wrote:
> > The opinion you've mentioned is common in scientific community. However, in
> > more detail it often surfaces that the used set of GCC compiler options
> > simply does not correspond to that "fast" version of Intel. For instance,
> > when you do "-O3" for Intel it actually corresponds to (at least) "-O3
> > -ffast-math -march=native" of GCC. Omitting "-ffast-math" obviously
> > introduces significant performance gap.
> >
>
> Please note that if your compute cluster uses different models of CPU,
> be extremely careful with -march=native.
>
> I've been bitten by it in VMs, several times. Unless you always run on
> the same system that did the build, you are running a risk of illegal
> instructions.

Yes.  Note this is where ICC has an advantage because it supports
automagically doing runtime versioning based on the CPU instruction
set for vectorized loops.  We only support that in an awkward
explicit way (the manual talks about this in the 'Function Multiversioning'
section).
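
The "awkward explicit way" can be sketched as follows (editor's example; the ISA list is an illustrative assumption, and it requires a glibc-based x86-64 Linux toolchain with ifunc support):

```cpp
#include <cstddef>

// Editor's sketch of GCC Function Multiversioning: target_clones
// compiles the function once per listed ISA and dispatches to the
// best version at load time via an ifunc resolver, similar in effect
// to ICC's automatic CPU dispatch for vectorized loops.
__attribute__((target_clones("avx2", "default")))
void scale(float *y, const float *x, std::size_t n, float a) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i];
}
```

Unlike ICC, the programmer has to mark each function explicitly.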

But in the end it's just a "detail" that can be worked around with
a little inconvenience ;)  (I've yet to see a heterogeneous cluster
where the instruction set differences make a performance difference
over choosing the lowest common one.)

Richard.

> --
> Knowledge is Power -- Power Corrupts
> Study Hard -- Be Evil


Re: How to get GCC on par with ICC?

2018-06-06 Thread Zan Lynx
On 06/06/2018 10:22 AM, Dmitry Mikushin wrote:
> The opinion you've mentioned is common in scientific community. However, in
> more detail it often surfaces that the used set of GCC compiler options
> simply does not correspond to that "fast" version of Intel. For instance,
> when you do "-O3" for Intel it actually corresponds to (at least) "-O3
> -ffast-math -march=native" of GCC. Omitting "-ffast-math" obviously
> introduces significant performance gap.
> 

Please note that if your compute cluster uses different models of CPU,
be extremely careful with -march=native.

I've been bitten by it in VMs, several times. Unless you always run on
the same system that did the build, you are running a risk of illegal
instructions.

-- 
Knowledge is Power -- Power Corrupts
Study Hard -- Be Evil


Re: How to get GCC on par with ICC?

2018-06-06 Thread Ryan Burn
One case where ICC can generate much faster code sometimes is by using
the nontemporal pragma [https://software.intel.com/en-us/node/524559]
with loops.

AFAIK, there's no such equivalent pragma in gcc
[https://gcc.gnu.org/ml/gcc/2012-01/msg00028.html].

When I tried this simple example
https://github.com/rnburn/square_timing/blob/master/bench.cpp that
measures times for this loop:

void compute(const double* x, index_t N, double* y) {
  #pragma vector nontemporal
  for(index_t i=0; i<N; ++i)
    y[i] = x[i]*x[i];
}

with and without nontemporal I got these times (N = 1,000,000)

Temporal 1,042,080
Non-Temporal 538,842

So running with the non-temporal pragma was nearly twice as fast.

An equivalent non-temporal pragma for GCC would, IMO, certainly be a
very good feature to add.

On Wed, Jun 6, 2018 at 12:22 PM, Dmitry Mikushin wrote:
> Dear Paul,
>
> The opinion you've mentioned is common in scientific community. However, in
> more detail it often surfaces that the used set of GCC compiler options
> simply does not correspond to that "fast" version of Intel. For instance,
> when you do "-O3" for Intel it actually corresponds to (at least) "-O3
> -ffast-math -march=native" of GCC. Omitting "-ffast-math" obviously
> introduces significant performance gap.
>
> Kind regards,
> - Dmitry Mikushin | Applied Parallel Computing LLC |
> https://parallel-computing.pro
>
>
> 2018-06-06 18:51 GMT+03:00 Paul Menzel :
>
>> Dear GCC folks,
>>
>>
>> Some scientists in our organization still want to use the Intel compiler,
>> as they say, it produces faster code, which is then executed on clusters.
>> Some resources on the Web [1][2] confirm this. (I am aware, that it’s
>> heavily dependent on the actual program.)
>>
>> My question is, is it realistic, that GCC could catch up and that the
>> scientists will start to use it over Intel’s compiler? Or will Intel
>> developers always have the lead, because they have secret documentation and
>> direct contact with the processor designers?
>>
>> If it is realistic, how can we get there? Would first the program be
>> written, and then the compiler be optimized for that? Or are just more GCC
>> developers needed?
>>
>>
>> Kind regards,
>>
>> Paul
>>
>>
>> [1]: https://colfaxresearch.com/compiler-comparison/
>> [2]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679
>> .1280=rep1=pdf
>>
>>
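
Until GCC grows such a pragma, a hedged workaround (editor's sketch, not from the thread) is to write the streaming stores by hand with x86 SSE2 intrinsics:

```cpp
#include <cstddef>
#include <immintrin.h>  // SSE2 streaming-store intrinsics

// Editor's sketch: approximates ICC's `#pragma vector nontemporal`
// for the squaring loop measured above by using _mm_stream_pd, which
// writes results to memory bypassing the cache.  Assumes x and y are
// 16-byte aligned and n is a multiple of 2.
void compute_nt(const double *x, std::size_t n, double *y) {
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d v = _mm_load_pd(&x[i]);
        _mm_stream_pd(&y[i], _mm_mul_pd(v, v));  // y[i] = x[i] * x[i]
    }
    _mm_sfence();  // order the streaming stores before later loads
}
```

This is hand-tuning for one target, which is exactly the portability cost a compiler pragma would avoid.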


Re: How to get GCC on par with ICC?

2018-06-06 Thread Dmitry Mikushin
Dear Paul,

The opinion you've mentioned is common in the scientific community.
However, in more detail it often surfaces that the set of GCC compiler
options used simply does not correspond to that "fast" version of Intel.
For instance, when you do "-O3" for Intel it actually corresponds to (at
least) "-O3 -ffast-math -march=native" of GCC.  Omitting "-ffast-math"
obviously introduces a significant performance gap.

Kind regards,
- Dmitry Mikushin | Applied Parallel Computing LLC |
https://parallel-computing.pro


2018-06-06 18:51 GMT+03:00 Paul Menzel :

> Dear GCC folks,
>
>
> Some scientists in our organization still want to use the Intel compiler,
> as they say, it produces faster code, which is then executed on clusters.
> Some resources on the Web [1][2] confirm this. (I am aware, that it’s
> heavily dependent on the actual program.)
>
> My question is, is it realistic, that GCC could catch up and that the
> scientists will start to use it over Intel’s compiler? Or will Intel
> developers always have the lead, because they have secret documentation and
> direct contact with the processor designers?
>
> If it is realistic, how can we get there? Would first the program be
> written, and then the compiler be optimized for that? Or are just more GCC
> developers needed?
>
>
> Kind regards,
>
> Paul
>
>
> [1]: https://colfaxresearch.com/compiler-comparison/
> [2]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679
> .1280=rep1=pdf
>
>


Re: How to get GCC on par with ICC?

2018-06-06 Thread Bin.Cheng
On Wed, Jun 6, 2018 at 3:51 PM, Paul Menzel
 wrote:
> Dear GCC folks,
>
>
> Some scientists in our organization still want to use the Intel compiler, as
> they say, it produces faster code, which is then executed on clusters. Some
> resources on the Web [1][2] confirm this. (I am aware, that it’s heavily
> dependent on the actual program.)
>
> My question is, is it realistic, that GCC could catch up and that the
> scientists will start to use it over Intel’s compiler? Or will Intel
> developers always have the lead, because they have secret documentation and
> direct contact with the processor designers?
>
> If it is realistic, how can we get there? Would first the program be
> written, and then the compiler be optimized for that? Or are just more GCC
> developers needed?
There are developers actually working on performance optimization in
GCC, so you are not the only one :).  As an open-source compiler we do
lack resources, so more developers are always good for the project.  As
Joel pointed out, typical/reduced workloads showing the performance gap
are very important for our developers as well as for attracting new
developers.  We can probably open a meta-bug for tracking if you have
many of these example workloads.

Thanks,
bin
>
>
> Kind regards,
>
> Paul
>
>
> [1]: https://colfaxresearch.com/compiler-comparison/
> [2]:
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679.1280=rep1=pdf
>


Re: How to get GCC on par with ICC?

2018-06-06 Thread Paul Menzel

Dear Joel,


Thank you for your quick reply.


On 06/06/18 17:57, Joel Sherrill wrote:

> On Wed, Jun 6, 2018 at 10:51 AM, Paul Menzel wrote:
>
>> Some scientists in our organization still want to use the Intel compiler,
>> as they say, it produces faster code, which is then executed on clusters.
>> Some resources on the Web [1][2] confirm this. (I am aware that it’s
>> heavily dependent on the actual program.)
>
> Do they have specific examples where icc is better for them? Or can point
> to specific GCC PRs which impact them?
>
> GCC versions?
>
> Are there specific CPU model variants of concern?
>
> What flags are used to compile? Sometimes a bit of advice can produce
> improvements.
>
> Without specific examples, it is hard to set goals.


I could get such examples, but it will take some time, as they are from
other institutes.


The clusters use exclusively Intel processors. (Hopefully, that will 
change.)


I also found the article from the German Linux-Magazin in an English
version at the ADMIN Magazine [3]. The German article had a stronger
statement, that they use the Intel compilers due to performance reasons.



>> My question is, is it realistic, that GCC could catch up and that the
>> scientists will start to use it over Intel’s compiler? Or will Intel
>> developers always have the lead, because they have secret documentation
>> and direct contact with the processor designers?
>>
>> If it is realistic, how can we get there? Would first the program be
>> written, and then the compiler be optimized for that? Or are just more
>> GCC developers needed?


> For sure examples are needed so there are test cases to use for reference.
>
> If you want anything improved in any free software project, sponsoring
> developers is always a good thing. If you sponsor the right developers. :)


That’s what I hoped for, but didn’t ask here. If you could point me to a 
list of possible contractors, that would be great.


Please keep in mind, that in my organization certain decisions are made 
*very* slowly. I’ll try to get answers quickly, but procuring finances 
might take longer (half a year or much longer).



Kind regards,

Paul



[1]: https://colfaxresearch.com/compiler-comparison/
[2]: 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679.1280=rep1=pdf
[3] 
http://www.admin-magazine.com/HPC/Articles/Selecting-Compilers-for-a-Supercomputer 
   "HPC Compilers"






Re: How to get GCC on par with ICC?

2018-06-06 Thread Joel Sherrill
On Wed, Jun 6, 2018 at 10:51 AM, Paul Menzel <
pmenzel+gcc.gnu@molgen.mpg.de> wrote:

> Dear GCC folks,
>
>
> Some scientists in our organization still want to use the Intel compiler,
> as they say, it produces faster code, which is then executed on clusters.
> Some resources on the Web [1][2] confirm this. (I am aware, that it’s
> heavily dependent on the actual program.)
>

Do they have specific examples where icc is better for them? Or can point
to specific GCC PRs which impact them?

GCC versions?

Are there specific CPU model variants of concern?

What flags are used to compile? Sometimes a bit of advice can produce
improvements.

Without specific examples, it is hard to set goals.


> My question is, is it realistic, that GCC could catch up and that the
> scientists will start to use it over Intel’s compiler? Or will Intel
> developers always have the lead, because they have secret documentation and
> direct contact with the processor designers?
>
> If it is realistic, how can we get there? Would first the program be
> written, and then the compiler be optimized for that? Or are just more GCC
> developers needed?
>

For sure examples are needed so there are test cases to use for reference.

If you want anything improved in any free software project, sponsoring
developers is always a good thing. If you sponsor the right developers. :)

I'm not discouraging you. I'm just trying to turn this into something
actionable.

--joel sherrill


>
>
> Kind regards,
>
> Paul
>
>
> [1]: https://colfaxresearch.com/compiler-comparison/
> [2]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679
> .1280=rep1=pdf
>
>