Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-19 Thread Artem Belevich via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rL270094: [CUDA] Enable fusing FP ops (-ffp-contract=fast) for 
CUDA by default. (authored by tra).

Changed prior to commit:
  http://reviews.llvm.org/D20341?vs=57541=57833#toc

Repository:
  rL LLVM

http://reviews.llvm.org/D20341

Files:
  cfe/trunk/lib/Frontend/CompilerInvocation.cpp
  cfe/trunk/test/CodeGenCUDA/fp-contract.cu

Index: cfe/trunk/test/CodeGenCUDA/fp-contract.cu
===
--- cfe/trunk/test/CodeGenCUDA/fp-contract.cu
+++ cfe/trunk/test/CodeGenCUDA/fp-contract.cu
@@ -0,0 +1,32 @@
+// REQUIRES: x86-registered-target
+// REQUIRES: nvptx-registered-target
+
+// By default we should fuse multiply/add into fma instruction.
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -disable-llvm-passes -o - %s | FileCheck -check-prefix ENABLED %s
+
+// Explicit -ffp-contract=fast
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -ffp-contract=fast -disable-llvm-passes -o - %s \
+// RUN:   | FileCheck -check-prefix ENABLED %s
+
+// Explicit -ffp-contract=on -- fusing by front-end (disabled).
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -ffp-contract=on -disable-llvm-passes -o - %s \
+// RUN:   | FileCheck -check-prefix DISABLED %s
+
+// Explicit -ffp-contract=off should disable instruction fusing.
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -ffp-contract=off -disable-llvm-passes -o - %s \
+// RUN:   | FileCheck -check-prefix DISABLED %s
+
+
+#include "Inputs/cuda.h"
+
+__host__ __device__ float func(float a, float b, float c) { return a + b * c; }
+// ENABLED:   fma.rn.f32
+// ENABLED-NEXT:  st.param.f32
+
+// DISABLED:  mul.rn.f32
+// DISABLED-NEXT: add.rn.f32
+// DISABLED-NEXT: st.param.f32
Index: cfe/trunk/lib/Frontend/CompilerInvocation.cpp
===
--- cfe/trunk/lib/Frontend/CompilerInvocation.cpp
+++ cfe/trunk/lib/Frontend/CompilerInvocation.cpp
@@ -2255,10 +2255,15 @@
   LangOpts.ObjCExceptions = 1;
   }
 
-  // During CUDA device-side compilation, the aux triple is the triple used for
-  // host compilation.
-  if (LangOpts.CUDA && LangOpts.CUDAIsDevice) {
-Res.getTargetOpts().HostTriple = Res.getFrontendOpts().AuxTriple;
+  if (LangOpts.CUDA) {
+// During CUDA device-side compilation, the aux triple is the
+// triple used for host compilation.
+if (LangOpts.CUDAIsDevice)
+  Res.getTargetOpts().HostTriple = Res.getFrontendOpts().AuxTriple;
+
+// Set default FP_CONTRACT to FAST.
+if (!Args.hasArg(OPT_ffp_contract))
+  Res.getCodeGenOpts().setFPContractMode(CodeGenOptions::FPC_Fast);
   }
 
   // FIXME: Override value name discarding when asan or msan is used because 
the


Index: cfe/trunk/test/CodeGenCUDA/fp-contract.cu
===
--- cfe/trunk/test/CodeGenCUDA/fp-contract.cu
+++ cfe/trunk/test/CodeGenCUDA/fp-contract.cu
@@ -0,0 +1,32 @@
+// REQUIRES: x86-registered-target
+// REQUIRES: nvptx-registered-target
+
+// By default we should fuse multiply/add into fma instruction.
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -disable-llvm-passes -o - %s | FileCheck -check-prefix ENABLED %s
+
+// Explicit -ffp-contract=fast
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -ffp-contract=fast -disable-llvm-passes -o - %s \
+// RUN:   | FileCheck -check-prefix ENABLED %s
+
+// Explicit -ffp-contract=on -- fusing by front-end (disabled).
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -ffp-contract=on -disable-llvm-passes -o - %s \
+// RUN:   | FileCheck -check-prefix DISABLED %s
+
+// Explicit -ffp-contract=off should disable instruction fusing.
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -ffp-contract=off -disable-llvm-passes -o - %s \
+// RUN:   | FileCheck -check-prefix DISABLED %s
+
+
+#include "Inputs/cuda.h"
+
+__host__ __device__ float func(float a, float b, float c) { return a + b * c; }
+// ENABLED:   fma.rn.f32
+// ENABLED-NEXT:  st.param.f32
+
+// DISABLED:  mul.rn.f32
+// DISABLED-NEXT: add.rn.f32
+// DISABLED-NEXT: st.param.f32
Index: cfe/trunk/lib/Frontend/CompilerInvocation.cpp
===
--- cfe/trunk/lib/Frontend/CompilerInvocation.cpp
+++ cfe/trunk/lib/Frontend/CompilerInvocation.cpp
@@ -2255,10 +2255,15 @@
   LangOpts.ObjCExceptions = 1;
   }
 
-  // During CUDA device-side compilation, the aux triple is the triple used for
-  // host compilation.
-  if (LangOpts.CUDA && LangOpts.CUDAIsDevice) {
-Res.getTargetOpts().HostTriple = Res.getFrontendOpts().AuxTriple;
+  if (LangOpts.CUDA) {
+// During CUDA device-side compilation, the aux triple is the
+// 

Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-19 Thread Artem Belevich via cfe-commits
tra added a subscriber: chandlerc.
tra added a comment.

Short version of offline discussion with @chandlerc : Default of 
-ffp-contract=fast for CUDA is fine.


http://reviews.llvm.org/D20341



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-19 Thread Justin Lebar via cfe-commits
jlebar accepted this revision.
jlebar added a comment.
This revision is now accepted and ready to land.

Well, if the CUDA documentation says so...let's do it.  :)  Thanks for your 
patience, everyone.


http://reviews.llvm.org/D20341



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-18 Thread Hal Finkel via cfe-commits
hfinkel added a comment.

In http://reviews.llvm.org/D20341#432586, @jlebar wrote:

> > But people also don't expect IEEE compliance on GPUs
>
>
> Is that true?


Yes.

> You have a lot more experience with this than I do, but my observation of 
> nvidia's hardware is that it's moved to add *more* IEEE compliance as it's 
> matured.  For example, older hardware didn't support denormals, but newer 
> chips do.  Surely that's in response to some users.


This is also true, but user expectations change slowly.

> One of our goals with CUDA in clang is to make device code as similar as 
> possible to host code.  Throwing out IEEE compliance seems counter to that 
> goal.

> 

> I also don't see the bright line here.  Like, if we can FMA to our heart's 
> content, where do we draw the line wrt IEEE compliance?  Do we turn on 
> flush-denormals-to-zero by default?  Do we use approximate transcendental 
> functions instead of the more accurate ones?  Do we assume floating point 
> arithmetic is associative?  What is the principle that leads us to do FMAs 
> but not these other optimizations?

> 

> In addition, CUDA != GPUs.  Maybe this is something to turn on by default for 
> NVPTX, although I'm still pretty uncomfortable with that.  Prior art in other 
> compilers is interesting, but I think it's notable that clang doesn't do this 
> for any other targets (afaict?) despite the fact that gcc does.

> 

> The main argument I see for this is "nvcc does it, and people will think 
> clang is slow if we don't".  That's maybe not a bad argument, but it makes me 
> sad.  :(




In http://reviews.llvm.org/D20341#433344, @tra wrote:

> I don't think using FMA throws away IEEE compliance.
>
> IEEE 784-2008 says:
>
> > A language standard should also define, and require implementations to 
> > provide, attributes that allow and
>
> >  disallow value-changing optimizations, separately or collectively, for a 
> > block. These optimizations might
>
> >  include, but are not limited to:
>
> >  ...
>
> >  ― Synthesis of a fusedMultiplyAdd operation from a multiplication and an 
> > addition
>
>
> It sounds like FMA use is up to user/language and IEEE standard is fine with 
> it either way.


That's correct. FMA formation is allowed, although the default for this, and 
how it's done is unfortunately a function of many aspects of the programming 
environment (language, target platform, etc.).

> We need to establish what is the language standard that we need to adhere to. 
> C++ standard itself does not seem to say much about FP precision or 
> particular FP format.

> 

> C11 standard (ISO/IEC 9899:201x draft, 7.12.2) says:

> 

> > The default state (‘‘on’’ or ‘‘off’’) for the [FP_CONTRACT] pragma is 
> > implementation-defined.

> 

> 

> Nvidia has fairly detailed description of their FP.

>  http://docs.nvidia.com/cuda/floating-point/index.html#fused-multiply-add-fma

> 

> > The fused multiply-add operator on the GPU has high performance and 
> > increases the accuracy of computations. **No special flags or function 
> > calls are needed to gain this benefit in CUDA programs**. Understand that a 
> > hardware fused multiply-add operation is not yet available on the CPU, 
> > which can cause differences in numerical results.

> 

> 

> At the moment it's the most specific guideline I managed to find regarding 
> expected FP behavior applicable to CUDA.


I think this is the most important point. IEEE allows an implementation choice 
here, and users who already have working CUDA code have tested that code within 
that context. This is different from the host's choice (at least on x86), but 
users already expect this. There is a performance impact, but there's also a 
numerical impact, and I don't think we do our users any favors by differing 
from NVIDIA here.


http://reviews.llvm.org/D20341



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-18 Thread Artem Belevich via cfe-commits
tra added a comment.

I don't think using FMA throws away IEEE compliance.

IEEE 784-2008 says:

> A language standard should also define, and require implementations to 
> provide, attributes that allow and

>  disallow value-changing optimizations, separately or collectively, for a 
> block. These optimizations might

>  include, but are not limited to:

>  ...

>  ― Synthesis of a fusedMultiplyAdd operation from a multiplication and an 
> addition


It sounds like FMA use is up to user/language and IEEE standard is fine with it 
either way.

We need to establish what is the language standard that we need to adhere to. 
C++ standard itself does not seem to say much about FP precision or particular 
FP format.

C11 standard (ISO/IEC 9899:201x draft, 7.12.2) says:

> The default state (‘‘on’’ or ‘‘off’’) for the [FP_CONTRACT] pragma is 
> implementation-defined.


Nvidia has fairly detailed description of their FP.
http://docs.nvidia.com/cuda/floating-point/index.html#fused-multiply-add-fma

> The fused multiply-add operator on the GPU has high performance and increases 
> the accuracy of computations. **No special flags or function calls are needed 
> to gain this benefit in CUDA programs**. Understand that a hardware fused 
> multiply-add operation is not yet available on the CPU, which can cause 
> differences in numerical results.


At the moment it's the most specific guideline I managed to find regarding 
expected FP behavior applicable to CUDA.


http://reviews.llvm.org/D20341



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-17 Thread Justin Lebar via cfe-commits
jlebar added a comment.

> But people also don't expect IEEE compliance on GPUs


Is that true?  You have a lot more experience with this than I do, but my 
observation of nvidia's hardware is that it's moved to add *more* IEEE 
compliance as it's matured.  For example, older hardware didn't support 
denormals, but newer chips do.  Surely that's in response to some users.

One of our goals with CUDA in clang is to make device code as similar as 
possible to host code.  Throwing out IEEE compliance seems counter to that goal.

I also don't see the bright line here.  Like, if we can FMA to our heart's 
content, where do we draw the line wrt IEEE compliance?  Do we turn on 
flush-denormals-to-zero by default?  Do we use approximate transcendental 
functions instead of the more accurate ones?  Do we assume floating point 
arithmetic is associative?  What is the principle that leads us to do FMAs but 
not these other optimizations?

In addition, CUDA != GPUs.  Maybe this is something to turn on by default for 
NVPTX, although I'm still pretty uncomfortable with that.  Prior art in other 
compilers is interesting, but I think it's notable that clang doesn't do this 
for any other targets (afaict?) despite the fact that gcc does.

The main argument I see for this is "nvcc does it, and people will think clang 
is slow if we don't".  That's maybe not a bad argument, but it makes me sad.  :(


http://reviews.llvm.org/D20341



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-17 Thread Artem Belevich via cfe-commits
tra updated this revision to Diff 57541.
tra added a comment.

Added test case.

Is there a better way to test that correct options are passed to back-end?
This test resorts to checking assembly generated by back-end which is way too 
far away from what actually needs testing.


http://reviews.llvm.org/D20341

Files:
  lib/Frontend/CompilerInvocation.cpp
  test/CodeGenCUDA/fp-contract.cu

Index: test/CodeGenCUDA/fp-contract.cu
===
--- /dev/null
+++ test/CodeGenCUDA/fp-contract.cu
@@ -0,0 +1,32 @@
+// REQUIRES: x86-registered-target
+// REQUIRES: nvptx-registered-target
+
+// By default we should fuse multiply/add into fma instruction.
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -disable-llvm-passes -o - %s | FileCheck -check-prefix ENABLED %s
+
+// Explicit -ffp-contract=fast
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -ffp-contract=fast -disable-llvm-passes -o - %s \
+// RUN:   | FileCheck -check-prefix ENABLED %s
+
+// Explicit -ffp-contract=on -- fusing by front-end (disabled).
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -ffp-contract=on -disable-llvm-passes -o - %s \
+// RUN:   | FileCheck -check-prefix DISABLED %s
+
+// Explicit -ffp-contract=off should disable instruction fusing.
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -ffp-contract=off -disable-llvm-passes -o - %s \
+// RUN:   | FileCheck -check-prefix DISABLED %s
+
+
+#include "Inputs/cuda.h"
+
+__host__ __device__ float func(float a, float b, float c) { return a + b * c; }
+// ENABLED:   fma.rn.f32
+// ENABLED-NEXT:  st.param.f32
+
+// DISABLED:  mul.rn.f32
+// DISABLED-NEXT: add.rn.f32
+// DISABLED-NEXT: st.param.f32
Index: lib/Frontend/CompilerInvocation.cpp
===
--- lib/Frontend/CompilerInvocation.cpp
+++ lib/Frontend/CompilerInvocation.cpp
@@ -2212,10 +2212,15 @@
   LangOpts.ObjCExceptions = 1;
   }
 
-  // During CUDA device-side compilation, the aux triple is the triple used for
-  // host compilation.
-  if (LangOpts.CUDA && LangOpts.CUDAIsDevice) {
-Res.getTargetOpts().HostTriple = Res.getFrontendOpts().AuxTriple;
+  if (LangOpts.CUDA) {
+// During CUDA device-side compilation, the aux triple is the
+// triple used for host compilation.
+if (LangOpts.CUDAIsDevice)
+  Res.getTargetOpts().HostTriple = Res.getFrontendOpts().AuxTriple;
+
+// Set default FP_CONTRACT to FAST.
+if (!Args.hasArg(OPT_ffp_contract))
+  Res.getCodeGenOpts().setFPContractMode(CodeGenOptions::FPC_Fast);
   }
 
   // FIXME: Override value name discarding when asan or msan is used because 
the


Index: test/CodeGenCUDA/fp-contract.cu
===
--- /dev/null
+++ test/CodeGenCUDA/fp-contract.cu
@@ -0,0 +1,32 @@
+// REQUIRES: x86-registered-target
+// REQUIRES: nvptx-registered-target
+
+// By default we should fuse multiply/add into fma instruction.
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -disable-llvm-passes -o - %s | FileCheck -check-prefix ENABLED %s
+
+// Explicit -ffp-contract=fast
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -ffp-contract=fast -disable-llvm-passes -o - %s \
+// RUN:   | FileCheck -check-prefix ENABLED %s
+
+// Explicit -ffp-contract=on -- fusing by front-end (disabled).
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -ffp-contract=on -disable-llvm-passes -o - %s \
+// RUN:   | FileCheck -check-prefix DISABLED %s
+
+// Explicit -ffp-contract=off should disable instruction fusing.
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx-nvidia-cuda -S \
+// RUN:   -ffp-contract=off -disable-llvm-passes -o - %s \
+// RUN:   | FileCheck -check-prefix DISABLED %s
+
+
+#include "Inputs/cuda.h"
+
+__host__ __device__ float func(float a, float b, float c) { return a + b * c; }
+// ENABLED:   fma.rn.f32
+// ENABLED-NEXT:  st.param.f32
+
+// DISABLED:  mul.rn.f32
+// DISABLED-NEXT: add.rn.f32
+// DISABLED-NEXT: st.param.f32
Index: lib/Frontend/CompilerInvocation.cpp
===
--- lib/Frontend/CompilerInvocation.cpp
+++ lib/Frontend/CompilerInvocation.cpp
@@ -2212,10 +2212,15 @@
   LangOpts.ObjCExceptions = 1;
   }
 
-  // During CUDA device-side compilation, the aux triple is the triple used for
-  // host compilation.
-  if (LangOpts.CUDA && LangOpts.CUDAIsDevice) {
-Res.getTargetOpts().HostTriple = Res.getFrontendOpts().AuxTriple;
+  if (LangOpts.CUDA) {
+// During CUDA device-side compilation, the aux triple is the
+// triple used for host compilation.
+if (LangOpts.CUDAIsDevice)
+  Res.getTargetOpts().HostTriple = Res.getFrontendOpts().AuxTriple;
+
+// Set default FP_CONTRACT to FAST.
+if 

Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-17 Thread Artem Belevich via cfe-commits
tra updated this revision to Diff 57540.
tra added a comment.

Changed default to -ffp-contract=fast.


http://reviews.llvm.org/D20341

Files:
  lib/Frontend/CompilerInvocation.cpp

Index: lib/Frontend/CompilerInvocation.cpp
===
--- lib/Frontend/CompilerInvocation.cpp
+++ lib/Frontend/CompilerInvocation.cpp
@@ -2212,10 +2212,15 @@
   LangOpts.ObjCExceptions = 1;
   }
 
-  // During CUDA device-side compilation, the aux triple is the triple used for
-  // host compilation.
-  if (LangOpts.CUDA && LangOpts.CUDAIsDevice) {
-Res.getTargetOpts().HostTriple = Res.getFrontendOpts().AuxTriple;
+  if (LangOpts.CUDA) {
+// During CUDA device-side compilation, the aux triple is the
+// triple used for host compilation.
+if (LangOpts.CUDAIsDevice)
+  Res.getTargetOpts().HostTriple = Res.getFrontendOpts().AuxTriple;
+
+// Set default FP_CONTRACT to FAST.
+if (!Args.hasArg(OPT_ffp_contract))
+  Res.getCodeGenOpts().setFPContractMode(CodeGenOptions::FPC_Fast);
   }
 
   // FIXME: Override value name discarding when asan or msan is used because 
the


Index: lib/Frontend/CompilerInvocation.cpp
===
--- lib/Frontend/CompilerInvocation.cpp
+++ lib/Frontend/CompilerInvocation.cpp
@@ -2212,10 +2212,15 @@
   LangOpts.ObjCExceptions = 1;
   }
 
-  // During CUDA device-side compilation, the aux triple is the triple used for
-  // host compilation.
-  if (LangOpts.CUDA && LangOpts.CUDAIsDevice) {
-Res.getTargetOpts().HostTriple = Res.getFrontendOpts().AuxTriple;
+  if (LangOpts.CUDA) {
+// During CUDA device-side compilation, the aux triple is the
+// triple used for host compilation.
+if (LangOpts.CUDAIsDevice)
+  Res.getTargetOpts().HostTriple = Res.getFrontendOpts().AuxTriple;
+
+// Set default FP_CONTRACT to FAST.
+if (!Args.hasArg(OPT_ffp_contract))
+  Res.getCodeGenOpts().setFPContractMode(CodeGenOptions::FPC_Fast);
   }
 
   // FIXME: Override value name discarding when asan or msan is used because the
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-17 Thread Artem Belevich via cfe-commits
tra added a comment.

OK. Consensus seems to be that -ffp-contract=fast is the way to go. I'll update 
the patch.
I've just checked Steve's example with nvcc and indeed it fused mul+add.


http://reviews.llvm.org/D20341



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-17 Thread Hal Finkel via cfe-commits
hfinkel added a comment.



In http://reviews.llvm.org/D20341#432525, @tra wrote:

> In http://reviews.llvm.org/D20341#432494, @hfinkel wrote:
>
> >
>
>
>
>
> > That having been said, is this change the equivalent of -ffp-contract=fast 
> > or -ffp-contract=on? I think it is the latter and we want the former (i.e. 
> > where we let the backend be as aggressive as possible *after* inlining).
>
>
> It is -ffp-contract=on. As it happens, it appears to produce better code 
> compared to -ffp-contract=fast at least on some benchmarks. Apparently 
> smaller IR (smaller number of intrinsic call instructions vs multiple 
> separate mul+add) makes job easier for straight line strength reduction pass 
> and it's able to remove more redundant calculations in unrolled loops.


That's certainly interesting, and frankly, something I don't immediately 
understand. Given that, at that level, the IR for -ffo-contract=fast is the 
same as -ffp-contract=off, this seems to point to some more-general problem 
that we should likely fix anyway.

I will say that, once templated C++ libraries become involved, the 
per-statement C rules for fusion often don't apply in enough places to be 
useful. You really need to perform the fusion after inlining. Obviously, 
however, for more-directly-programmed expressions, this concern does not apply.


http://reviews.llvm.org/D20341



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-17 Thread Steve Canon via cfe-commits
scanon added a comment.

`-ffp-contract=on` obeys the semantics of C's FP_CONTRACT pragma.  In 
particular, it will not fuse:

  float m = x*y;
  float a = m + z;

Whereas you probably want that to fuse for your purposes.  `-ffp-contract=fast` 
seems more in line with your needs.


http://reviews.llvm.org/D20341



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-17 Thread Artem Belevich via cfe-commits
tra added a comment.

In http://reviews.llvm.org/D20341#432494, @hfinkel wrote:

>




> That having been said, is this change the equivalent of -ffp-contract=fast or 
> -ffp-contract=on? I think it is the latter and we want the former (i.e. where 
> we let the backend be as aggressive as possible *after* inlining).


It is -ffp-contract=on. As it happens, it appears to produce better code 
compared to -ffp-contract=fast at least on some benchmarks. Apparently smaller 
IR (smaller number of intrinsic call instructions vs multiple separate mul+add) 
makes job easier for straight line strength reduction pass and it's able to 
remove more redundant calculations in unrolled loops.


http://reviews.llvm.org/D20341



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-17 Thread Artem Belevich via cfe-commits
tra added a subscriber: scanon.
tra added a comment.

Things are even more interesting. -ffp-contract=fast is *not* what this change 
does. :-)

We have two places where we can fuse FP instructions -- in clang and in LLVM 
back-end.
Clang fuses add+mul into llvm.fmuladd intrinsic if -ffp-contract=on (default) 
and DefaultFPContract=1 (which is only set for OpenCL for some reason) and 
back-end then decides whether it's profitable to emit fused operation or not. 
NVPTX does emit fmad.

Compare this to -ffp-contract=fast which actually *disables* fusing in clang 
and instead allows LLVM backend to do fusing wherever it sees fit (as opposed 
to 'fuse intrinsics only'. It may potentially fuse any suitable multiply/add 
pair, not only those vetted by front-end.

Currently there's no way to enable front-end fusing via command line, unless 
you compile OpenCL source. With this patch in place for CUDA compilation we can 
pick either no fusing, controlled fusing by front-end or more aggressive fusing 
by back-end.

Setting DefaultFPContract=1 for CUDA seems to be the least evil -- it's 
somewhat controlled in scope and gives us a way to disable fusing completely or 
make it more aggressive if it's needed.

Perhaps @scanon and @hfinkel can weigh in.


http://reviews.llvm.org/D20341



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-17 Thread Hal Finkel via cfe-commits
hfinkel added a subscriber: hfinkel.
hfinkel added a comment.

In http://reviews.llvm.org/D20341#432461, @jlebar wrote:

> I am not sure we want this?  Although it matches nvcc, it does not match our 
> floating-point behavior for C++ in general -- it makes us non-IEEE-whatever 
> compliant by default.
>
> Although I agree that if we don't do this, lots of people are not going to 
> pass -fp-contract=fast and resultantly will think that we're slower than 
> nvcc.  There's no way to win.  :(


But people also don't expect IEEE compliance on GPUs, and also, the system 
default for forming FMAs has long been system specific. The default on IBM 
systems, for example, is generally the equivalent of -ffp-contract=fast (in 
both XLC and GCC).

That having been said, is this change the equivalent of -ffp-contract=fast or 
-ffp-contract=on? I think it is the latter and we want the former (i.e. where 
we let the backend be as aggressive as possible *after* inlining).


http://reviews.llvm.org/D20341



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

2016-05-17 Thread Justin Lebar via cfe-commits
jlebar added a comment.

I am not sure we want this?  Although it matches nvcc, it does not match our 
floating-point behavior for C++ in general -- it makes us non-IEEE-whatever 
compliant by default.

Although I agree that if we don't do this, lots of people are not going to pass 
-fp-contract=fast and resultantly will think that we're slower than nvcc.  
There's no way to win.  :(


http://reviews.llvm.org/D20341



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits