[PATCH] D141375: [SYCL][OpenMP] Fix compilation errors for unsupported __bf16 intrinsics

2023-09-07 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/lib/Sema/Sema.cpp:1978-1979
  !Context.getTargetInfo().hasInt128Type()) ||
+(Ty->isBFloat16Type() && !Context.getTargetInfo().hasBFloat16Type() &&
+ !LangOpts.CUDAIsDevice) ||
 LongDoubleMismatched) {

eandrews wrote:
> tra wrote:
> > @eandrews Do you recall what was the reason for *not* issuing the 
> > diagnostic on the GPU side?
> > 
> > It appears to do the opposite to what the patch description says. We're 
> > supposed to  `emit error for unsupported type __bf16 in device code`, but 
> > instead we specifically ignore it during GPU-side compilation. What am I 
> > missing?
> > 
> > 
> > 
> I don't recall the specifics, but I think CUDA had code handling __bf16 
> differently, and this change broke a test with CUDA diagnostics, so I 
> excluded it from the patch. I could try removing this check and seeing what 
> breaks, if you'd like. 
It may have been around the time when x86 started exposing bf16 type in the 
host headers, but NVPTX didn't have any support for the type yet.

This change may have just papered over the problem. Oh, well. That would be 
just one of the places where we currently don't handle the 'unusual' types 
across the host/GPU boundary. I'm attempting to clean it up, and will take care 
of this instance there.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D141375/new/

https://reviews.llvm.org/D141375

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D141375: [SYCL][OpenMP] Fix compilation errors for unsupported __bf16 intrinsics

2023-09-07 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.
Herald added a subscriber: jplehr.



Comment at: clang/lib/Sema/Sema.cpp:1978-1979
  !Context.getTargetInfo().hasInt128Type()) ||
+(Ty->isBFloat16Type() && !Context.getTargetInfo().hasBFloat16Type() &&
+ !LangOpts.CUDAIsDevice) ||
 LongDoubleMismatched) {

@eandrews Do you recall what was the reason for *not* issuing the diagnostic on 
the GPU side?

It appears to do the opposite to what the patch description says. We're 
supposed to  `emit error for unsupported type __bf16 in device code`, but 
instead we specifically ignore it during GPU-side compilation. What am I 
missing?





Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D141375/new/

https://reviews.llvm.org/D141375

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D158778: [CUDA] Propagate __float128 support from the host.

2023-08-29 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In D158778#4626181, @jhuber6 wrote:

> Just doing a simple example here https://godbolt.org/z/Y3E58PKMz shows that 
> for NVPTX we error out (as I would expect) but for AMDGPU we emit an x86 
> 80-bit double.

With this patch NVPTX will behave the same as AMDGPU and we'll no longer error 
out.

I think I may need to explicitly add a diagnostic for the case where the host's 
idea of long double and __float128 does not match that of the target. It would 
have to be specific to OpenMP, as CUDA does expect this discrepancy for 
historical reasons.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D158778/new/

https://reviews.llvm.org/D158778

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D158778: [CUDA] Propagate __float128 support from the host.

2023-08-29 Thread Artem Belevich via Phabricator via cfe-commits
tra added a subscriber: jhuber6.
tra added a comment.

In D158778#4624408, @ABataev wrote:

> Just checks removal should be fine

Looks like OpenMP handles long double and __float128 differently -- it always 
insists on using the host's FP format for both.
https://github.com/llvm/llvm-project/blob/d037445f3a2c6dc1842b5bfc1d5d81988c2f223d/clang/lib/AST/ASTContext.cpp#L1674

This creates a divergence between what clang thinks and what LLVM can handle.
I'm not quite sure how it's supposed to work with NVPTX or AMDGPU, where we 
demote those types to double and can't generate code for the actual types.

@jhuber6 what does OpenMP expect to happen for those types on the GPU side?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D158778/new/

https://reviews.llvm.org/D158778

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D158778: [CUDA] Propagate __float128 support from the host.

2023-08-28 Thread Artem Belevich via Phabricator via cfe-commits
tra added a subscriber: ABataev.
tra added a comment.

@ABataev

This patch breaks two tests:

- github.com/llvm/llvm-project/blob/main/clang/test/OpenMP/nvptx_unsupported_type_codegen.cpp
- github.com/llvm/llvm-project/blob/main/clang/test/OpenMP/nvptx_unsupported_type_messages.cpp

It's not clear what exactly these tests are testing for and I can't tell 
whether I should just remove the checks related to `__float128`, or if there's 
something else that would need to be done on the OpenMP side.

AFAICT, OpenMP will pick up the `double` format for `__float128` after my 
patch. That would leave `long double` as the only unsupported type on 
GPU-supporting targets, which suggests that I should just remove the checks 
related to `__float128` from those tests.

Am I missing something? Is there anything else that may need to be done on the 
OpenMP side?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D158778/new/

https://reviews.llvm.org/D158778

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D158778: [CUDA] Propagate __float128 support from the host.

2023-08-24 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Also, https://github.com/llvm/llvm-project/issues/46903


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D158778/new/

https://reviews.llvm.org/D158778

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D158778: [CUDA] Propagate __float128 support from the host.

2023-08-24 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
Herald added subscribers: mattd, gchakrabarti, asavonic, kerbowa, bixia, tpr, 
yaxunl, jvesely.
Herald added a project: All.
tra edited the summary of this revision.
tra published this revision for review.
tra added reviewers: jlebar, yaxunl.
tra added a comment.
Herald added subscribers: cfe-commits, jholewinski.
Herald added a project: clang.

For some context about why it's needed see 
https://github.com/compiler-explorer/compiler-explorer/pull/5373#issuecomment-1687127788
The short version is that currently CUDA compilation is broken w/ clang with 
unpatched libstdc++. Ubuntu and Debian patch libstdc++ to avoid the problem, 
but this should be handled by clang.


GPUs do not have actual FP128 support, but we do need to be able to compile
host-side headers which use `__float128`. On the GPU side we'll downgrade
`__float128` to `double`, similarly to how we handle `long double`. Both types
will have a different in-memory representation than their host counterparts and
are not expected to be interchangeable across the host/device boundary.

Also see https://reviews.llvm.org/D78513, which applied an equivalent change to
HIP/AMDGPU.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D158778

Files:
  clang/lib/Basic/Targets/NVPTX.cpp
  clang/test/SemaCUDA/amdgpu-f128.cu
  clang/test/SemaCUDA/f128.cu


Index: clang/test/SemaCUDA/f128.cu
===
--- clang/test/SemaCUDA/f128.cu
+++ clang/test/SemaCUDA/f128.cu
@@ -1,4 +1,5 @@
 // RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -aux-triple x86_64-unknown-linux-gnu -fcuda-is-device -fsyntax-only -verify %s
+// RUN: %clang_cc1 -triple nvptx64 -aux-triple x86_64-unknown-linux-gnu -fcuda-is-device -fsyntax-only -verify %s
 
 // expected-no-diagnostics
 typedef __float128 f128_t;
Index: clang/lib/Basic/Targets/NVPTX.cpp
===
--- clang/lib/Basic/Targets/NVPTX.cpp
+++ clang/lib/Basic/Targets/NVPTX.cpp
@@ -142,6 +142,20 @@
   // we need all classes to be defined on both the host and device.
   MaxAtomicInlineWidth = HostTarget->getMaxAtomicInlineWidth();
 
+  // For certain builtin types supported on the host target, claim that they
+  // are supported too, so that compilation of host code succeeds during
+  // device-side compilation.
+  //
+  // FIXME: As a side effect, we also accept `__float128` uses in the device
+  // code, but use 'double' as the underlying type, so the host/device
+  // representation of the type is different. This is similar to what happens
+  // to long double.
+
+  if (HostTarget->hasFloat128Type()) {
+    HasFloat128 = true;
+    Float128Format = DoubleFormat;
+  }
+
   // Properties intentionally not copied from host:
   // - LargeArrayMinWidth, LargeArrayAlign: Not visible across the
   //   host/device boundary.

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D157750: Properly handle -fsplit-machine-functions for fatbinary compilation

2023-08-21 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/test/Driver/fsplit-machine-functions-with-cuda-nvptx.c:9
+
+// Check that -fsplit-machine-functions is passed to both x86 and cuda 
compilation and does not cause driver error.
+// MFS2: -fsplit-machine-functions

MaskRay wrote:
> shenhan wrote:
> > MaskRay wrote:
> > > MaskRay wrote:
> > > > tra wrote:
> > > > > shenhan wrote:
> > > > > > tra wrote:
> > > > > > > shenhan wrote:
> > > > > > > > tra wrote:
> > > > > > > > > We will still see a warning, right? So, for someone compiling 
> > > > > > > > > with `-Werror` that's going to be a problem.
> > > > > > > > > 
> > > > > > > > > Also, if the warning is issued from the top-level driver, we 
> > > > > > > > > may not even be able to suppress it when we disable splitting 
> > > > > > > > > on GPU side with `-Xarch_device -fno-split-machine-functions`.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > We will still see a warning, right?
> > > > > > > > Yes, there still will be a warning. We've discussed it and we 
> > > > > > > > think that pass -fsplit-machine-functions in this case is not a 
> > > > > > > > proper usage and a warning is warranted, and it is not good 
> > > > > > > > that skip doing split silently while uses explicitly ask for it.
> > > > > > > > 
> > > > > > > > > Also, if the warning is issued from the top-level driver
> > > > > > > > The warning will not be issued from the top-level driver, it 
> > > > > > > > will be issued when configuring optimization passes.
> > > > > > > > So:
> > > > > > > > 
> > > > > > > > 
> > > > > > > >   - -fsplit-machine-functions -Xarch_device 
> > > > > > > > -fno-split-machine-functions
> > > > > > > > Will enable MFS for host, disable MFS for gpus and without any 
> > > > > > > > warnings.
> > > > > > > > 
> > > > > > > >   - -Xarch_host -fsplit-machine-functions
> > > > > > > > The same as the above
> > > > > > > > 
> > > > > > > >   - -Xarch_host -fsplit-machine-functions -Xarch_device 
> > > > > > > > -fno-split-machine-functions
> > > > > > > > The same as the above
> > > > > > > > 
> > > > > > > > We've discussed it, and we think that passing 
> > > > > > > > -fsplit-machine-functions in this case is not proper usage, so 
> > > > > > > > a warning is warranted; it is not good to silently skip 
> > > > > > > > splitting while users explicitly ask for it.
> > > > > > > 
> > > > > > > I would agree with that assertion if we were talking exclusively 
> > > > > > > about CUDA compilation.
> > > > > > > However, a common real world use pattern is that the flags are 
> > > > > > > set globally for all C++ compilations, and then CUDA compilations 
> > > > > > > within the project need to do whatever they need to to keep 
> > > > > > > things working. The original user intent was for the option to 
> > > > > > > affect the host compilation. There's no inherent assumption that 
> > > > > > > it will do anything useful for the GPU.
> > > > > > > 
> > > > > > > In a number of similar cases in the past we did settle on 
> > > > > > > silently ignoring some top-level flags that we do expect to 
> > > > > > > encounter in real projects, but which made no sense for the GPU. 
> > > > > > > E.g. sanitizers. If the project is built w/ sanitizers enabled, 
> > > > > > > the idea is to sanitize the host code; the GPU code continues to 
> > > > > > > be built w/o sanitizers enabled. 
> > > > > > > 
> > > > > > > Anyways, as long as we have a way to deal with it it's not a big 
> > > > > > > deal one way or another.
> > > > > > > 
> > > > > > > > -fsplit-machine-functions -Xarch_device 
> > > > > > > > -fno-split-machine-functions
> > > > > > > > Will enable MFS for host, disable MFS for gpus and without any 
> > > > > > > > warnings.
> > > > > > > 
> > > > > > > OK. This will work.
> > > > > > > 
> > > > > > > 
> > > > > > > In a number of similar cases in the past we did settle on 
> > > > > > > silently ignoring some top-level flags that we do expect to 
> > > > > > > encounter in real projects, but which made no sense for the GPU. 
> > > > > > > E.g. sanitizers. If the project is built w/ sanitizers enabled, 
> > > > > > > the idea is to sanitize the host code; the GPU code continues to 
> > > > > > > be built w/o sanitizers enabled.
> > > > > > 
> > > > > > Can I understand it this way - if the compiler is **only** building 
> > > > > > for CPUs, then silently ignoring any optimization flags is not good 
> > > > > > behavior. If the compiler is building for both CPUs and GPUs, it is 
> > > > > > still not good behavior to silently ignore optimization flags for 
> > > > > > CPUs, but it is probably OK to silently ignore optimization flags 
> > > > > > for GPUs.
> > > > > > 
> > > > > > > OK. This will work.
> > > > > > Thanks for confirming.
> > > > > >  it is probably ok to silently ignore optimization flags for GPUs.
> > > > > 
> > > > > In this case, yes. 
> > > > > 
> > > > > I think the most consistent way to handle the situation is to 

[PATCH] D158238: Implement __builtin_fmaximum/fminimum*

2023-08-18 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 551646.
tra added a comment.

Fixed test RUN lines


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D158238/new/

https://reviews.llvm.org/D158238

Files:
  clang/include/clang/Basic/Builtins.def
  clang/lib/AST/ExprConstant.cpp
  clang/lib/AST/Interp/InterpBuiltin.cpp
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/test/CodeGen/builtins.c
  clang/test/CodeGen/math-builtins.c
  clang/test/Sema/constant-builtins-fmaximum.cpp
  clang/test/Sema/constant-builtins-fminimum.cpp
  clang/test/Sema/overloaded-math-builtins.c

Index: clang/test/Sema/overloaded-math-builtins.c
===
--- clang/test/Sema/overloaded-math-builtins.c
+++ clang/test/Sema/overloaded-math-builtins.c
@@ -19,3 +19,21 @@
   int *r6 = __builtin_fminf(f, v);
   // expected-error@-1 {{passing 'float4' (vector of 4 'float' values) to parameter of incompatible type 'float'}}
 }
+
+float test_fminimumf(float f, int i, int *ptr, float4 v) {
+  float r1 = __builtin_fminimumf(f, ptr);
+  // expected-error@-1 {{passing 'int *' to parameter of incompatible type 'float'}}
+  float r2 = __builtin_fminimumf(ptr, f);
+  // expected-error@-1 {{passing 'int *' to parameter of incompatible type 'float'}}
+  float r3 = __builtin_fminimumf(v, f);
+  // expected-error@-1 {{passing 'float4' (vector of 4 'float' values) to parameter of incompatible type 'float'}}
+  float r4 = __builtin_fminimumf(f, v);
+  // expected-error@-1 {{passing 'float4' (vector of 4 'float' values) to parameter of incompatible type 'float'}}
+
+
+  int *r5 = __builtin_fminimumf(f, f);
+  // expected-error@-1 {{initializing 'int *' with an expression of incompatible type 'float'}}
+
+  int *r6 = __builtin_fminimumf(f, v);
+  // expected-error@-1 {{passing 'float4' (vector of 4 'float' values) to parameter of incompatible type 'float'}}
+}
Index: clang/test/Sema/constant-builtins-fminimum.cpp
===
--- /dev/null
+++ clang/test/Sema/constant-builtins-fminimum.cpp
@@ -0,0 +1,56 @@
+// RUN: %clang_cc1 -std=c++17 -fsyntax-only -verify %s
+// RUN: %clang_cc1 -std=c++17 -fsyntax-only -verify -fexperimental-new-constant-interpreter %s
+// expected-no-diagnostics
+
+constexpr double NaN = __builtin_nan("");
+constexpr double Inf = __builtin_inf();
+constexpr double NegInf = -__builtin_inf();
+
+#define FMIN_TEST_SIMPLE(T, FUNC)  \
+  static_assert(T(1.2345) == FUNC(T(1.2345), T(6.7890)));  \
+  static_assert(T(1.2345) == FUNC(T(6.7890), T(1.2345)));
+
+#define FMIN_TEST_NAN(T, FUNC) \
+  static_assert(__builtin_isnan(FUNC(NaN, Inf)));  \
+  static_assert(__builtin_isnan(FUNC(NegInf, NaN)));   \
+  static_assert(__builtin_isnan(FUNC(NaN, 0.0)));  \
+  static_assert(__builtin_isnan(FUNC(-0.0, NaN))); \
+  static_assert(__builtin_isnan(FUNC(NaN, T(-1.2345))));   \
+  static_assert(__builtin_isnan(FUNC(T(1.2345), NaN)));\
+  static_assert(__builtin_isnan(FUNC(NaN, NaN)));
+
+#define FMIN_TEST_INF(T, FUNC) \
+  static_assert(NegInf == FUNC(NegInf, Inf));  \
+  static_assert(0.0 == FUNC(Inf, 0.0));\
+  static_assert(-0.0 == FUNC(-0.0, Inf));  \
+  static_assert(T(1.2345) == FUNC(Inf, T(1.2345)));\
+  static_assert(T(-1.2345) == FUNC(T(-1.2345), Inf));
+
+#define FMIN_TEST_NEG_INF(T, FUNC) \
+  static_assert(NegInf == FUNC(Inf, NegInf));  \
+  static_assert(NegInf == FUNC(NegInf, 0.0));  \
+  static_assert(NegInf == FUNC(-0.0, NegInf)); \
+  static_assert(NegInf == FUNC(NegInf, T(-1.2345)));   \
+  static_assert(NegInf == FUNC(T(1.2345), NegInf));
+
+#define FMIN_TEST_BOTH_ZERO(T, FUNC)   \
+  static_assert(__builtin_copysign(1.0, FUNC(0.0, 0.0)) == 1.0);   \
+  static_assert(__builtin_copysign(1.0, FUNC(-0.0, 0.0)) == -1.0); \
+  static_assert(__builtin_copysign(1.0, FUNC(0.0, -0.0)) == -1.0); \
+  static_assert(__builtin_copysign(1.0, FUNC(-0.0, -0.0)) == -1.0);
+
+#define LIST_FMIN_TESTS(T, FUNC)   \
+  FMIN_TEST_SIMPLE(T, FUNC)\
+  FMIN_TEST_NAN(T, FUNC)   \
+  FMIN_TEST_INF(T, FUNC)   \
+  FMIN_TEST_NEG_INF(T, FUNC)

[PATCH] D158238: Implement __builtin_fmaximum/fminimum*

2023-08-18 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

@fhahn who else should take a look at the patch?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D158238/new/

https://reviews.llvm.org/D158238

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D158238: Implement __builtin_fmaximum/fminimum*

2023-08-18 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
Herald added a subscriber: bixia.
Herald added a project: All.
tra updated this revision to Diff 551336.
tra added a comment.
tra updated this revision to Diff 551338.
tra published this revision for review.
tra added a reviewer: fhahn.
Herald added subscribers: cfe-commits, StephenFan.
Herald added a project: clang.

formatting cleanup


tra added a comment.

clang-formatted new tests.


The builtins provide FP min/max conforming to IEEE754-2018 standard.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D158238

Files:
  clang/include/clang/Basic/Builtins.def
  clang/lib/AST/ExprConstant.cpp
  clang/lib/AST/Interp/InterpBuiltin.cpp
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/test/CodeGen/builtins.c
  clang/test/CodeGen/math-builtins.c
  clang/test/Sema/constant-builtins-fmaximum.cpp
  clang/test/Sema/constant-builtins-fminimum.cpp
  clang/test/Sema/overloaded-math-builtins.c

Index: clang/test/Sema/overloaded-math-builtins.c
===
--- clang/test/Sema/overloaded-math-builtins.c
+++ clang/test/Sema/overloaded-math-builtins.c
@@ -19,3 +19,21 @@
   int *r6 = __builtin_fminf(f, v);
   // expected-error@-1 {{passing 'float4' (vector of 4 'float' values) to parameter of incompatible type 'float'}}
 }
+
+float test_fminimumf(float f, int i, int *ptr, float4 v) {
+  float r1 = __builtin_fminimumf(f, ptr);
+  // expected-error@-1 {{passing 'int *' to parameter of incompatible type 'float'}}
+  float r2 = __builtin_fminimumf(ptr, f);
+  // expected-error@-1 {{passing 'int *' to parameter of incompatible type 'float'}}
+  float r3 = __builtin_fminimumf(v, f);
+  // expected-error@-1 {{passing 'float4' (vector of 4 'float' values) to parameter of incompatible type 'float'}}
+  float r4 = __builtin_fminimumf(f, v);
+  // expected-error@-1 {{passing 'float4' (vector of 4 'float' values) to parameter of incompatible type 'float'}}
+
+
+  int *r5 = __builtin_fminimumf(f, f);
+  // expected-error@-1 {{initializing 'int *' with an expression of incompatible type 'float'}}
+
+  int *r6 = __builtin_fminimumf(f, v);
+  // expected-error@-1 {{passing 'float4' (vector of 4 'float' values) to parameter of incompatible type 'float'}}
+}
Index: clang/test/Sema/constant-builtins-fminimum.cpp
===
--- /dev/null
+++ clang/test/Sema/constant-builtins-fminimum.cpp
@@ -0,0 +1,56 @@
+// RUN: %clang_cc1 -std=c++17 -fsyntax-only -verify %s
+// RUN: %clang_cc1 -std=c++17 -fsyntax-only -verify -fexperimental-new-constant-interpreter %s
+// expected-no-diagnostics
+
+constexpr double NaN = __builtin_nan("");
+constexpr double Inf = __builtin_inf();
+constexpr double NegInf = -__builtin_inf();
+
+#define FMIN_TEST_SIMPLE(T, FUNC)  \
+  static_assert(T(1.2345) == FUNC(T(1.2345), T(6.7890)));  \
+  static_assert(T(1.2345) == FUNC(T(6.7890), T(1.2345)));
+
+#define FMIN_TEST_NAN(T, FUNC) \
+  static_assert(__builtin_isnan(FUNC(NaN, Inf)));  \
+  static_assert(__builtin_isnan(FUNC(NegInf, NaN)));   \
+  static_assert(__builtin_isnan(FUNC(NaN, 0.0)));  \
+  static_assert(__builtin_isnan(FUNC(-0.0, NaN))); \
+  static_assert(__builtin_isnan(FUNC(NaN, T(-1.2345))));   \
+  static_assert(__builtin_isnan(FUNC(T(1.2345), NaN)));\
+  static_assert(__builtin_isnan(FUNC(NaN, NaN)));
+
+#define FMIN_TEST_INF(T, FUNC) \
+  static_assert(NegInf == FUNC(NegInf, Inf));  \
+  static_assert(0.0 == FUNC(Inf, 0.0));\
+  static_assert(-0.0 == FUNC(-0.0, Inf));  \
+  static_assert(T(1.2345) == FUNC(Inf, T(1.2345)));\
+  static_assert(T(-1.2345) == FUNC(T(-1.2345), Inf));
+
+#define FMIN_TEST_NEG_INF(T, FUNC) \
+  static_assert(NegInf == FUNC(Inf, NegInf));  \
+  static_assert(NegInf == FUNC(NegInf, 0.0));  \
+  static_assert(NegInf == FUNC(-0.0, NegInf)); \
+  static_assert(NegInf == FUNC(NegInf, T(-1.2345)));   \
+  static_assert(NegInf == FUNC(T(1.2345), NegInf));
+
+#define FMIN_TEST_BOTH_ZERO(T, FUNC)   \
+  static_assert(__builtin_copysign(1.0, FUNC(0.0, 0.0)) == 1.0);   \
+  static_assert(__builtin_copysign(1.0, FUNC(-0.0, 0.0)) == -1.0); \
+  static_assert(__builtin_copysign(1.0, FUNC(0.0, -0.0)) == -1.0); \
+  static_assert(__builtin_copysign(1.0, FUNC(-0.0, -0.0)) == -1.0);
+
+#define LIST_FMIN_TESTS(T, FUNC)

[PATCH] D158226: [CUDA/NVPTX] Improve handling of memcpy for -Os compilations.

2023-08-18 Thread Artem Belevich via Phabricator via cfe-commits
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
Closed by commit rG72757343fa86: [CUDA/NVPTX] Improve handling of memcpy for 
-Os compilations. (authored by tra).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D158226/new/

https://reviews.llvm.org/D158226

Files:
  clang/test/CodeGenCUDA/memcpy-libcall.cu
  llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp


Index: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
===
--- llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -386,9 +386,9 @@
   // always lower memset, memcpy, and memmove intrinsics to load/store
   // instructions, rather
   // then generating calls to memset, mempcy or memmove.
-  MaxStoresPerMemset = (unsigned) 0xFFFFFFFF;
-  MaxStoresPerMemcpy = (unsigned) 0xFFFFFFFF;
-  MaxStoresPerMemmove = (unsigned) 0xFFFFFFFF;
+  MaxStoresPerMemset = MaxStoresPerMemsetOptSize = (unsigned)0xFFFFFFFF;
+  MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = (unsigned) 0xFFFFFFFF;
+  MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = (unsigned) 0xFFFFFFFF;
 
   setBooleanContents(ZeroOrNegativeOneBooleanContent);
   setBooleanVectorContents(ZeroOrNegativeOneBooleanContent);
Index: clang/test/CodeGenCUDA/memcpy-libcall.cu
===
--- /dev/null
+++ clang/test/CodeGenCUDA/memcpy-libcall.cu
@@ -0,0 +1,61 @@
+// RUN: %clang_cc1 -x cuda -triple nvptx64-nvidia-cuda- -fcuda-is-device \
+// RUN: -O3 -S %s -o - | FileCheck -check-prefix=PTX %s
+// RUN: %clang_cc1 -x cuda -triple nvptx64-nvidia-cuda- -fcuda-is-device \
+// RUN: -Os -S %s -o - | FileCheck -check-prefix=PTX %s
+#include "Inputs/cuda.h"
+
+// PTX-LABEL: .func _Z12copy_genericPvPKv(
+void __device__ copy_generic(void *dest, const void *src) {
+  __builtin_memcpy(dest, src, 32);
+// PTX:ld.u8
+// PTX:st.u8
+}
+
+// PTX-LABEL: .entry _Z11copy_globalPvS_(
+void __global__ copy_global(void *dest, void * src) {
+  __builtin_memcpy(dest, src, 32);
+// PTX:ld.global.u8
+// PTX:st.global.u8
+}
+
+struct S {
+  int data[8];
+};
+
+// PTX-LABEL: .entry _Z20copy_param_to_globalP1SS_(
+void __global__ copy_param_to_global(S *global, S param) {
+  __builtin_memcpy(global, &param, sizeof(S));
+// PTX:ld.param.u32
+// PTX:st.global.u32
+}
+
+// PTX-LABEL: .entry _Z19copy_param_to_localPU3AS51SS_(
+void __global__ copy_param_to_local(__attribute__((address_space(5))) S *local,
+S param) {
+  __builtin_memcpy(local, &param, sizeof(S));
+// PTX:ld.param.u32
+// PTX:st.local.u32
+}
+
+// PTX-LABEL: .func _Z21copy_local_to_genericP1SPU3AS5S_(
+void __device__ copy_local_to_generic(S *generic,
+ __attribute__((address_space(5))) S *src) {
+  __builtin_memcpy(generic, src, sizeof(S));
+// PTX:ld.local.u32
+// PTX:st.u32
+}
+
+__shared__ S shared;
+
+// PTX-LABEL: .entry _Z20copy_param_to_shared1S(
+void __global__ copy_param_to_shared(S param) {
+  __builtin_memcpy(&shared, &param, sizeof(S));
+// PTX:ld.param.u32
+// PTX:st.shared.u32
+}
+
+void __device__ copy_shared_to_generic(S *generic) {
+  __builtin_memcpy(generic, &shared, sizeof(S));
+// PTX:ld.shared.u32
+// PTX:st.u32
+}


[PATCH] D158247: [CUDA][HIP] Fix overloading resolution in global variable initializer

2023-08-18 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Same reproducer but for CUDA: https://godbolt.org/z/WhjTMffnx




Comment at: clang/include/clang/Sema/Sema.h:4753
+  /// Otherwise, use \p D to determine the host/device target.
  bool CheckCallingConvAttr(const ParsedAttr &attr, CallingConv &CC,
+                            const FunctionDecl *FD = nullptr,

It appears that `Declarator D` here is only used as an attribute carrier to 
identify the CUDA calling target.
Should we pass `CudaTarget ContextTarget` instead and let the caller figure out 
how to find it?

I'm just thinking that we're hardcoding just one specific way to find the 
target, while there may potentially be more.
The current way is OK, as we have just one use case at the moment.





Comment at: clang/lib/Sema/SemaCUDA.cpp:137
+  // Code that lives outside a function gets the target from CurCUDATargetCtx.
+  if (D == nullptr) {
+return CurCUDATargetCtx.Target;

Style nit: no braces around single-statement body.



Comment at: clang/test/CodeGenCUDA/global-initializers.cu:11-12
+// Check host/device-based overloading resolution in a global variable initializer.
+template <typename T, typename U>
+T pow(T, U) { return 1.0; }
+

We don't really need templates to reproduce the issue. We just need a host 
function with lower overloading priority. A function requiring type conversion 
or with an additional default argument should do. E.g. `float pow(float, int);` 
or `double pow(double, int, bool lower_priority_host_overload = 1);`

Removing template should unclutter the tests a bit.



CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D158247/new/

https://reviews.llvm.org/D158247

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D158226: [CUDA/NVPTX] Improve handling of memcpy for -Os compilations.

2023-08-17 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
Herald added subscribers: mattd, gchakrabarti, asavonic, bixia, hiraditya, 
yaxunl.
Herald added a project: All.
tra published this revision for review.
tra added a reviewer: alexfh.
Herald added subscribers: llvm-commits, cfe-commits, wangpc, jholewinski.
Herald added projects: clang, LLVM.

We had some instances where LLVM would not inline a fixed-count memcpy and ended
up attempting to lower it as a libcall, which does not work on NVPTX as there is
no standard library to call.

The patch relaxes the threshold used for -Os compilation so we're always allowed
to inline memory copy functions.
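As a host-side illustration (not the NVPTX lowering itself), the pattern at issue is a copy whose byte count is a compile-time constant; with the relaxed threshold the backend can expand such calls into plain loads and stores instead of emitting a libcall:

```c
#include <assert.h>
#include <string.h>

/* A fixed-count copy like the ones in the new test: the size (32 bytes)
 * is known at compile time, so the backend may expand the copy into
 * load/store instructions instead of calling memcpy() -- the only
 * viable option on NVPTX, which has no C library to link against. */
void copy32(void *dest, const void *src) {
  __builtin_memcpy(dest, src, 32);
}
```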


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D158226

Files:
  clang/test/CodeGenCUDA/memcpy-libcall.cu
  llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp


Index: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
===
--- llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -386,9 +386,9 @@
   // always lower memset, memcpy, and memmove intrinsics to load/store
   // instructions, rather
   // then generating calls to memset, mempcy or memmove.
-  MaxStoresPerMemset = (unsigned) 0xFFFFFFFF;
-  MaxStoresPerMemcpy = (unsigned) 0xFFFFFFFF;
-  MaxStoresPerMemmove = (unsigned) 0xFFFFFFFF;
+  MaxStoresPerMemset = MaxStoresPerMemsetOptSize = (unsigned)0xFFFFFFFF;
+  MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = (unsigned) 0xFFFFFFFF;
+  MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = (unsigned) 0xFFFFFFFF;
 
   setBooleanContents(ZeroOrNegativeOneBooleanContent);
   setBooleanVectorContents(ZeroOrNegativeOneBooleanContent);
Index: clang/test/CodeGenCUDA/memcpy-libcall.cu
===
--- /dev/null
+++ clang/test/CodeGenCUDA/memcpy-libcall.cu
@@ -0,0 +1,61 @@
+// RUN: %clang_cc1 -x cuda -triple nvptx64-nvidia-cuda- -fcuda-is-device \
+// RUN: -O3 -S %s -o - | FileCheck -check-prefix=PTX %s
+// RUN: %clang_cc1 -x cuda -triple nvptx64-nvidia-cuda- -fcuda-is-device \
+// RUN: -Os -S %s -o - | FileCheck -check-prefix=PTX %s
+#include "Inputs/cuda.h"
+
+// PTX-LABEL: .func _Z12copy_genericPvPKv(
+void __device__ copy_generic(void *dest, const void *src) {
+  __builtin_memcpy(dest, src, 32);
+// PTX:ld.u8
+// PTX:st.u8
+}
+
+// PTX-LABEL: .entry _Z11copy_globalPvS_(
+void __global__ copy_global(void *dest, void * src) {
+  __builtin_memcpy(dest, src, 32);
+// PTX:ld.global.u8
+// PTX:st.global.u8
+}
+
+struct S {
+  int data[8];
+};
+
+// PTX-LABEL: .entry _Z20copy_param_to_globalP1SS_(
+void __global__ copy_param_to_global(S *global, S param) {
+  __builtin_memcpy(global, &param, sizeof(S));
+// PTX:ld.param.u32
+// PTX:st.global.u32
+}
+
+// PTX-LABEL: .entry _Z19copy_param_to_localPU3AS51SS_(
+void __global__ copy_param_to_local(__attribute__((address_space(5))) S *local,
+                                    S param) {
+  __builtin_memcpy(local, &param, sizeof(S));
+// PTX:ld.param.u32
+// PTX:st.local.u32
+}
+
+// PTX-LABEL: .func _Z21copy_local_to_genericP1SPU3AS5S_(
+void __device__ copy_local_to_generic(S *generic,
+                                      __attribute__((address_space(5))) S *src) {
+  __builtin_memcpy(generic, src, sizeof(S));
+// PTX:ld.local.u32
+// PTX:st.u32
+}
+
+__shared__ S shared;
+
+// PTX-LABEL: .entry _Z20copy_param_to_shared1S(
+void __global__ copy_param_to_shared( S param) {
+  __builtin_memcpy(&shared, &param, sizeof(S));
+// PTX:ld.param.u32
+// PTX:st.shared.u32
+}
+
+void __device__ copy_shared_to_generic(S *generic) {
+  __builtin_memcpy(generic, &shared, sizeof(S));
+// PTX:ld.shared.u32
+// PTX:st.u32
+}



[PATCH] D157750: Properly handle -fsplit-machine-functions for fatbinary compilation

2023-08-17 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/test/Driver/fsplit-machine-functions-with-cuda-nvptx.c:16
+// causes a warning.
+// RUN:   %clang --target=x86_64-unknown-linux-gnu -nogpulib -nogpuinc \
+// RUN: --cuda-gpu-arch=sm_70 -x cuda -fsplit-machine-functions -S %s 2>&1 \

Hahnfeld wrote:
> steelannelida wrote:
> > Unfortunately these commands fail in our sandbox due to writing files to 
> > readonly directories:
> > 
> >  `unable to open output file 'fsplit-machine-functions-with-cuda-nvptx.s': 
> > 'Permission denied'`
> > 
> > Could you please specify the output files via `%t` substitutions? I'm not 
> > sure how to do this for cuda compilation.
> IIRC the file names are generated based on what you specify with `-o`. Did 
> you try this already?
The problem is that in this case we didn't pass any -o at all here, so the 
compiler tries to write into the current directory.

We need `-o %t.s` or `-o /dev/null` here.
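A possible shape for the fixed RUN lines (assuming the test keeps the same flags and the `MFS2` prefix it already uses); `%t` makes lit substitute a per-test temporary path, so the output lands in a writable directory:

```
// RUN: %clang --target=x86_64-unknown-linux-gnu -nogpulib -nogpuinc \
// RUN:   --cuda-gpu-arch=sm_70 -x cuda -fsplit-machine-functions -S %s \
// RUN:   -o %t.s 2>&1 | FileCheck %s --check-prefix=MFS2
```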


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D157750/new/

https://reviews.llvm.org/D157750



[PATCH] D157750: Properly handle -fsplit-machine-functions for fatbinary compilation

2023-08-14 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/test/Driver/fsplit-machine-functions-with-cuda-nvptx.c:9
+
+// Check that -fsplit-machine-functions is passed to both x86 and cuda compilation and does not cause driver error.
+// MFS2: -fsplit-machine-functions

shenhan wrote:
> tra wrote:
> > shenhan wrote:
> > > tra wrote:
> > > > We will still see a warning, right? So, for someone compiling with 
> > > > `-Werror` that's going to be a problem.
> > > > 
> > > > Also, if the warning is issued from the top-level driver, we may not 
> > > > even be able to suppress it when we disable splitting on GPU side with 
> > > > `-Xarch_device -fno-split-machine-functions`.
> > > > 
> > > > 
> > > > We will still see a warning, right?
> > > Yes, there still will be a warning. We've discussed it and we think that 
> > > pass -fsplit-machine-functions in this case is not a proper usage and a 
> > > warning is warranted, and it is not good that skip doing split silently 
> > > while uses explicitly ask for it.
> > > 
> > > > Also, if the warning is issued from the top-level driver
> > > The warning will not be issued from the top-level driver, it will be 
> > > issued when configuring optimization passes.
> > > So:
> > > 
> > > 
> > >   - -fsplit-machine-functions -Xarch_device -fno-split-machine-functions
> > > Will enable MFS for host, disable MFS for gpus and without any warnings.
> > > 
> > >   - -Xarch_host -fsplit-machine-functions
> > > The same as the above
> > > 
> > >   - -Xarch_host -fsplit-machine-functions -Xarch_device 
> > > -fno-split-machine-functions
> > > The same as the above
> > > 
> > > We've discussed it and we think that pass -fsplit-machine-functions in 
> > > this case is not a proper usage and a warning is warranted, and it is not 
> > > good that skip doing split silently while uses explicitly ask for it.
> > 
> > I would agree with that assertion if we were talking exclusively about CUDA 
> > compilation.
> > However, a common real world use pattern is that the flags are set globally 
> > for all C++ compilations, and then CUDA compilations within the project 
> > need to do whatever they need to to keep things working. The original user 
> > intent was for the option to affect the host compilation. There's no 
> > inherent assumption that it will do anything useful for the GPU.
> > 
> > In a number of similar cases in the past we did settle on silently ignoring 
> > some top-level flags that we do expect to encounter in real projects, but 
> > which made no sense for the GPU. E.g. sanitizers. If the project is built 
> > w/ sanitizer enabled, the idea is to sanitize the host code, The GPU code 
> > continues to be built w/o sanitizer enabled. 
> > 
> > Anyways, as long as we have a way to deal with it it's not a big deal one 
> > way or another.
> > 
> > > -fsplit-machine-functions -Xarch_device -fno-split-machine-functions
> > > Will enable MFS for host, disable MFS for gpus and without any warnings.
> > 
> > OK. This will work.
> > 
> > 
> In a number of similar cases in the past we did settle on silently ignoring 
> > some top-level flags that we do expect to encounter in real projects, but 
> > which made no sense for the GPU. E.g. sanitizers. If the project is built 
> > w/ sanitizer enabled, the idea is to sanitize the host code, The GPU code 
> > continues to be built w/o sanitizer enabled.
> 
> Can I understand it this way - if the compiler is **only** building for CPUs, 
> then silently ignore any optimization flags is not a good behavior. If the 
> compiler is building CPUs and GPUs, it is still not a good behavior to 
> silently ignore optimization flags for CPUs, but it is probably ok to 
> silently ignore optimization flags for GPUs.
> 
> > OK. This will work.
> Thanks for confirming.
>  it is probably ok to silently ignore optimization flags for GPUs.

In this case, yes. 

I think the most consistent way to handle the situation is to keep the warning 
in place at cc1 compiler level, but change the driver behavior (and document 
it) so that it does not pass the splitting options to offloading 
sub-compilations. This way we'll do the sensible thing for the most common use 
case, yet would still warn if the user tries to enable the splitting where they 
should not (e.g. by using `-Xclang -fsplit-machine-functions` during CUDA 
compilation)






Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D157750/new/

https://reviews.llvm.org/D157750



[PATCH] D157750: Properly handle -fsplit-machine-functions for fatbinary compilation

2023-08-11 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/test/Driver/fsplit-machine-functions-with-cuda-nvptx.c:9
+
+// Check that -fsplit-machine-functions is passed to both x86 and cuda compilation and does not cause driver error.
+// MFS2: -fsplit-machine-functions

shenhan wrote:
> tra wrote:
> > We will still see a warning, right? So, for someone compiling with 
> > `-Werror` that's going to be a problem.
> > 
> > Also, if the warning is issued from the top-level driver, we may not even 
> > be able to suppress it when we disable splitting on GPU side with 
> > `-Xarch_device -fno-split-machine-functions`.
> > 
> > 
> > We will still see a warning, right?
> Yes, there still will be a warning. We've discussed it and we think that pass 
> -fsplit-machine-functions in this case is not a proper usage and a warning is 
> warranted, and it is not good that skip doing split silently while uses 
> explicitly ask for it.
> 
> > Also, if the warning is issued from the top-level driver
> The warning will not be issued from the top-level driver, it will be issued 
> when configuring optimization passes.
> So:
> 
> 
>   - -fsplit-machine-functions -Xarch_device -fno-split-machine-functions
> Will enable MFS for host, disable MFS for gpus and without any warnings.
> 
>   - -Xarch_host -fsplit-machine-functions
> The same as the above
> 
>   - -Xarch_host -fsplit-machine-functions -Xarch_device 
> -fno-split-machine-functions
> The same as the above
> 
> We've discussed it and we think that pass -fsplit-machine-functions in this 
> case is not a proper usage and a warning is warranted, and it is not good 
> that skip doing split silently while uses explicitly ask for it.

I would agree with that assertion if we were talking exclusively about CUDA 
compilation.
However, a common real world use pattern is that the flags are set globally for 
all C++ compilations, and then CUDA compilations within the project need to do 
whatever they need to to keep things working. The original user intent was for 
the option to affect the host compilation. There's no inherent assumption that 
it will do anything useful for the GPU.

In a number of similar cases in the past we did settle on silently ignoring some 
top-level flags that we do expect to encounter in real projects, but which made 
no sense for the GPU. E.g. sanitizers. If the project is built w/ sanitizer 
enabled, the idea is to sanitize the host code, The GPU code continues to be 
built w/o sanitizer enabled. 

Anyways, as long as we have a way to deal with it it's not a big deal one way 
or another.

> -fsplit-machine-functions -Xarch_device -fno-split-machine-functions
> Will enable MFS for host, disable MFS for gpus and without any warnings.

OK. This will work.




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D157750/new/

https://reviews.llvm.org/D157750



[PATCH] D157750: Properly handle -fsplit-machine-functions for fatbinary compilation

2023-08-11 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.






Comment at: clang/test/Driver/fsplit-machine-functions-with-cuda-nvptx.c:9
+
+// Check that -fsplit-machine-functions is passed to both x86 and cuda compilation and does not cause driver error.
+// MFS2: -fsplit-machine-functions

We will still see a warning, right? So, for someone compiling with `-Werror` 
that's going to be a problem.

Also, if the warning is issued from the top-level driver, we may not even be 
able to suppress it when we disable splitting on GPU side with `-Xarch_device 
-fno-split-machine-functions`.




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D157750/new/

https://reviews.llvm.org/D157750



[PATCH] D156014: [Clang][NVPTX] Permit use of the alias attribute for NVPTX targets

2023-08-07 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/lib/Sema/SemaDeclAttr.cpp:1995
   }
-  if (S.Context.getTargetInfo().getTriple().isNVPTX()) {
-S.Diag(AL.getLoc(), diag::err_alias_not_supported_on_nvptx);

jhuber6 wrote:
> jhuber6 wrote:
> > tra wrote:
> > > tra wrote:
> > > > Allowing or not `noreturn` depends on the CUDA version we're building 
> > > > with (or rather on the PTX version we need for .noreturn instruction).
> > > > 
> > > > We would still need to issue the diagnostics if we're using CUDA older 
> > > > than 10.1.
> > > > 
> > > Make it `.alias` and `CUDA older than 10.0`.
> > Do we do any similar diagnostics checks on the CUDA version? I thought that 
> > was more of a clang driver thing and we'd just let the backend handle the 
> > failure, since we can emit LLVM-IR that can be compiled irrespective of the 
> > CUDA version used to make it.
> I checked and I don't think we pass in any CUDA version information to the 
> `-cc1` compiler. In this case if the user didn't have sufficient utilities it 
> would simply fail in the backend or in PTX. We have semi-helpful messages 
> there and it would be a good indicator to update CUDA. Is this fine given 
> that?
We do pass it via `-target-sdk-version=...` 
https://github.com/llvm/llvm-project/blob/1b74459df8a6d960f7387f0c8379047e42811f58/clang/lib/Driver/ToolChains/Clang.cpp#L4707

And then check with `getSDKVersion`. E.g. 
https://github.com/llvm/llvm-project/blob/1b74459df8a6d960f7387f0c8379047e42811f58/clang/lib/CodeGen/CGCUDANV.cpp#L317



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D156014/new/

https://reviews.llvm.org/D156014



[PATCH] D156014: [Clang][NVPTX] Permit use of the alias attribute for NVPTX targets

2023-07-21 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/lib/Sema/SemaDeclAttr.cpp:1995
   }
-  if (S.Context.getTargetInfo().getTriple().isNVPTX()) {
-S.Diag(AL.getLoc(), diag::err_alias_not_supported_on_nvptx);

tra wrote:
> Allowing or not `noreturn` depends on the CUDA version we're building with 
> (or rather on the PTX version we need for .noreturn instruction).
> 
> We would still need to issue the diagnostics if we're using CUDA older than 
> 10.1.
> 
Make it `.alias` and `CUDA older than 10.0`.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D156014/new/

https://reviews.llvm.org/D156014



[PATCH] D156014: [Clang][NVPTX] Permit use of the alias attribute for NVPTX targets

2023-07-21 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/lib/Sema/SemaDeclAttr.cpp:1995
   }
-  if (S.Context.getTargetInfo().getTriple().isNVPTX()) {
-S.Diag(AL.getLoc(), diag::err_alias_not_supported_on_nvptx);

Allowing or not `noreturn` depends on the CUDA version we're building with (or 
rather on the PTX version we need for .noreturn instruction).

We would still need to issue the diagnostics if we're using CUDA older than 
10.1.



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D156014/new/

https://reviews.llvm.org/D156014



[PATCH] D154559: [clang] Fix constant evaluation about static member function

2023-07-18 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

@rsmith Richard, PTAL. This needs your language lawyering expertise.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D154559/new/

https://reviews.llvm.org/D154559



[PATCH] D155539: [CUDA][HIP] Use the same default language std as C++

2023-07-18 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.

We should probably update documentation that C++ standard version for CUDA/HIP 
compilation now matches C++ default instead of previously used c++14.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D155539/new/

https://reviews.llvm.org/D155539



[PATCH] D154822: [clang] Support '-fgpu-default-stream=per-thread' for NVIDIA CUDA

2023-07-13 Thread Artem Belevich via Phabricator via cfe-commits
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
Closed by commit rGf05b58a9468c: [clang] Support 
-fgpu-default-stream=per-thread for NVIDIA CUDA (authored by 
boxu-zhang, committed by tra).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D154822/new/

https://reviews.llvm.org/D154822

Files:
  clang/lib/CodeGen/CGCUDANV.cpp
  clang/lib/Frontend/InitPreprocessor.cpp
  clang/test/CodeGenCUDA/Inputs/cuda.h
  clang/test/CodeGenCUDA/kernel-call.cu


Index: clang/test/CodeGenCUDA/kernel-call.cu
===
--- clang/test/CodeGenCUDA/kernel-call.cu
+++ clang/test/CodeGenCUDA/kernel-call.cu
@@ -2,6 +2,9 @@
 // RUN: | FileCheck %s --check-prefixes=CUDA-OLD,CHECK
 // RUN: %clang_cc1 -target-sdk-version=9.2  -emit-llvm %s -o - \
 // RUN: | FileCheck %s --check-prefixes=CUDA-NEW,CHECK
+// RUN: %clang_cc1 -target-sdk-version=9.2  -emit-llvm %s -o - \
+// RUN:   -fgpu-default-stream=per-thread -DCUDA_API_PER_THREAD_DEFAULT_STREAM \
+// RUN: | FileCheck %s --check-prefixes=CUDA-PTH,CHECK
 // RUN: %clang_cc1 -x hip -emit-llvm %s -o - \
 // RUN: | FileCheck %s --check-prefixes=HIP-OLD,CHECK
 // RUN: %clang_cc1 -fhip-new-launch-api -x hip -emit-llvm %s -o - \
@@ -25,6 +28,7 @@
 // CUDA-OLD: call{{.*}}cudaLaunch
 // CUDA-NEW: call{{.*}}__cudaPopCallConfiguration
 // CUDA-NEW: call{{.*}}cudaLaunchKernel
+// CUDA-PTH: call{{.*}}cudaLaunchKernel_ptsz
 __global__ void g1(int x) {}
 
 // CHECK-LABEL: define{{.*}}main
Index: clang/test/CodeGenCUDA/Inputs/cuda.h
===
--- clang/test/CodeGenCUDA/Inputs/cuda.h
+++ clang/test/CodeGenCUDA/Inputs/cuda.h
@@ -58,6 +58,10 @@
 extern "C" cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim,
                                         dim3 blockDim, void **args,
                                         size_t sharedMem, cudaStream_t stream);
+extern "C" cudaError_t cudaLaunchKernel_ptsz(const void *func, dim3 gridDim,
+                                             dim3 blockDim, void **args,
+                                             size_t sharedMem, cudaStream_t stream);
+
 #endif
 
 extern "C" __device__ int printf(const char*, ...);
Index: clang/lib/Frontend/InitPreprocessor.cpp
===
--- clang/lib/Frontend/InitPreprocessor.cpp
+++ clang/lib/Frontend/InitPreprocessor.cpp
@@ -574,6 +574,9 @@
   Builder.defineMacro("__CLANG_RDC__");
 if (!LangOpts.HIP)
   Builder.defineMacro("__CUDA__");
+if (LangOpts.GPUDefaultStream ==
+LangOptions::GPUDefaultStreamKind::PerThread)
+  Builder.defineMacro("CUDA_API_PER_THREAD_DEFAULT_STREAM");
   }
   if (LangOpts.HIP) {
 Builder.defineMacro("__HIP__");
Index: clang/lib/CodeGen/CGCUDANV.cpp
===
--- clang/lib/CodeGen/CGCUDANV.cpp
+++ clang/lib/CodeGen/CGCUDANV.cpp
@@ -358,9 +358,13 @@
   TranslationUnitDecl *TUDecl = CGM.getContext().getTranslationUnitDecl();
   DeclContext *DC = TranslationUnitDecl::castToDeclContext(TUDecl);
   std::string KernelLaunchAPI = "LaunchKernel";
-  if (CGF.getLangOpts().HIP && CGF.getLangOpts().GPUDefaultStream ==
-          LangOptions::GPUDefaultStreamKind::PerThread)
-    KernelLaunchAPI = KernelLaunchAPI + "_spt";
+  if (CGF.getLangOpts().GPUDefaultStream ==
+      LangOptions::GPUDefaultStreamKind::PerThread) {
+    if (CGF.getLangOpts().HIP)
+      KernelLaunchAPI = KernelLaunchAPI + "_spt";
+    else if (CGF.getLangOpts().CUDA)
+      KernelLaunchAPI = KernelLaunchAPI + "_ptsz";
+  }
   auto LaunchKernelName = addPrefixToName(KernelLaunchAPI);
   IdentifierInfo &cudaLaunchKernelII =
       CGM.getContext().Idents.get(LaunchKernelName);


[PATCH] D154822: [clang] Support '-fgpu-default-stream=per-thread' for NVIDIA CUDA

2023-07-12 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

> Can anyone push this?

I can help with this. How do you want your commit to be attributed? The patch 
currently has `boxu.zhang `. Do you want it to be 
changed to something else?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D154822/new/

https://reviews.llvm.org/D154822



[PATCH] D154300: [CUDA][HIP] Fix template argument deduction

2023-07-11 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/lib/Sema/SemaOverload.cpp:12758-12764
+std::optional MorePreferableByCUDA =
+CheckCUDAPreference(FD, Result);
+// If FD has different CUDA preference than Result.
+if (MorePreferableByCUDA) {
+  // FD is less preferable than Result.
+  if (!*MorePreferableByCUDA)
+continue;

Maybe `CheckCUDAPreference` should return -1/0/1 or an enum. std::optional does 
not seem to be very readable here.

E.g. `if(MorePreferableByCUDA)` sounds like it's going to be satisfied when FD 
is a better choice than Result, but it's not the case.
I think this would be easier to follow:
```
if (CheckCUDAPreference(FD, Result) <= 0) // or `!= CP_BETTER`
 continue;
```
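A hedged sketch of what an enum-returning comparator could look like (all names hypothetical; the real `CheckCUDAPreference` lives in Sema and compares function candidates, not integers):

```cpp
#include <cassert>

// Three-way comparison result, so call sites read naturally as
// `if (checkPreference(...) != CP_Better) continue;` instead of
// unpacking an std::optional<bool> whose polarity is easy to misread.
enum CUDAPreference { CP_Worse = -1, CP_Same = 0, CP_Better = 1 };

// Toy stand-in that ranks two candidates by an integer score.
CUDAPreference checkPreference(int Candidate, int Incumbent) {
  if (Candidate < Incumbent)
    return CP_Worse;
  if (Candidate > Incumbent)
    return CP_Better;
  return CP_Same;
}
```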



CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D154300/new/

https://reviews.llvm.org/D154300



[PATCH] D154797: [CUDA][HIP] Rename and fix `-fcuda-approx-transcendentals`

2023-07-10 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added inline comments.
This revision is now accepted and ready to land.



Comment at: clang/lib/Frontend/InitPreprocessor.cpp:1294
+if (!LangOpts.HIP)
+  Builder.defineMacro("__CLANG_CUDA_APPROX_TRANSCENDENTALS__");
+Builder.defineMacro("__CLANG_GPU_APPROX_TRANSCENDENTALS__");

I think we can remove it. I don't think we need to keep the old one around. 
Internal headers have been changed and the macro was never intended for public 
use. 


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D154797/new/

https://reviews.llvm.org/D154797



[PATCH] D154797: [CUDA][HIP] Rename and fix `-fcuda-approx-transcendentals`

2023-07-10 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Looks good in general.




Comment at: clang/lib/Driver/ToolChains/Clang.cpp:7221-7223
+bool UseApproxTranscendentals = false;
+if (Args.hasFlag(options::OPT_ffast_math, options::OPT_fno_fast_math,
+ false))

```
bool UseApproxTranscendentals = Args.hasFlag(
    options::OPT_ffast_math, options::OPT_fno_fast_math, false);
```



Comment at: clang/lib/Frontend/InitPreprocessor.cpp:1292-1293
+  if (LangOpts.GPUDeviceApproxTranscendentals) {
+Builder.defineMacro(Twine("__CLANG_") + (LangOpts.HIP ? "HIP" : "CUDA") +
+"_APPROX_TRANSCENDENTALS__");
   }

We may want to rename the macro to `__CLANG_GPU_APPROX_TRANSCENDENTALS__`, too. 



CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D154797/new/

https://reviews.llvm.org/D154797



[PATCH] D154822: Support '-fgpu-default-stream=per-thread' for NVIDIA CUDA

2023-07-10 Thread Artem Belevich via Phabricator via cfe-commits
tra added a reviewer: tra.
tra added a comment.

Looking at CUDA headers, it appears that changing only the compiler-generated 
glue may be insufficient. A lot of other CUDA API calls need to be changed to 
the `_ptsz` variant, and for that we need to have 
`CUDA_API_PER_THREAD_DEFAULT_STREAM` defined.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D154822/new/

https://reviews.llvm.org/D154822



[PATCH] D154077: [HIP] Fix version detection for old HIP-PATH

2023-06-29 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

LGTM in general with a minor suggestion.




Comment at: clang/lib/Driver/ToolChains/AMDGPU.cpp:471
  {std::string(SharePath) + "/hip/version",
+  std::string(ParentSharePath) + "/hip/version",
   std::string(BinPath) + "/.hipVersion"}) {

We seem to be rather inconsistent about how we handle paths.

Above, we use `llvm::sys::path::append`, but here we revert to just appending a 
path as a string. 

I think we should be using llvm::sys::path API consistently. It's unfortunate 
that the API does not provide a string-returning function to append elements.
```
auto Append = [](const SmallVectorImpl<char> &path, const Twine &a,
                 const Twine &b = "", const Twine &c = "",
                 const Twine &d = "") {
  SmallString<256> newpath(StringRef(path.data(), path.size()));
  llvm::sys::path::append(newpath, a, b, c, d);
  return newpath;
};
for (const auto &VersionFilePath :
     {Append(SharePath, "hip", "version"),
      Append(ParentSharePath, "hip", "version"),
      Append(BinPath, ".hipVersion")}) {
  ...
}
```



CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D154077/new/

https://reviews.llvm.org/D154077



[PATCH] D144911: adding bf16 support to NVPTX

2023-06-27 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

@bkramer Ben, PTAL when you get a chance.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911



[PATCH] D144911: adding bf16 support to NVPTX

2023-06-23 Thread Artem Belevich via Phabricator via cfe-commits
tra planned changes to this revision.
tra added a comment.

We're still missing clang-side tests for the new builtins.
Now that the intrinsics use `bfloat` we also need to change builtin signatures. 
Or change codegen to bitcast to/from bfloat to match the types.
To be continued next week.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911



[PATCH] D144911: adding bf16 support to NVPTX

2023-06-23 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 534130.
tra added a comment.

Fixed few missed places in bf16 lowering.
Changed intrinsic types to use bfloat type.
Auto-upgrade the old intrinsic variants.
Updated broken tests.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  llvm/include/llvm/IR/IntrinsicsNVVM.td
  llvm/lib/IR/AutoUpgrade.cpp
  llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp
  llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
  llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
  llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
  llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
  llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
  llvm/lib/Target/NVPTX/NVPTXMCExpr.cpp
  llvm/lib/Target/NVPTX/NVPTXMCExpr.h
  llvm/lib/Target/NVPTX/NVPTXSubtarget.cpp
  llvm/lib/Target/NVPTX/NVPTXSubtarget.h
  llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
  llvm/test/CodeGen/NVPTX/bf16-instructions.ll
  llvm/test/CodeGen/NVPTX/convert-sm80.ll
  llvm/test/CodeGen/NVPTX/math-intrins-sm80-ptx70-autoupgrade.ll
  llvm/test/CodeGen/NVPTX/math-intrins-sm80-ptx70.ll
  llvm/test/CodeGen/NVPTX/math-intrins-sm86-ptx72-autoupgrade.ll
  llvm/test/CodeGen/NVPTX/math-intrins-sm86-ptx72.ll

Index: llvm/test/CodeGen/NVPTX/math-intrins-sm86-ptx72.ll
===
--- llvm/test/CodeGen/NVPTX/math-intrins-sm86-ptx72.ll
+++ llvm/test/CodeGen/NVPTX/math-intrins-sm86-ptx72.ll
@@ -9,10 +9,10 @@
 declare <2 x half> @llvm.nvvm.fmin.ftz.xorsign.abs.f16x2(<2 x half> , <2 x half>)
 declare <2 x half> @llvm.nvvm.fmin.nan.xorsign.abs.f16x2(<2 x half> , <2 x half>)
 declare <2 x half> @llvm.nvvm.fmin.ftz.nan.xorsign.abs.f16x2(<2 x half> , <2 x half>)
-declare i16 @llvm.nvvm.fmin.xorsign.abs.bf16(i16, i16)
-declare i16 @llvm.nvvm.fmin.nan.xorsign.abs.bf16(i16, i16)
-declare i32 @llvm.nvvm.fmin.xorsign.abs.bf16x2(i32, i32)
-declare i32 @llvm.nvvm.fmin.nan.xorsign.abs.bf16x2(i32, i32)
+declare bfloat @llvm.nvvm.fmin.xorsign.abs.bf16(bfloat, bfloat)
+declare bfloat @llvm.nvvm.fmin.nan.xorsign.abs.bf16(bfloat, bfloat)
+declare <2 x bfloat> @llvm.nvvm.fmin.xorsign.abs.bf16x2(<2 x bfloat>, <2 x bfloat>)
+declare <2 x bfloat> @llvm.nvvm.fmin.nan.xorsign.abs.bf16x2(<2 x bfloat>, <2 x bfloat>)
 declare float @llvm.nvvm.fmin.xorsign.abs.f(float, float)
 declare float @llvm.nvvm.fmin.ftz.xorsign.abs.f(float, float)
 declare float @llvm.nvvm.fmin.nan.xorsign.abs.f(float, float)
@@ -26,10 +26,10 @@
 declare <2 x half> @llvm.nvvm.fmax.ftz.xorsign.abs.f16x2(<2 x half> , <2 x half>)
 declare <2 x half> @llvm.nvvm.fmax.nan.xorsign.abs.f16x2(<2 x half> , <2 x half>)
 declare <2 x half> @llvm.nvvm.fmax.ftz.nan.xorsign.abs.f16x2(<2 x half> , <2 x half>)
-declare i16 @llvm.nvvm.fmax.xorsign.abs.bf16(i16, i16)
-declare i16 @llvm.nvvm.fmax.nan.xorsign.abs.bf16(i16, i16)
-declare i32 @llvm.nvvm.fmax.xorsign.abs.bf16x2(i32, i32)
-declare i32 @llvm.nvvm.fmax.nan.xorsign.abs.bf16x2(i32, i32)
+declare bfloat @llvm.nvvm.fmax.xorsign.abs.bf16(bfloat, bfloat)
+declare bfloat @llvm.nvvm.fmax.nan.xorsign.abs.bf16(bfloat, bfloat)
+declare <2 x bfloat> @llvm.nvvm.fmax.xorsign.abs.bf16x2(<2 x bfloat>, <2 x bfloat>)
+declare <2 x bfloat> @llvm.nvvm.fmax.nan.xorsign.abs.bf16x2(<2 x bfloat>, <2 x bfloat>)
 declare float @llvm.nvvm.fmax.xorsign.abs.f(float, float)
 declare float @llvm.nvvm.fmax.ftz.xorsign.abs.f(float, float)
 declare float @llvm.nvvm.fmax.nan.xorsign.abs.f(float, float)
@@ -100,35 +100,35 @@
 }
 
 ; CHECK-LABEL: fmin_xorsign_abs_bf16
-define i16 @fmin_xorsign_abs_bf16(i16 %0, i16 %1) {
+define bfloat @fmin_xorsign_abs_bf16(bfloat %0, bfloat %1) {
   ; CHECK-NOT: call
   ; CHECK: min.xorsign.abs.bf16
-  %res = call i16 @llvm.nvvm.fmin.xorsign.abs.bf16(i16 %0, i16 %1)
-  ret i16 %res
+  %res = call bfloat @llvm.nvvm.fmin.xorsign.abs.bf16(bfloat %0, bfloat %1)
+  ret bfloat %res
 }
 
 ; CHECK-LABEL: fmin_nan_xorsign_abs_bf16
-define i16 @fmin_nan_xorsign_abs_bf16(i16 %0, i16 %1) {
+define bfloat @fmin_nan_xorsign_abs_bf16(bfloat %0, bfloat %1) {
   ; CHECK-NOT: call
   ; CHECK: min.NaN.xorsign.abs.bf16
-  %res = call i16 @llvm.nvvm.fmin.nan.xorsign.abs.bf16(i16 %0, i16 %1)
-  ret i16 %res
+  %res = call bfloat @llvm.nvvm.fmin.nan.xorsign.abs.bf16(bfloat %0, bfloat %1)
+  ret bfloat %res
 }
 
 ; CHECK-LABEL: fmin_xorsign_abs_bf16x2
-define i32 @fmin_xorsign_abs_bf16x2(i32 %0, i32 %1) {
+define <2 x bfloat> @fmin_xorsign_abs_bf16x2(<2 x bfloat> %0, <2 x bfloat> %1) {
   ; CHECK-NOT: call
   ; CHECK: min.xorsign.abs.bf16x2
-  %res = call i32 @llvm.nvvm.fmin.xorsign.abs.bf16x2(i32 %0, i32 %1)
-  ret i32 %res
+  %res = call <2 x bfloat> @llvm.nvvm.fmin.xorsign.abs.bf16x2(<2 x bfloat> %0, <2 x bfloat> %1)
+  ret <2 x bfloat> %res
 }
 
 ; CHECK-LABEL: fmin_nan_xorsign_abs_bf16x2
-define i32 @fmin_nan_xorsign_abs_bf16x2(i32 %0, i32 %1) {
+define <2 x bfloat> @fmin_nan_xorsign_abs_bf16x2(<2 x bfloat> %0, <2 x bfloat> %1) {
   ; 

[PATCH] D144911: adding bf16 support to NVPTX

2023-06-23 Thread Artem Belevich via Phabricator via cfe-commits
tra commandeered this revision.
tra edited reviewers, added: kushanam; removed: tra.
tra added a comment.
This revision now requires review to proceed.
Herald added a subscriber: bixia.

I've got a few more fixes for the patch.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144911: adding bf16 support to NVPTX

2023-06-23 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

The latest patch revision still fails on a few LLVM tests:

  Failed Tests (3):
LLVM :: CodeGen/NVPTX/bf16-instructions.ll
LLVM :: CodeGen/NVPTX/f16x2-instructions.ll
LLVM :: CodeGen/NVPTX/math-intrins-sm80-ptx70.ll




Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:1308
 return TypeSplitVector;
-  if (VT == MVT::v2f16)
+  if (Isv2f16Orv2bf16Type((EVT)VT))
 return TypeLegal;

I do not think the cast is necessary. 




Comment at: llvm/lib/Target/NVPTX/NVPTXInstrInfo.td:595-596
 FromName, ".f16 \t$dst, $src;"), []>;
+def _bf16 :
+  NVPTXInst<(outs RC:$dst),
+(ins Int16Regs:$src, CvtMode:$mode),

tra wrote:
> While we're here, it also needs `Requires<[hasPTX<70>, hasSM<80>]>`
This is needed in *addition* to whatever predicate is supplied as an argument. 
E.g. when we do `defm CVT_f32 : CVT_FROM_ALL<"f32", Float32Regs>;` conversion 
from f32 to bf16 should still be predicated on `[hasPTX<70>, hasSM<80>]`.
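A sketch of what the additive predicate could look like in TableGen. The multiclass shape and names here are illustrative, not the committed code; the real multiclass is `CVT_FROM_ALL` in NVPTXInstrInfo.td:

```tablegen
// Hypothetical: accept extra predicates as a multiclass argument and
// append them to the bf16-specific requirements, so conversions from
// any source type to bf16 stay gated on PTX 7.0 / sm_80.
multiclass CVT_FROM_ALL<string FromName, RegisterClass RC,
                        list<Predicate> Preds = []> {
  def _bf16 :
    NVPTXInst<(outs RC:$dst), (ins Int16Regs:$src, CvtMode:$mode),
              !strconcat("cvt${mode:base}${mode:relu}.", FromName,
                         ".bf16 \t$dst, $src;"), []>,
    Requires<!listconcat(Preds, [hasPTX<70>, hasSM<80>])>;
  // ... other source types ...
}
```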



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911



[PATCH] D144911: adding bf16 support to NVPTX

2023-06-20 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: llvm/lib/Target/NVPTX/NVPTXInstrInfo.td:559-568
-multiclass CVT_FROM_FLOAT_SM80 {
-def _f32 :
-  NVPTXInst<(outs RC:$dst),
-(ins Float32Regs:$src, CvtMode:$mode),
-!strconcat("cvt${mode:base}${mode:relu}.",
-FromName, ".f32 \t$dst, $src;"), []>,
-Requires<[hasPTX<70>, hasSM<80>]>;

This is where cvt.rn.relu.bf16.f32  was used to be generated before.

Now we've replaced it with `CVT_FROM_ALL` which does not know anything about 
`relu`.



Comment at: llvm/lib/Target/NVPTX/NVPTXInstrInfo.td:595-596
 FromName, ".f16 \t$dst, $src;"), []>;
+def _bf16 :
+  NVPTXInst<(outs RC:$dst),
+(ins Int16Regs:$src, CvtMode:$mode),

While we're here, it also needs `Requires<[hasPTX<70>, hasSM<80>]>`



Comment at: llvm/lib/Target/NVPTX/NVPTXInstrInfo.td:601
 def _f32 :
   NVPTXInst<(outs RC:$dst),
 (ins Float32Regs:$src, CvtMode:$mode),

We may add an optional `list<Predicate>` argument to the multiclass and do `defm CVT_bf16<... [hasPTX<70>, hasSM<80>]>`



Comment at: llvm/lib/Target/NVPTX/NVPTXInstrInfo.td:603
 (ins Float32Regs:$src, CvtMode:$mode),
 !strconcat("cvt${mode:base}${mode:ftz}${mode:sat}.",
 FromName, ".f32 \t$dst, $src;"), []>;

We also need to augment it with `${mode:relu}` 


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911



[PATCH] D151361: [CUDA] bump supported CUDA version to 12.1/11.8

2023-06-15 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/docs/ReleaseNotes.rst:590
 
+- Clang now supports CUDA SDK up to 12.1
 

bader wrote:
> @tra, could you update llvm/docs/CompileCudaWithLLVM.rst as well, please?
Done in  d028188412fa54774e2c60e21f0929a0fede93bb


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151361/new/

https://reviews.llvm.org/D151361



[PATCH] D144911: adding bf16 support to NVPTX

2023-06-15 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: llvm/lib/Target/NVPTX/NVPTXIntrinsics.td:1271-1287
-def : Pat<(int_nvvm_ff2f16x2_rn Float32Regs:$a, Float32Regs:$b),
-  (CVT_f16x2_f32 Float32Regs:$a, Float32Regs:$b, CvtRN)>;
-def : Pat<(int_nvvm_ff2f16x2_rn_relu Float32Regs:$a, Float32Regs:$b),
-  (CVT_f16x2_f32 Float32Regs:$a, Float32Regs:$b, CvtRN_RELU)>;
-def : Pat<(int_nvvm_ff2f16x2_rz Float32Regs:$a, Float32Regs:$b),
-  (CVT_f16x2_f32 Float32Regs:$a, Float32Regs:$b, CvtRZ)>;
-def : Pat<(int_nvvm_ff2f16x2_rz_relu Float32Regs:$a, Float32Regs:$b),

tra wrote:
> Were these patterns removed intentionally? We still have intrinsics/builtins 
> defined in llvm/include/llvm/IR/IntrinsicsNVVM.td and still need to lower 
> them.
^^^ this question is still unanswered.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911



[PATCH] D144911: adding bf16 support to NVPTX

2023-06-13 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

LGTM with few nits. Thank you for your patience with revising the patch.




Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:629-631
+  const bool IsBFP16FP16x2NegAvailable = STI.getSmVersion() >= 80 &&
+ STI.getPTXVersion() >= 70 &&
+ STI.hasBF16Math();

IsBFP16FP16x2NegAvailable is no longer used and can be removed.



Comment at: llvm/lib/Target/NVPTX/NVPTXInstrInfo.td:159
 def useFP16Math: Predicate<"Subtarget->allowFP16Math()">;
+def useBF16Math: Predicate<"Subtarget->hasBF16Math()">;
 

Nit: I'd rename the record to `hasBF16Math` as the decision is based purely on 
whether we have particular features enabled, and not on whether user input 
allows us to use those instructions or not on the hardware where they are 
present.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911



[PATCH] D144911: adding bf16 support to NVPTX

2023-06-12 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Almost there. Just few cosmetic nits remaining.




Comment at: llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp:64-69
+  case 9:
 OS << "%h";
 break;
   case 8:
+  case 10:
 OS << "%hh";

tra wrote:
> Looks like I've forgotten to remove those cases in my regclass patch. Will fix it shortly.
Still not fixed.



Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:632-634
+  for (const auto  : {MVT::bf16, MVT::v2bf16})
+setOperationAction(ISD::FNEG, VT,
+   IsBFP16FP16x2NegAvailable ? Legal : Expand);

This could be just
```
setBF16OperationAction(ISD::FNEG, MVT::bf16, Legal, Expand);
setBF16OperationAction(ISD::FNEG, MVT::v2bf16, Legal, Expand);
```
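For readers unfamiliar with the helper, the gist of `setBF16OperationAction` can be sketched as a standalone function. This is a simplification under stated assumptions: the names and the exact `hasBF16Math()` condition (sm_80 with PTX 7.0+) are illustrative, and the real helper also records the chosen action in the TargetLowering tables:

```cpp
#include <cassert>

// Actions a target can assign to an operation on a given type.
enum Action { Legal, Promote, Expand };

struct Subtarget {
  int SmVersion;
  int PtxVersion;
  // Assumed condition: native bf16 math needs sm_80 and PTX 7.0+.
  bool hasBF16Math() const { return SmVersion >= 80 && PtxVersion >= 70; }
};

// Pick the "native" action only when the subtarget supports bf16 math;
// otherwise fall back to the supplied action (Promote/Expand).
Action resolveBF16Action(const Subtarget &ST, Action IfNative,
                         Action Fallback) {
  return ST.hasBF16Math() ? IfNative : Fallback;
}
```

With this shape, the two calls in the suggestion above read as "FNEG is Legal on bf16/v2bf16 when bf16 math is available, Expand otherwise."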




Comment at: llvm/lib/Target/NVPTX/NVPTXInstrInfo.td:159
 def useFP16Math: Predicate<"Subtarget->allowFP16Math()">;
+def useBFP16Math: Predicate<"Subtarget->allowBF16Math()">;
 

Nit: `useBF16Math` as in fp16 -> bf16.



Comment at: llvm/lib/Target/NVPTX/NVPTXInstrInfo.td:1118
+[(set RC:$dst, (fneg (T RC:$src)))]>,
+Requires<[useFP16Math, hasPTX<70>, hasSM<80>, Pred]>;
+def BFNEG16_ftz   : FNEG_BF16_F16X2<"neg.ftz.bf16", bf16, Int16Regs, doF32FTZ>;

I think you need to use `useBF16Math` here.



Comment at: llvm/lib/Target/NVPTX/NVPTXSubtarget.cpp:68
+
+bool NVPTXSubtarget::allowBF16Math() const { return hasBF16Math(); }

We do not need `allowBF16Math` any more.  Just use `hasBF16Math()`.



Comment at: llvm/lib/Target/NVPTX/NVPTXSubtarget.h:81
   bool allowFP16Math() const;
+  bool allowBF16Math() const;
   bool hasMaskOperator() const { return PTXVersion >= 71; }

Not needed.



Comment at: llvm/test/CodeGen/NVPTX/bf16-instructions.ll:16
+define bfloat @test_fadd(bfloat %0, bfloat %1) {
+  %3 = fadd bfloat %0, %1
+  ret bfloat %3

Another test that would be useful is for `fadd bfloat %0, 1.0`
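A minimal sketch of the suggested test, exercising an fadd with a constant operand (the CHECK expectations are omitted since they depend on the lowering):

```llvm
; Suggested additional coverage: bf16 fadd with an immediate operand.
define bfloat @test_fadd_imm(bfloat %0) {
  %2 = fadd bfloat %0, 1.0
  ret bfloat %2
}
```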



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911



[PATCH] D16559: [CUDA] Add -fcuda-allow-variadic-functions.

2023-06-09 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In D16559#4410067, @garymm wrote:

> Could you please add this to the documentation?
> Could this be made the default? It seems like nvcc does this by default.

Clang already does that, though we only allow variadic functions that don't 
actually use the vararg arguments: https://reviews.llvm.org/D151359
It's sufficient to compile recent CUDA/libcu++ headers w/o errors.


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D16559/new/

https://reviews.llvm.org/D16559



[PATCH] D144911: adding bf16 support to NVPTX

2023-06-09 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp:615
   // need to deal with.
   if (Vector.getSimpleValueType() != MVT::v2f16)
 return false;

This needs to be updated to include v2bf16
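One way the updated guard might read, reusing the `Isv2f16Orv2bf16Type` helper mentioned elsewhere in this review (a sketch, not the committed code):

```cpp
// Handle both packed-f16 and packed-bf16 vectors; anything else is
// left to the generic lowering path.
if (!Isv2f16Orv2bf16Type(Vector.getSimpleValueType()))
  return false;
```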


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911



[PATCH] D152403: [Clang][CUDA] Disable diagnostics for neon attrs for GPU-side CUDA compilation

2023-06-08 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

LGTM with a nit.




Comment at: clang/lib/Sema/SemaType.cpp:8168
+IsTargetCUDAAndHostARM =
+!AuxTI || AuxTI->getTriple().isAArch64() || AuxTI->getTriple().isARM();
+  }

alexander-shaposhnikov wrote:
> tra wrote:
> > Should it be `AuxTI && (AuxTI->getTriple().isAArch64() || AuxTI->getTriple().isARM())`?
> > 
> > I don't think we want IsTargetCUDAAndHostARM to be set to true if there's 
> > no auxTargetInfo (e.g. during any C++ compilation, regardless of the 
> > target).
> we get here only if S.getLangOpts().CUDAIsDevice is true, so not for an 
> arbitrary c++ compilation,
> iirc AuxTI was null for some tests, but I'm happy to double check,
> AuxTI && ... looks better to me too.
I'd still prefer to have `()` around `AuxTI->getTriple().isAArch64() || 
AuxTI->getTriple().isARM()`.




Comment at: clang/test/SemaCUDA/neon-attrs.cu:2
+// RUN: %clang_cc1 -triple arm64-linux-gnu -target-feature +neon -x cuda -fsyntax-only -DNO_DIAG -verify %s
+// RUN: %clang_cc1 -triple arm64-linux-gnu -target-feature -neon -x cuda -fsyntax-only -verify %s
+

alexander-shaposhnikov wrote:
> tra wrote:
> > You should also pass `-aux-triple nvptx64...`.
> > 
> > This also needs more test cases. This only tests host-side CUDA compilation.
> > We also need:
> > ```
> > // GPU-side compilation on ARM (no errors expected)
> > // RUN: %clang_cc1 -aux-triple arm64-linux-gnu -triple nvptx64 -fcuda-is-device -x cuda -fsyntax-only -DNO_DIAG -verify %s
> > // Regular C++ compilation on x86 and ARM without neon (should produce diagnostics)
> > // RUN: %clang_cc1 -triple x86 -x c++ -fsyntax-only -verify %s
> > // RUN: %clang_cc1 -triple arm64... -x c++ -target-feature -neon -fsyntax-only -verify %s
> > // C++ on ARM w/ neon (no diagnostics)
> > // RUN: %clang_cc1 -triple arm64... -x c++ -target-feature +neon -fsyntax-only -DNO_DIAG -verify %s
> > ``` 
> Regular C++ compilation is covered by other in-tree tests; do we really need it here?
If it's already covered (for x86, too?), then you can skip c++ tests.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152403/new/

https://reviews.llvm.org/D152403



[PATCH] D144911: adding bf16 support to NVPTX

2023-06-08 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Overall looks good with few minor nits and a couple of questions.




Comment at: llvm/include/llvm/IR/IntrinsicsNVVM.td:604
   def int_nvvm_f # operation # variant :
 ClangBuiltin,
 DefaultAttrsIntrinsic<[llvm_i16_ty], [llvm_i16_ty, llvm_i16_ty],

tra wrote:
> tra wrote:
> > Availability of these new instructions is conditional on specific CUDA 
> > version and the GPU variant we're compiling for,
> > Such builtins are normally implemented on the clang size as a 
> > `TARGET_BUILTIN()` with appropriate constraints.
> > 
> > Without that `ClangBuiltin` may automatically add enough glue to make them 
> > available in clang unconditionally, which would result in compiler crashing 
> > if a user tries to use one of those builtins with a wrong GPU or CUDA 
> > version. We want to emit a diagnostics, not cause a compiler crash.
> > 
> > Usually such related LLVM and clang changes should be part of the same 
> > patch.
> > 
> > This applies to the new intrinsic variants added below, too.
> I do not think it's is done. 
> 
> Can you check what happens if you try to call any of bf16 builtins while 
> compiling for sm_60? Ideally we should produce a sensible error that the 
> builtin is not available.
> 
> I suspect we will fail in LLVM when we'll fail to lower the intrinsic, ot in 
> nvptx if we've managed to lower it to an instruction unsupported by sm_60.
OK. We'll leave conditional clang builtin handling to be fixed separately as 
it's not directly related to this patch.
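For context, conditional availability is what `TARGET_BUILTIN` entries in `clang/include/clang/Basic/BuiltinsNVPTX.def` express. A hedged sketch of what gating a bf16 builtin could look like — the type string and feature macros here are illustrative assumptions, not the actual entry:

```cpp
// Hypothetical .def entry: the last argument names the target features
// required for the builtin, letting Sema emit a diagnostic when the
// builtin is used on an older GPU instead of crashing in the backend.
TARGET_BUILTIN(__nvvm_fmin_xorsign_abs_bf16, "UsUsUs", "", AND(SM_86, PTX72))
```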



Comment at: llvm/include/llvm/IR/IntrinsicsNVVM.td:878
 def int_nvvm_fma # variant : ClangBuiltin,
-  DefaultAttrsIntrinsic<[llvm_i16_ty],
-[llvm_i16_ty, llvm_i16_ty, llvm_i16_ty],
+  DefaultAttrsIntrinsic<[llvm_bfloat_ty],
+[llvm_bfloat_ty, llvm_bfloat_ty, llvm_bfloat_ty],

This changes signatures of existing intrinsics and builtins. While the change 
is correct, we should at least check that MLIR tests are still passing.




Comment at: llvm/include/llvm/IR/IntrinsicsNVVM.td:1244-1251
   def int_nvvm_ff2bf16x2_rn : ClangBuiltin<"__nvvm_ff2bf16x2_rn">,
Intrinsic<[llvm_i32_ty], [llvm_float_ty, llvm_float_ty], [IntrNoMem, 
IntrNoCallback]>;
   def int_nvvm_ff2bf16x2_rn_relu : ClangBuiltin<"__nvvm_ff2bf16x2_rn_relu">,
   Intrinsic<[llvm_i32_ty], [llvm_float_ty, llvm_float_ty], [IntrNoMem, 
IntrNoCallback]>;
   def int_nvvm_ff2bf16x2_rz : ClangBuiltin<"__nvvm_ff2bf16x2_rz">,
   Intrinsic<[llvm_i32_ty], [llvm_float_ty, llvm_float_ty], [IntrNoMem, 
IntrNoCallback]>;
   def int_nvvm_ff2bf16x2_rz_relu : ClangBuiltin<"__nvvm_ff2bf16x2_rz_relu">,

We've removed the patterns matching these intrinsics in 
lib/Target/NVPTX/NVPTXIntrinsics.td so there's nothing to lower them to an 
instruction now. Was that intentional?



Comment at: llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp:64-69
+  case 9:
 OS << "%h";
 break;
   case 8:
+  case 10:
 OS << "%hh";

Looks like I've forgotten to remove those cases in my regclass patch. Will fix it shortly.



Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:640
  ISD::FROUNDEVEN, ISD::FTRUNC}) {
+setOperationAction(Op, MVT::bf16, Legal);
 setOperationAction(Op, MVT::f16, Legal);

Nit: sometimes bf16 variants are added above fp16 variants, sometimes after. It 
would be nice to do it consistently. I guess we should just do a cleanup patch 
sorting these blocks in type order.



Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:2514-2516
+  if ((Isv2f16Orv2bf16Type(VT.getSimpleVT())) &&
   !allowsMemoryAccessForAlignment(*DAG.getContext(), DAG.getDataLayout(),
   VT, *Store->getMemOperand()))

Unnecessary  `()`around `Isv2f16Orv2bf16Type(VT.getSimpleVT())`



Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:2601
   // store them with st.v4.b32.
-  assert((EltVT == MVT::f16 || EltVT == MVT::bf16) &&
+  assert((Isf16Orbf16Type(EltVT.getSimpleVT())) &&
  "Wrong type for the vector.");

Ditto.



Comment at: llvm/lib/Target/NVPTX/NVPTXInstrInfo.td:1316-1326
 defm FMA16_ftz : FMA_F16<"fma.rn.ftz.f16", f16, Int16Regs, doF32FTZ>;
 defm FMA16 : FMA_F16<"fma.rn.f16", f16, Int16Regs, True>;
 defm FMA16x2_ftz : FMA_F16<"fma.rn.ftz.f16x2", v2f16, Int32Regs, doF32FTZ>;
 defm FMA16x2 : FMA_F16<"fma.rn.f16x2", v2f16, Int32Regs, True>;
+defm BFMA16_ftz : FMA_BF16<"fma.rn.ftz.bf16", bf16, Int16Regs, doF32FTZ>;
+defm BFMA16 : FMA_BF16<"fma.rn.bf16", bf16, Int16Regs, True>;
+defm BFMA16x2_ftz : FMA_BF16<"fma.rn.ftz.bf16x2", v2bf16, Int32Regs, doF32FTZ>;

Nit: align ':' across the block.



Comment at: 

[PATCH] D152403: [Clang][CUDA] Disable diagnostics for neon attrs for GPU-side CUDA compilation

2023-06-07 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/lib/Sema/SemaType.cpp:8168
+IsTargetCUDAAndHostARM =
+!AuxTI || AuxTI->getTriple().isAArch64() || AuxTI->getTriple().isARM();
+  }

Should it be `AuxTI && (AuxTI->getTriple().isAArch64() || AuxTI->getTriple().isARM())`?

I don't think we want IsTargetCUDAAndHostARM to be set to true if there's no 
auxTargetInfo (e.g. during any C++ compilation, regardless of the target).



Comment at: clang/lib/Sema/SemaType.cpp:8173-8174
   // not to need a separate attribute)
   if (!S.Context.getTargetInfo().hasFeature("neon") &&
-  !S.Context.getTargetInfo().hasFeature("mve")) {
+  !S.Context.getTargetInfo().hasFeature("mve") && !IsTargetCUDAAndHostARM) {
 S.Diag(Attr.getLoc(), diag::err_attribute_unsupported)

Nit: `!(S.Context.getTargetInfo().hasFeature("neon") || 
S.Context.getTargetInfo().hasFeature("mve") || IsTargetCUDAAndHostARM)` would 
be a bit easier to read.



Comment at: clang/test/SemaCUDA/neon-attrs.cu:1
+// RUN: %clang_cc1 -triple arm64-linux-gnu -target-feature +neon -x cuda -fsyntax-only -DNO_DIAG -verify %s
+// RUN: %clang_cc1 -triple arm64-linux-gnu -target-feature -neon -x cuda -fsyntax-only -verify %s

Instead of replicating the code, you could use a different verify prefix for each case, e.g. `-verify=quiet`, and then in the body of the test use `// quiet-no-diagnostics`.
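Concretely, the suggestion amounts to something like this sketch:

```
// RUN: %clang_cc1 -triple arm64-linux-gnu -target-feature +neon -x cuda -fsyntax-only -verify=quiet %s
// RUN: %clang_cc1 -triple arm64-linux-gnu -target-feature -neon -x cuda -fsyntax-only -verify %s
// quiet-no-diagnostics
```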



Comment at: clang/test/SemaCUDA/neon-attrs.cu:2
+// RUN: %clang_cc1 -triple arm64-linux-gnu -target-feature +neon -x cuda -fsyntax-only -DNO_DIAG -verify %s
+// RUN: %clang_cc1 -triple arm64-linux-gnu -target-feature -neon -x cuda -fsyntax-only -verify %s
+

You should also pass `-aux-triple nvptx64...`.

This also needs more test cases. This only tests host-side CUDA compilation.
We also need:
```
// GPU-side compilation on ARM (no errors expected)
// RUN: %clang_cc1 -aux-triple arm64-linux-gnu -triple nvptx64 -fcuda-is-device -x cuda -fsyntax-only -DNO_DIAG -verify %s
// Regular C++ compilation on x86 and ARM without neon (should produce diagnostics)
// RUN: %clang_cc1 -triple x86 -x c++ -fsyntax-only -verify %s
// RUN: %clang_cc1 -triple arm64... -x c++ -target-feature -neon -fsyntax-only -verify %s
// C++ on ARM w/ neon (no diagnostics)
// RUN: %clang_cc1 -triple arm64... -x c++ -target-feature +neon -fsyntax-only -DNO_DIAG -verify %s
``` 


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152403/new/

https://reviews.llvm.org/D152403



[PATCH] D152391: [Clang] Allow bitcode linking when the input is LLVM-IR

2023-06-07 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

> clang in.bc -Xclang -mlink-builtin-bitcode -Xclang libdevice.10.bc

If that's something we intend to expose to the user, should we consider 
promoting it to a top-level driver option?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152391/new/

https://reviews.llvm.org/D152391



[PATCH] D144911: adding bf16 support to NVPTX

2023-06-06 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:615
 setFP16OperationAction(Op, MVT::v2f16, Legal, Expand);
-  }
-
-  for (const auto  : {ISD::FADD, ISD::FMUL, ISD::FSUB, ISD::FMA}) {
 setBF16OperationAction(Op, MVT::bf16, Legal, Promote);
 setBF16OperationAction(Op, MVT::v2bf16, Legal, Expand);

kushanam wrote:
> tra wrote:
> > There's still something odd with this patch. The `setBF16OperationAction` 
> > is not in the upstream, but it does not show up in the diff on phabricator. 
> > 
> > Please do rebase on top of the LLVM and make sure that all your changes are 
> > on the git branch you use to send the patch to phabricator. If in doubt how 
> > to get `arc` to do it correctly, you can always create and upload the diff 
> > manually as described here: 
> > https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface
> It is in the first commit, isn't it? https://reviews.llvm.org/D144911?id=500896
It *was* in the first revision of the patch, but it's not in the current one.

The Phabricator commit history tracks the evolution of a single patch, not a dependent set of patches. If you need a dependent patch, submit each patch individually (i.e. with its own Phabricator ID) and then record their relationship via "Edit related revisions -> Edit child/parent revisions". After that you will see them arranged under the "stack" sub-tab in the "Revision contents" section.



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911



[PATCH] D99201: [HIP] Diagnose unaligned atomic for amdgpu

2023-06-06 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/lib/Driver/ToolChains/Clang.cpp:7215
+// warnings as errors.
+CmdArgs.push_back("-Werror=atomic-alignment");
   }

Should it be done from `HIPAMDToolChain::addClangWarningOptions` ?

That's where Darwin does a similar promotion from a warning to an error.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99201/new/

https://reviews.llvm.org/D99201



[PATCH] D144911: adding bf16 support to NVPTX

2023-06-05 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:615
 setFP16OperationAction(Op, MVT::v2f16, Legal, Expand);
-  }
-
-  for (const auto  : {ISD::FADD, ISD::FMUL, ISD::FSUB, ISD::FMA}) {
 setBF16OperationAction(Op, MVT::bf16, Legal, Promote);
 setBF16OperationAction(Op, MVT::v2bf16, Legal, Expand);

There's still something odd with this patch. The `setBF16OperationAction` is 
not in the upstream, but it does not show up in the diff on phabricator. 

Please do rebase on top of the LLVM and make sure that all your changes are on 
the git branch you use to send the patch to phabricator. If in doubt how to get 
`arc` to do it correctly, you can always create and upload the diff manually as 
described here: 
https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface



Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:689
 setFP16OperationAction(Op, MVT::f16, GetMinMaxAction(Expand), Expand);
+setFP16OperationAction(Op, MVT::bf16, GetMinMaxAction(Expand), Expand);
 setOperationAction(Op, MVT::f32, GetMinMaxAction(Expand));

Should it be `set*BF*16OperationAction` ?



Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:692
 setFP16OperationAction(Op, MVT::v2f16, GetMinMaxAction(Expand), Expand);
-  }
-  for (const auto  : {ISD::FMINNUM, ISD::FMAXNUM}) {
-setBF16OperationAction(Op, MVT::bf16, GetMinMaxAction(Promote), Promote);
-setBF16OperationAction(Op, MVT::v2bf16, GetMinMaxAction(Expand), Expand);
-setBF16OperationAction(Op, MVT::bf16, GetMinMaxAction(Expand), Expand);
-setBF16OperationAction(Op, MVT::v2bf16, GetMinMaxAction(Expand), Expand);
+setFP16OperationAction(Op, MVT::v2bf16, GetMinMaxAction(Expand), Expand);
   }

ditto.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911



[PATCH] D144911: adding bf16 support to NVPTX

2023-06-05 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

FYI https://reviews.llvm.org/D151601 has landed in 
https://github.com/llvm/llvm-project/commit/dc90f42ea7b4f6d9e643f5ad2ba663eba2f9e421.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911



[PATCH] D152164: [CUDA][HIP] Externalize device var in anonymous namespace

2023-06-05 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added inline comments.
This revision is now accepted and ready to land.



Comment at: clang/test/CodeGenCUDA/anon-ns.cu:46
+
+// COMMON-DAG: @[[STR1:.*]] = {{.*}} c"[[KERN1]]\00"
+// COMMON-DAG: @[[STR2:.*]] = {{.*}} c"[[KERN2]]\00"

Nit: I'd rename the patterns to reflect the names of the source entities they 
track, so we don't have to dig through multiple dependent matches in order to 
figure out what the test does.
E.g. for `tempKern` : `KERN3`, `STR3` -> `TKERN`, `TKERNSTR`.

Maybe give kernels/variables more distinct names as well. My brain keeps trying 
to interpret `temp` as `temporary`. 
A common naming scheme would be nice. E.g. `tk`, `tv` for the template kernel 
and variable, `a*` for anonymous entities.



CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152164/new/

https://reviews.llvm.org/D152164



[PATCH] D152027: [CUDA] Update Kepler(sm_3*) support info.

2023-06-02 Thread Artem Belevich via Phabricator via cfe-commits
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
Closed by commit rG0f49116e261c: [CUDA] Update Kepler(sm_3*) support info. 
(authored by tra).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152027/new/

https://reviews.llvm.org/D152027

Files:
  clang/lib/Basic/Cuda.cpp


Index: clang/lib/Basic/Cuda.cpp
===
--- clang/lib/Basic/Cuda.cpp
+++ clang/lib/Basic/Cuda.cpp
@@ -222,7 +222,11 @@
   case CudaArch::SM_21:
 return CudaVersion::CUDA_80;
   case CudaArch::SM_30:
-return CudaVersion::CUDA_110;
+  case CudaArch::SM_32:
+return CudaVersion::CUDA_102;
+  case CudaArch::SM_35:
+  case CudaArch::SM_37:
+return CudaVersion::CUDA_118;
   default:
 return CudaVersion::NEW;
   }




[PATCH] D152027: [CUDA] Update Kepler(sm_3*) support info.

2023-06-02 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
Herald added subscribers: mattd, carlosgalvezp, bixia, yaxunl.
Herald added a project: All.
tra published this revision for review.
tra added a reviewer: jlebar.
tra added a comment.
Herald added a project: clang.
Herald added a subscriber: cfe-commits.

Kepler is gone! Long live Kepler!


sm_30 and sm_32 were removed in cuda-11.0
sm_35 and sm_37 were removed in cuda-12.0
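Read as a table, the updated switch reports the newest CUDA release that still ships support for each Kepler-era arch. A standalone mirror of that mapping (the function and enum names here are illustrative; the real code lives in clang/lib/Basic/Cuda.cpp):

```cpp
#include <string>

// Subset of CUDA versions relevant to the Kepler removal schedule.
enum class CudaVersion { CUDA_80, CUDA_102, CUDA_118, NEW };

// Newest CUDA release that still supports the given SM architecture.
CudaVersion maxSupportedCudaFor(const std::string &Arch) {
  if (Arch == "sm_30" || Arch == "sm_32")
    return CudaVersion::CUDA_102; // removed in CUDA 11.0
  if (Arch == "sm_35" || Arch == "sm_37")
    return CudaVersion::CUDA_118; // removed in CUDA 12.0
  return CudaVersion::NEW;       // newer arches: no known upper bound
}
```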


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D152027

Files:
  clang/lib/Basic/Cuda.cpp


Index: clang/lib/Basic/Cuda.cpp
===
--- clang/lib/Basic/Cuda.cpp
+++ clang/lib/Basic/Cuda.cpp
@@ -222,7 +222,11 @@
   case CudaArch::SM_21:
 return CudaVersion::CUDA_80;
   case CudaArch::SM_30:
-return CudaVersion::CUDA_110;
+  case CudaArch::SM_32:
+return CudaVersion::CUDA_102;
+  case CudaArch::SM_35:
+  case CudaArch::SM_37:
+return CudaVersion::CUDA_118;
   default:
 return CudaVersion::NEW;
   }




[PATCH] D151601: [NVPTX] Coalesce register classes for {i16,f16,bf16}, {i32,v2f16,v2bf16}

2023-06-02 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

I've tested the change on a bunch of tensorflow tests and the patch didn't 
cause any apparent issues.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151601/new/

https://reviews.llvm.org/D151601



[PATCH] D144911: adding bf16 support to NVPTX

2023-06-02 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In D144911#4389187 , @manishucsd 
wrote:

> I fail to compile this patch. Please find the compilation error below:
>
>   [build] ./llvm-project/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td:1117:40: 
> error: Variable not defined: 'hasPTX70'
>   [build] Requires<[useFP16Math, hasPTX70, hasSM80, Pred]>;
>   [build]^

You need to update your patch. Recent LLVM changes have changed `hasPTXab` -> 
`hasPTX<ab>`, and similarly `hasSMab` -> `hasSM<ab>`.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911



[PATCH] D151876: [NVPTX] Signed char and (unsigned)long overloads of ldg and ldu

2023-06-01 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

I'd change the patch title:

- `[NVPTX]` -> `[cuda, NVPTX]` as these are clang changes, not NVPTX back-end.
- `overloads ` -> `builtins`




Comment at: clang/include/clang/Basic/BuiltinsNVPTX.def:862
 BUILTIN(__nvvm_ldg_c, "ccC*", "")
+BUILTIN(__nvvm_ldg_sc, "ScScC*", "")
 BUILTIN(__nvvm_ldg_s, "ssC*", "")

One thing that bugs me is that ldg should technically be a `TARGET_BUILTIN(..., 
AND(PTX31,SM_32))`.

Oh, well, that train is gone now that pre-sm3x GPUs are no longer supported by 
NVIDIA anyways.



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151876/new/

https://reviews.llvm.org/D151876



[PATCH] D151904: [clang-repl][CUDA] Add an unit test for interactive CUDA

2023-06-01 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/unittests/Interpreter/InteractiveCudaTest.cpp:92
+  std::unique_ptr<Interpreter> Interp = createInterpreter();
+  auto Err = Interp->LoadDynamicLibrary("libcudart.so");
+  if (Err) { // CUDA runtime is not installed/usable, cannot continue testing

argentite wrote:
> tra wrote:
> > This could be a bit of a problem.
> > 
> > There may be multiple CUDA SDK versions that may be installed on a system 
> > at any given time and the libcudart.so you pick here may not be the one you 
> > want.
> > E.g it may be from a recent CUDA version which is not supported by NVIDIA 
> > drivers yet. 
> > 
> > I think you may need a way to let the user override CUDA SDK (or 
> > libcudart.so) location explicitly. I guess they could do that via 
> > LD_LIBRARY_PATH, but for the CUDA compilation in general, knowing CUDA SDK 
> > path is essential, as it does affect various compilation options set by the 
> > driver.
> > 
> Yes, this probably would be an issue. It is currently possible to override 
> the CUDA path with a command line argument in clang-repl. But I am not sure 
> what we can do inside a test.
To me it looks like CUDA location should be detected/set at the configuration 
time and then propagated to the individual tests that need that info.
CMake has cuda detection mechanisms that could be used for that purpose.
They are a bit of a pain to use in practice (I'm still not sure what's the 
reliable way to do it), but it's as close to the 'standard' way of doing it as 
we have at the moment.
I believe libc and mlir subtrees in LLVM are already using this mechanism. E.g 
https://github.com/llvm/llvm-project/blob/main/libc/utils/gpu/loader/CMakeLists.txt#L16


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151904/new/

https://reviews.llvm.org/D151904



[PATCH] D151904: [clang-repl][CUDA] Add an unit test for interactive CUDA

2023-06-01 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/unittests/Interpreter/InteractiveCudaTest.cpp:92
+  std::unique_ptr<Interpreter> Interp = createInterpreter();
+  auto Err = Interp->LoadDynamicLibrary("libcudart.so");
+  if (Err) { // CUDA runtime is not installed/usable, cannot continue testing

This could be a bit of a problem.

There may be multiple CUDA SDK versions that may be installed on a system at 
any given time and the libcudart.so you pick here may not be the one you want.
E.g. it may be from a recent CUDA version which is not supported by NVIDIA 
drivers yet. 

I think you may need a way to let the user override CUDA SDK (or libcudart.so) 
location explicitly. I guess they could do that via LD_LIBRARY_PATH, but for 
the CUDA compilation in general, knowing CUDA SDK path is essential, as it does 
affect various compilation options set by the driver.



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151904/new/

https://reviews.llvm.org/D151904



[PATCH] D151839: [LinkerWrapper] Fix static library symbol resolution

2023-05-31 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/test/Driver/linker-wrapper-libs.c:27
 //
 // Check that we extract a static library defining an undefined symbol.
 //

jhuber6 wrote:
> tra wrote:
> > How does this test test the functionality of the undefined symbol? E.g. how 
> > does it fail now, before the patch?
> > 
> > Is there an explicit check we could to do to make sure things work as 
> > intended as opposed to "there's no obvious error" which may also mean "we 
> > forgot to process *undefined.bc".
> Yeah, I wasn't sure how to define a good test for this. The problem I 
> encountered before making this patch was that having another file that used 
> an undefined symbol would override the `NewSymbol` check and then would 
> prevent it from being extracted. So this checks that case.
AFAICT, with -DUNDEFINED, the file would have only `extern int sym;`. CE 
suggests that it produces an empty bitcode file: https://godbolt.org/z/EY9a8Pfeb

What exactly is supposed to be in the `*.undefined.bc` ?  If it's intended to 
have an undefined reference to `sym` you need to add some sort of a reference 
to it. 



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151839/new/

https://reviews.llvm.org/D151839



[PATCH] D151839: [LinkerWrapper] Fix static library symbol resolution

2023-05-31 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

LGTM in general.




Comment at: clang/test/Driver/linker-wrapper-libs.c:27
 //
 // Check that we extract a static library defining an undefined symbol.
 //

How does this test test the functionality of the undefined symbol? E.g. how 
does it fail now, before the patch?

Is there an explicit check we could to do to make sure things work as intended 
as opposed to "there's no obvious error" which may also mean "we forgot to 
process *undefined.bc".


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151839/new/

https://reviews.llvm.org/D151839



[PATCH] D150985: [clang] Allow fp in atomic fetch max/min builtins

2023-05-31 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

LGTM with few more test nits.




Comment at: clang/test/Sema/atomic-ops.c:134
int *I, const int *CI,
int **P, float *D, struct S *s1, struct S *s2) {
   __c11_atomic_init(I, 5); // expected-error {{pointer to _Atomic}}

I wonder why we have this inconsistency in the non-atomic arguments.
We don't actually have any double variants and the argument `D` is actually a 
`float *`, even though the naming convention used suggests that it should've 
been either a `double *` or should be called `F`.




Comment at: clang/test/Sema/atomic-ops.c:218
   __atomic_fetch_sub(s1, 3, memory_order_seq_cst); // expected-error {{must be 
a pointer to integer, pointer or supported floating point type}}
-  __atomic_fetch_min(D, 3, memory_order_seq_cst); // expected-error {{must be 
a pointer to integer}}
-  __atomic_fetch_max(P, 3, memory_order_seq_cst); // expected-error {{must be 
a pointer to integer}}
+  __atomic_fetch_min(D, 3, memory_order_seq_cst);
+  __atomic_fetch_max(P, 3, memory_order_seq_cst); // expected-error {{must be 
a pointer to integer or supported floating point type}}

We seem to be missing the tests for `double *` here, too.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150985/new/

https://reviews.llvm.org/D150985



[PATCH] D150985: [clang] Allow fp in atomic fetch max/min builtins

2023-05-31 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/lib/Sema/SemaChecking.cpp:6576-6578
   if (!ValType->isFloatingType())
 return false;
+  if (!(AllowedType & AOAVT_FP))

Collapse into a single if statement: `if (!(ValType->isFloatingType() && 
(AllowedType & AOAVT_FP)))`



Comment at: clang/lib/Sema/SemaChecking.cpp:6588
+if (!IsAllowedValueType(ValType, ArithAllows)) {
+  assert(ArithAllows & AOAVT_Integer);
+  auto DID = ArithAllows & AOAVT_FP

Why do we expect a failed `IsAllowedValueType` check to fail only if we were 
allowed integers? Is that because we assume that all atomic instructions 
support integers?

If that's the case, I'd hoist the assertion and apply it right after we're done 
setting `ArithAllows`. Alternatively, we could discard `AOAVT_Integer` and call 
the enum `ArithOpExtraValueType`. Tracking a bit that's always set does not buy 
us much, though it does make the code a bit more uniform. Up to you.





CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150985/new/

https://reviews.llvm.org/D150985



[PATCH] D151503: [CUDA] correctly install cuda_wrappers/bits/shared_ptr_base.h

2023-05-30 Thread Artem Belevich via Phabricator via cfe-commits
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
Closed by commit rG6cdc07a701ee: [CUDA] correctly install 
cuda_wrappers/bits/shared_ptr_base.h (authored by tra).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151503/new/

https://reviews.llvm.org/D151503

Files:
  clang/lib/Headers/CMakeLists.txt


Index: clang/lib/Headers/CMakeLists.txt
===================================================================
--- clang/lib/Headers/CMakeLists.txt
+++ clang/lib/Headers/CMakeLists.txt
@@ -267,6 +267,9 @@
   cuda_wrappers/cmath
   cuda_wrappers/complex
   cuda_wrappers/new
+)
+
+set(cuda_wrapper_bits_files
   cuda_wrappers/bits/shared_ptr_base.h
 )
 
@@ -328,7 +331,8 @@
 
 
 # Copy header files from the source directory to the build directory
-foreach( f ${files} ${cuda_wrapper_files} ${ppc_wrapper_files} 
${openmp_wrapper_files} ${hlsl_files})
+foreach( f ${files} ${cuda_wrapper_files} ${cuda_wrapper_bits_files}
+   ${ppc_wrapper_files} ${openmp_wrapper_files} ${hlsl_files})
   copy_header_to_output_dir(${CMAKE_CURRENT_SOURCE_DIR} ${f})
 endforeach( f )
 
@@ -432,7 +436,7 @@
 # Architecture/platform specific targets
 add_header_target("arm-resource-headers" 
"${arm_only_files};${arm_only_generated_files}")
 add_header_target("aarch64-resource-headers" 
"${aarch64_only_files};${aarch64_only_generated_files}")
-add_header_target("cuda-resource-headers" 
"${cuda_files};${cuda_wrapper_files}")
+add_header_target("cuda-resource-headers" 
"${cuda_files};${cuda_wrapper_files};${cuda_wrapper_bits_files}")
 add_header_target("hexagon-resource-headers" "${hexagon_files}")
 add_header_target("hip-resource-headers" "${hip_files}")
 add_header_target("loongarch-resource-headers" "${loongarch_files}")
@@ -466,6 +470,11 @@
   DESTINATION ${header_install_dir}/cuda_wrappers
   COMPONENT clang-resource-headers)
 
+install(
+  FILES ${cuda_wrapper_bits_files}
+  DESTINATION ${header_install_dir}/cuda_wrappers/bits
+  COMPONENT clang-resource-headers)
+
 install(
   FILES ${ppc_wrapper_files}
   DESTINATION ${header_install_dir}/ppc_wrappers
@@ -508,6 +517,12 @@
   EXCLUDE_FROM_ALL
   COMPONENT cuda-resource-headers)
 
+install(
+  FILES ${cuda_wrapper_bits_files}
+  DESTINATION ${header_install_dir}/cuda_wrappers/bits
+  EXCLUDE_FROM_ALL
+  COMPONENT cuda-resource-headers)
+
 install(
   FILES ${cuda_files}
   DESTINATION ${header_install_dir}




[PATCH] D151503: [CUDA] correctly install cuda_wrappers/bits/shared_ptr_base.h

2023-05-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

@qiongsiwu1 : I've updated the patch. PTAL.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151503/new/

https://reviews.llvm.org/D151503



[PATCH] D151503: [CUDA] correctly install cuda_wrappers/bits/shared_ptr_base.h

2023-05-30 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 526697.
tra added a comment.

Updated according to comments.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151503/new/

https://reviews.llvm.org/D151503

Files:
  clang/lib/Headers/CMakeLists.txt


Index: clang/lib/Headers/CMakeLists.txt
===================================================================
--- clang/lib/Headers/CMakeLists.txt
+++ clang/lib/Headers/CMakeLists.txt
@@ -267,6 +267,9 @@
   cuda_wrappers/cmath
   cuda_wrappers/complex
   cuda_wrappers/new
+)
+
+set(cuda_wrapper_bits_files
   cuda_wrappers/bits/shared_ptr_base.h
 )
 
@@ -328,7 +331,8 @@
 
 
 # Copy header files from the source directory to the build directory
-foreach( f ${files} ${cuda_wrapper_files} ${ppc_wrapper_files} 
${openmp_wrapper_files} ${hlsl_files})
+foreach( f ${files} ${cuda_wrapper_files} ${cuda_wrapper_bits_files}
+   ${ppc_wrapper_files} ${openmp_wrapper_files} ${hlsl_files})
   copy_header_to_output_dir(${CMAKE_CURRENT_SOURCE_DIR} ${f})
 endforeach( f )
 
@@ -429,7 +433,7 @@
 # Architecture/platform specific targets
 add_header_target("arm-resource-headers" 
"${arm_only_files};${arm_only_generated_files}")
 add_header_target("aarch64-resource-headers" 
"${aarch64_only_files};${aarch64_only_generated_files}")
-add_header_target("cuda-resource-headers" 
"${cuda_files};${cuda_wrapper_files}")
+add_header_target("cuda-resource-headers" 
"${cuda_files};${cuda_wrapper_files};${cuda_wrapper_bits_files}")
 add_header_target("hexagon-resource-headers" "${hexagon_files}")
 add_header_target("hip-resource-headers" "${hip_files}")
 add_header_target("loongarch-resource-headers" "${loongarch_files}")
@@ -463,6 +467,11 @@
   DESTINATION ${header_install_dir}/cuda_wrappers
   COMPONENT clang-resource-headers)
 
+install(
+  FILES ${cuda_wrapper_bits_files}
+  DESTINATION ${header_install_dir}/cuda_wrappers/bits
+  COMPONENT clang-resource-headers)
+
 install(
   FILES ${ppc_wrapper_files}
   DESTINATION ${header_install_dir}/ppc_wrappers
@@ -505,6 +514,12 @@
   EXCLUDE_FROM_ALL
   COMPONENT cuda-resource-headers)
 
+install(
+  FILES ${cuda_wrapper_bits_files}
+  DESTINATION ${header_install_dir}/cuda_wrappers/bits
+  EXCLUDE_FROM_ALL
+  COMPONENT cuda-resource-headers)
+
 install(
   FILES ${cuda_files}
   DESTINATION ${header_install_dir}




[PATCH] D151606: [NFC][CLANG] Fix Static Code Analyzer Concerns with bad bit right shift operation in getNVPTXLaneID()

2023-05-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In practice we're guaranteed by GPU architecture that the warp size will always 
be small enough to fit in 32 bits.

Also `Log2_32` will never return a value larger than 32.

Does this assert help with anything else other than potential undefined 
behavior?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151606/new/

https://reviews.llvm.org/D151606



[PATCH] D151349: [HIP] emit macro `__HIP_NO_IMAGE_SUPPORT`

2023-05-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/lib/Basic/Targets/AMDGPU.cpp:248
+  auto ISAVer = llvm::AMDGPU::getIsaVersion(Opts.CPU);
+  HasImage = ISAVer.Major != 9 || ISAVer.Minor != 4;
 }

My usual nit about negations: `!(ISAVer.Major == 9 && ISAVer.Minor == 4)` is 
easier to read.

Is ISA 9.4 the only version w/o image support? Or should it be a range 
comparison instead?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151349/new/

https://reviews.llvm.org/D151349



[PATCH] D151503: [CUDA] correctly install cuda_wrappers/bits/shared_ptr_base.h

2023-05-26 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/lib/Headers/CMakeLists.txt:516
   COMPONENT cuda-resource-headers)
 
 install(

qiongsiwu1 wrote:
> qiongsiwu1 wrote:
> > tra wrote:
> > > qiongsiwu1 wrote:
> > > > Do we need an install target for `${cuda_wrapper_bits_files}` for the 
> > > > `cuda-resource-headers` component as well? It seems to be the case 
> > > > because this patch is treating `${cuda_wrapper_bits_files}` as part of 
> > > > `cuda-resource-headers`.
> > > > 
> > > > ```
> > > > add_header_target("cuda-resource-headers" 
> > > > "${cuda_files};${cuda_wrapper_files};${cuda_wrapper_bits_files}")
> > > > ```
> > > > 
> > > > 
> > > I'm not sure I understand the question. Are you saying that a separate 
> > > `install()` for the 'bits' is not necessary and we could just install all 
> > > headers with a single `install` above?
> > > 
> > > If that's the case, then, AFAICT, the answer is that we do need a 
> > > separate `install`. 
> > > `install(FILES)` does not preserve the directory structure and dumps all 
> > > files listed in `FILES`, regardless if they are in different directories 
> > > into the same DESTINATION directory.
> > > That is exactly the problem this patch is intended to fix. We do need to 
> > > place the file under `cuda_wrappers/bits/` directory and that's why we 
> > > have separate `install(DESTINATION 
> > > ${header_install_dir}/cuda_wrappers/bits)` here.
> > > 
> > > `install(DIRECTORY)` would presumably preserve the source directory 
> > > structure, but we lose per-file granularity. It may work for the files 
> > > under cuda_wrappers for now, but I think there's some merit in explicitly 
> > > controlling which headers we ship and where we put them. While we do have 
> > > 1:1 mapping between the source tree and install tree, it may not always 
> > > be the case.
> > > 
> > > 
> > > 
> > Ah sorry for the confusion. 
> > 
> > > Are you saying that a separate install() for the 'bits' is not necessary 
> > > and we could just install all headers with a single install above?
> > 
> > No I am trying to say the opposite. I am suggesting we //add// the separate 
> > install target as a component of `clang-resource-headers` //and// as a 
> > component of `cuda-resource-headers`, as shown in the code change suggested 
> > in the comment above. I am not suggesting any code form this patch to be 
> > removed. The `cuda-resource-headers` can be used to install the cuda 
> related headers only, in the case when a user does not want to install all 
> the headers (e.g. if a user only wants to install support for Intel and 
> Nvidia headers, but not the PowerPC headers, the user can select 
> `core-resource-headers`, `x86_files` and `cuda-resource-headers` during a 
> distribution build/install). I think without the code change suggested 
> above, if a user selects to install `cuda-resource-headers` only without 
> > specifying `clang-resource-headers`, we will miss the file 
> > `cuda_wrappers/bits/shared_ptr_base.h`. 
> Sorry I made a typo in the previous comment. I meant `x86-resource-headers` 
> when I said `x86_files`. 
I think understand now.
`cmake -DCOMPONENT=cuda-resource-headers -P ./cmake_install.cmake` indeed does 
not install the bits component.

I've added the install with `COMPONENT clang-resource-headers` and verified 
that the bits header is installed during individual component installation.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151503/new/

https://reviews.llvm.org/D151503



[PATCH] D151503: [CUDA] correctly install cuda_wrappers/bits/shared_ptr_base.h

2023-05-26 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 526227.
tra added a comment.

Verified that install works correctly with
individual component installations:

  cmake -DCOMPONENT=cuda-resource-headers -P ./cmake_install.cmake
  cmake -DCOMPONENT=clang-resource-headers -P ./cmake_install.cmake


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151503/new/

https://reviews.llvm.org/D151503

Files:
  clang/lib/Headers/CMakeLists.txt


Index: clang/lib/Headers/CMakeLists.txt
===================================================================
--- clang/lib/Headers/CMakeLists.txt
+++ clang/lib/Headers/CMakeLists.txt
@@ -267,6 +267,9 @@
   cuda_wrappers/cmath
   cuda_wrappers/complex
   cuda_wrappers/new
+)
+
+set(cuda_wrapper_bits_files
   cuda_wrappers/bits/shared_ptr_base.h
 )
 
@@ -328,7 +331,8 @@
 
 
 # Copy header files from the source directory to the build directory
-foreach( f ${files} ${cuda_wrapper_files} ${ppc_wrapper_files} 
${openmp_wrapper_files} ${hlsl_files})
+foreach( f ${files} ${cuda_wrapper_files} ${cuda_wrapper_bits_files}
+   ${ppc_wrapper_files} ${openmp_wrapper_files} ${hlsl_files})
   copy_header_to_output_dir(${CMAKE_CURRENT_SOURCE_DIR} ${f})
 endforeach( f )
 
@@ -405,6 +409,7 @@
  "arm-resource-headers"
  "aarch64-resource-headers"
  "cuda-resource-headers"
+ "cuda-resource-bits-headers"
  "hexagon-resource-headers"
  "hip-resource-headers"
  "hlsl-resource-headers"
@@ -429,7 +434,8 @@
 # Architecture/platform specific targets
 add_header_target("arm-resource-headers" 
"${arm_only_files};${arm_only_generated_files}")
 add_header_target("aarch64-resource-headers" 
"${aarch64_only_files};${aarch64_only_generated_files}")
-add_header_target("cuda-resource-headers" 
"${cuda_files};${cuda_wrapper_files}")
+add_header_target("cuda-resource-headers" 
"${cuda_files};${cuda_wrapper_files};${cuda_wrapper_bits_files}")
+add_header_target("cuda-resource-bits-headers" "${cuda_wrapper_bits_files}")
 add_header_target("hexagon-resource-headers" "${hexagon_files}")
 add_header_target("hip-resource-headers" "${hip_files}")
 add_header_target("loongarch-resource-headers" "${loongarch_files}")
@@ -463,6 +469,11 @@
   DESTINATION ${header_install_dir}/cuda_wrappers
   COMPONENT clang-resource-headers)
 
+install(
+  FILES ${cuda_wrapper_bits_files}
+  DESTINATION ${header_install_dir}/cuda_wrappers/bits
+  COMPONENT clang-resource-headers)
+
 install(
   FILES ${ppc_wrapper_files}
   DESTINATION ${header_install_dir}/ppc_wrappers
@@ -505,6 +516,12 @@
   EXCLUDE_FROM_ALL
   COMPONENT cuda-resource-headers)
 
+install(
+  FILES ${cuda_wrapper_bits_files}
+  DESTINATION ${header_install_dir}/cuda_wrappers/bits
+  EXCLUDE_FROM_ALL
+  COMPONENT cuda-resource-headers)
+
 install(
   FILES ${cuda_files}
   DESTINATION ${header_install_dir}
@@ -650,6 +667,9 @@
   add_llvm_install_targets(install-cuda-resource-headers
DEPENDS cuda-resource-headers
COMPONENT cuda-resource-headers)
+  add_llvm_install_targets(install-cuda-resource-bits-headers
+   DEPENDS cuda-resource-bits-headers
+   COMPONENT cuda-resource-headers)
   add_llvm_install_targets(install-hexagon-resource-headers
DEPENDS hexagon-resource-headers
COMPONENT hexagon-resource-headers)



[PATCH] D144911: adding bf16 support to NVPTX

2023-05-26 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Here's a rough proof-of-concept patch coalescing i16/f16/bf16 to use the same 
Int16Regs register class: https://reviews.llvm.org/D151601

The changes are largely mechanical, replacing `%h` -> `%rs` in the tests and 
eliminating special cases we previously had for Float16Registers. I'll extend 
the patch to v2f16/Int32Regs next week.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144911: adding bf16 support to NVPTX

2023-05-26 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: llvm/include/llvm/IR/IntrinsicsNVVM.td:604
   def int_nvvm_f # operation # variant :
 ClangBuiltin,
 DefaultAttrsIntrinsic<[llvm_i16_ty], [llvm_i16_ty, llvm_i16_ty],

tra wrote:
> Availability of these new instructions is conditional on the specific CUDA 
> version and the GPU variant we're compiling for.
> Such builtins are normally implemented on the clang side as a 
> `TARGET_BUILTIN()` with appropriate constraints.
> 
> Without that, `ClangBuiltin` may automatically add enough glue to make them 
> available in clang unconditionally, which would result in the compiler 
> crashing if a user tries to use one of those builtins with the wrong GPU or 
> CUDA version. We want to emit a diagnostic, not cause a compiler crash.
> 
> Usually such related LLVM and clang changes should be part of the same patch.
> 
> This applies to the new intrinsic variants added below, too.
I do not think it is done. 

Can you check what happens if you try to call any of the bf16 builtins while 
compiling for sm_60? Ideally we should produce a sensible error saying that the 
builtin is not available.

I suspect we will fail in LLVM when we fail to lower the intrinsic, or in 
nvptx if we've managed to lower it to an instruction unsupported by sm_60.
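For reference, a minimal sketch of the clang-side gating being asked for (the builtin name and feature constraint below are illustrative, not the actual entries from this patch). A `TARGET_BUILTIN()` entry in `clang/include/clang/Basic/BuiltinsNVPTX.def` ties the builtin's availability to target features, so Sema reports an "unavailable builtin" diagnostic up front instead of handing the backend an intrinsic it cannot lower:

```cpp
// Hypothetical entry for BuiltinsNVPTX.def. The AND(SM_80, PTX70)
// constraint means the builtin is only available when compiling for
// sm_80 or newer with PTX ISA 7.0 or newer; any other target gets a
// clang diagnostic rather than a backend crash.
TARGET_BUILTIN(__nvvm_fmin_bf16, "UsUsUs", "", AND(SM_80, PTX70))
```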



Comment at: llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp:1297-1304
 if (EltVT == MVT::f16 && N->getValueType(0) == MVT::v2f16) {
   assert(NumElts % 2 == 0 && "Vector must have even number of elements");
   EltVT = MVT::v2f16;
   NumElts /= 2;
+} else if (EltVT == MVT::bf16 && N->getValueType(0) == MVT::v2bf16) {
+  assert(NumElts % 2 == 0 && "Vector must have even number of elements");
+  EltVT = MVT::v2bf16;

These could be collapsed into 
```
if ((EltVT == MVT::f16 && N->getValueType(0) == MVT::v2f16) || 
 (EltVT == MVT::bf16 && N->getValueType(0) == MVT::v2bf16) ) {
  assert(NumElts % 2 == 0 && "Vector must have even number of elements");
  EltVT = N->getValueType(0);
  NumElts /= 2;
}
```



Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:147-153
+  switch (VT.SimpleTy) {
+  default:
+return false;
+  case MVT::v2f16:
+  case MVT::v2bf16:
+return true;
+  }

It can be simplified to just `return (VT.SimpleTy == MVT::v2f16 || VT.SimpleTy 
== MVT::v2bf16);`




Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:156
+
+static bool Isf16Orbf16Type(MVT VT) {
+  switch (VT.SimpleTy) {

ditto.



Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:623
 
+  for (const auto  : {ISD::FADD, ISD::FMUL, ISD::FSUB, ISD::FMA}) {
+setBF16OperationAction(Op, MVT::bf16, Legal, Promote);

Fold it into the loop above.



Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:699
   }
+  for (const auto  : {ISD::FMINNUM, ISD::FMAXNUM}) {
+setBF16OperationAction(Op, MVT::bf16, GetMinMaxAction(Promote), Promote);

Fold into the loop processing `{ISD::FMINNUM, ISD::FMAXNUM}` above.

Also, do we want/need to add bf16 handling for `{ISD::FMINIMUM, ISD::FMAXIMUM}` 
too?

LLVM's choice of the constant names `FMINIMUM` vs `FMINNUM` is rather unfortunate -- 
it's so easy to misread one for the other.


Comment at: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:700-703
+setBF16OperationAction(Op, MVT::bf16, GetMinMaxAction(Promote), Promote);
+setBF16OperationAction(Op, MVT::v2bf16, GetMinMaxAction(Expand), Expand);
+setBF16OperationAction(Op, MVT::bf16, GetMinMaxAction(Expand), Expand);
+setBF16OperationAction(Op, MVT::v2bf16, GetMinMaxAction(Expand), Expand);

I'm not sure what's going on here. Should it be Promote for bf16 and Expand for 
v2bf16? Why do we have two other entries, one of them trying to Expand bf16?



Comment at: llvm/lib/Target/NVPTX/NVPTXRegisterInfo.td:65-66
+def Float16x2Regs : NVPTXRegClass<[v2f16], 32, (add (sequence "HH%u", 0, 4))>;
+def BFloat16Regs : NVPTXRegClass<[bf16], 16, (add (sequence "H%u", 0, 4))>;
+def BFloat16x2Regs : NVPTXRegClass<[v2bf16], 32, (add (sequence "HH%u", 0, 4))>;
 def Float32Regs : NVPTXRegClass<[f32], 32, (add (sequence "F%u", 0, 4))>;

I suspect this may be a problem.

What PTX do we end up generating if we have a function that needs to use both 
f16 and bf16 registers? I suspect we may end up with defining conflicting sets 
of registers.

I still do not think that we need a separate register class for bf16; both 
bf16 and fp16 should be using generic opaque 16/32-bit register types (or, 
even better, generic Int16/Int32 registers).

RegClass accepts multiple type values, so it may be as simple as using `def 
Int16Regs : NVPTXRegClass<[i16,f16,bf16], 16, (add (sequence "RS%u", 0, 4))>;` 
and adjusting existing use cases.

That should probably be done as a separate 

[PATCH] D151503: [CUDA] correctly install cuda_wrappers/bits/shared_ptr_base.h

2023-05-26 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/lib/Headers/CMakeLists.txt:516
   COMPONENT cuda-resource-headers)
 
 install(

qiongsiwu1 wrote:
> Do we need an install target for `${cuda_wrapper_bits_files}` for the 
> `cuda-resource-headers` component as well? It seems to be the case because 
> this patch is treating `${cuda_wrapper_bits_files}` as part of 
> `cuda-resource-headers`.
> 
> ```
> add_header_target("cuda-resource-headers" 
> "${cuda_files};${cuda_wrapper_files};${cuda_wrapper_bits_files}")
> ```
> 
> 
I'm not sure I understand the question. Are you saying that a separate 
`install()` for the 'bits' is not necessary and we could just install all 
headers with a single `install` above?

If that's the case, then, AFAICT, the answer is that we do need a separate 
`install`. 
`install(FILES)` does not preserve the directory structure: it dumps all files 
listed in `FILES` into the same DESTINATION directory, regardless of whether 
they come from different source directories.
That is exactly the problem this patch is intended to fix. We do need to place 
the file under the `cuda_wrappers/bits/` directory, and that's why we have a 
separate `install(DESTINATION ${header_install_dir}/cuda_wrappers/bits)` here.
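A minimal CMake sketch of the flattening behavior described above (the file names and destination below are hypothetical, not taken from this patch):

```cmake
# install(FILES) flattens: both headers land side by side in the same
# DESTINATION, so bits/extra.h silently loses its bits/ prefix.
install(FILES wrappers/top.h wrappers/bits/extra.h
        DESTINATION include/wrappers)

# Preserving the layout requires a second install() with an explicit
# per-subdirectory DESTINATION, mirroring what this patch does for
# cuda_wrappers/bits/.
install(FILES wrappers/bits/extra.h
        DESTINATION include/wrappers/bits)
```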

`install(DIRECTORY)` would presumably preserve the source directory structure, 
but we lose per-file granularity. It may work for the files under cuda_wrappers 
for now, but I think there's some merit in explicitly controlling which headers 
we ship and where we put them. While we do have a 1:1 mapping between the 
source tree and the install tree today, that may not always be the case.





Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151503/new/

https://reviews.llvm.org/D151503



[PATCH] D151503: [CUDA] correctly install cuda_wrappers/bits/shared_ptr_base.h

2023-05-25 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
Herald added subscribers: mattd, carlosgalvezp, bixia, yaxunl.
Herald added a project: All.
tra edited the summary of this revision.
tra edited the summary of this revision.
tra published this revision for review.
tra added reviewers: qiongsiwu1, jlebar.
Herald added a reviewer: jdoerfert.
Herald added subscribers: cfe-commits, jplehr, sstefan1.
Herald added a project: clang.

The file must go under cuda_wrappers/bits/, but was copied
directly into cuda_wrappers/ during installation.

https://github.com/llvm/llvm-project/issues/62939


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D151503

Files:
  clang/lib/Headers/CMakeLists.txt


Index: clang/lib/Headers/CMakeLists.txt
===
--- clang/lib/Headers/CMakeLists.txt
+++ clang/lib/Headers/CMakeLists.txt
@@ -267,6 +267,9 @@
   cuda_wrappers/cmath
   cuda_wrappers/complex
   cuda_wrappers/new
+)
+
+set(cuda_wrapper_bits_files
   cuda_wrappers/bits/shared_ptr_base.h
 )
 
@@ -328,7 +331,8 @@
 
 
 # Copy header files from the source directory to the build directory
-foreach( f ${files} ${cuda_wrapper_files} ${ppc_wrapper_files} ${openmp_wrapper_files} ${hlsl_files})
+foreach( f ${files} ${cuda_wrapper_files} ${cuda_wrapper_bits_files}
+   ${ppc_wrapper_files} ${openmp_wrapper_files} ${hlsl_files})
   copy_header_to_output_dir(${CMAKE_CURRENT_SOURCE_DIR} ${f})
 endforeach( f )
 
@@ -429,7 +433,7 @@
 # Architecture/platform specific targets
 add_header_target("arm-resource-headers" "${arm_only_files};${arm_only_generated_files}")
 add_header_target("aarch64-resource-headers" "${aarch64_only_files};${aarch64_only_generated_files}")
-add_header_target("cuda-resource-headers" "${cuda_files};${cuda_wrapper_files}")
+add_header_target("cuda-resource-headers" "${cuda_files};${cuda_wrapper_files};${cuda_wrapper_bits_files}")
 add_header_target("hexagon-resource-headers" "${hexagon_files}")
 add_header_target("hip-resource-headers" "${hip_files}")
 add_header_target("loongarch-resource-headers" "${loongarch_files}")
@@ -463,6 +467,11 @@
   DESTINATION ${header_install_dir}/cuda_wrappers
   COMPONENT clang-resource-headers)
 
+install(
+  FILES ${cuda_wrapper_bits_files}
+  DESTINATION ${header_install_dir}/cuda_wrappers/bits
+  COMPONENT clang-resource-headers)
+
 install(
   FILES ${ppc_wrapper_files}
   DESTINATION ${header_install_dir}/ppc_wrappers




[PATCH] D151362: [CUDA] Add CUDA wrappers over clang builtins for sm_90.

2023-05-25 Thread Artem Belevich via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rG5c082e7e15e3: [CUDA] Add CUDA wrappers over clang builtins 
for sm_90. (authored by tra).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151362/new/

https://reviews.llvm.org/D151362

Files:
  clang/lib/Headers/__clang_cuda_intrinsics.h

Index: clang/lib/Headers/__clang_cuda_intrinsics.h
===
--- clang/lib/Headers/__clang_cuda_intrinsics.h
+++ clang/lib/Headers/__clang_cuda_intrinsics.h
@@ -577,6 +577,133 @@
 }
 #endif // !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800
 
+#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 900
+__device__ inline unsigned __isCtaShared(const void *ptr) {
+  return __isShared(ptr);
+}
+
+__device__ inline unsigned __isClusterShared(const void *__ptr) {
+  return __nvvm_isspacep_shared_cluster(__ptr);
+}
+
+__device__ inline void *__cluster_map_shared_rank(const void *__ptr,
+  unsigned __rank) {
+  return __nvvm_mapa((void *)__ptr, __rank);
+}
+
+__device__ inline unsigned __cluster_query_shared_rank(const void *__ptr) {
+  return __nvvm_getctarank((void *)__ptr);
+}
+
+__device__ inline uint2
+__cluster_map_shared_multicast(const void *__ptr,
+   unsigned int __cluster_cta_mask) {
+  return make_uint2((unsigned)__cvta_generic_to_shared(__ptr),
+__cluster_cta_mask);
+}
+
+__device__ inline unsigned __clusterDimIsSpecified() {
+  return __nvvm_is_explicit_cluster();
+}
+
+__device__ inline dim3 __clusterDim() {
+  return {__nvvm_read_ptx_sreg_cluster_nctaid_x(),
+  __nvvm_read_ptx_sreg_cluster_nctaid_y(),
+  __nvvm_read_ptx_sreg_cluster_nctaid_z()};
+}
+
+__device__ inline dim3 __clusterRelativeBlockIdx() {
+  return {__nvvm_read_ptx_sreg_cluster_ctaid_x(),
+  __nvvm_read_ptx_sreg_cluster_ctaid_y(),
+  __nvvm_read_ptx_sreg_cluster_ctaid_z()};
+}
+
+__device__ inline dim3 __clusterGridDimInClusters() {
+  return {__nvvm_read_ptx_sreg_nclusterid_x(),
+  __nvvm_read_ptx_sreg_nclusterid_y(),
+  __nvvm_read_ptx_sreg_nclusterid_z()};
+}
+
+__device__ inline dim3 __clusterIdx() {
+  return {__nvvm_read_ptx_sreg_clusterid_x(),
+  __nvvm_read_ptx_sreg_clusterid_y(),
+  __nvvm_read_ptx_sreg_clusterid_z()};
+}
+
+__device__ inline unsigned __clusterRelativeBlockRank() {
+  return __nvvm_read_ptx_sreg_cluster_ctarank();
+}
+
+__device__ inline unsigned __clusterSizeInBlocks() {
+  return __nvvm_read_ptx_sreg_cluster_nctarank();
+}
+
+__device__ inline void __cluster_barrier_arrive() {
+  __nvvm_barrier_cluster_arrive();
+}
+
+__device__ inline void __cluster_barrier_arrive_relaxed() {
+  __nvvm_barrier_cluster_arrive_relaxed();
+}
+
+__device__ inline void __cluster_barrier_wait() {
+  __nvvm_barrier_cluster_wait();
+}
+
+__device__ inline void __threadfence_cluster() { __nvvm_fence_sc_cluster(); }
+
+__device__ inline float2 atomicAdd(float2 *__ptr, float2 __val) {
+  float2 __ret;
+  __asm__("atom.add.v2.f32 {%0, %1}, [%2], {%3, %4};"
+  : "=f"(__ret.x), "=f"(__ret.y)
+  : "l"(__ptr), "f"(__val.x), "f"(__val.y));
+  return __ret;
+}
+
+__device__ inline float2 atomicAdd_block(float2 *__ptr, float2 __val) {
+  float2 __ret;
+  __asm__("atom.cta.add.v2.f32 {%0, %1}, [%2], {%3, %4};"
+  : "=f"(__ret.x), "=f"(__ret.y)
+  : "l"(__ptr), "f"(__val.x), "f"(__val.y));
+  return __ret;
+}
+
+__device__ inline float2 atomicAdd_system(float2 *__ptr, float2 __val) {
+  float2 __ret;
+  __asm__("atom.sys.add.v2.f32 {%0, %1}, [%2], {%3, %4};"
+  : "=f"(__ret.x), "=f"(__ret.y)
+  : "l"(__ptr), "f"(__val.x), "f"(__val.y));
+  return __ret;
+}
+
+__device__ inline float4 atomicAdd(float4 *__ptr, float4 __val) {
+  float4 __ret;
+  __asm__("atom.add.v4.f32 {%0, %1, %2, %3}, [%4], {%5, %6, %7, %8};"
+  : "=f"(__ret.x), "=f"(__ret.y), "=f"(__ret.z), "=f"(__ret.w)
+  : "l"(__ptr), "f"(__val.x), "f"(__val.y), "f"(__val.z), "f"(__val.w));
+  return __ret;
+}
+
+__device__ inline float4 atomicAdd_block(float4 *__ptr, float4 __val) {
+  float4 __ret;
+  __asm__(
+  "atom.cta.add.v4.f32 {%0, %1, %2, %3}, [%4], {%5, %6, %7, %8};"
+  : "=f"(__ret.x), "=f"(__ret.y), "=f"(__ret.z), "=f"(__ret.w)
+  : "l"(__ptr), "f"(__val.x), "f"(__val.y), "f"(__val.z), "f"(__val.w));
+  return __ret;
+}
+
+__device__ inline float4 atomicAdd_system(float4 *__ptr, float4 __val) {
+  float4 __ret;
+  __asm__(
+  "atom.sys.add.v4.f32 {%0, %1, %2, %3}, [%4], {%5, %6, %7, %8};"
+  : "=f"(__ret.x), "=f"(__ret.y), "=f"(__ret.z), "=f"(__ret.w)
+  : "l"(__ptr), "f"(__val.x), "f"(__val.y), "f"(__val.z), "f"(__val.w)
+  :);
+  return __ret;
+}
+
+#endif // !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 900
 #endif // CUDA_VERSION >= 11000
 
 #endif // 

[PATCH] D151363: [NVPTX, CUDA] barrier intrinsics and builtins for sm_90

2023-05-25 Thread Artem Belevich via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rG25708b3df6e3: [NVPTX, CUDA] barrier intrinsics and builtins 
for sm_90 (authored by tra).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151363/new/

https://reviews.llvm.org/D151363

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/test/CodeGenCUDA/builtins-sm90.cu
  llvm/include/llvm/IR/IntrinsicsNVVM.td
  llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
  llvm/test/CodeGen/NVPTX/intrinsics-sm90.ll

Index: llvm/test/CodeGen/NVPTX/intrinsics-sm90.ll
===
--- llvm/test/CodeGen/NVPTX/intrinsics-sm90.ll
+++ llvm/test/CodeGen/NVPTX/intrinsics-sm90.ll
@@ -1,5 +1,5 @@
-; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx78| FileCheck --check-prefixes=CHECK %s
-; RUN: %if ptxas-11.8 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx78| %ptxas-verify -arch=sm_90 %}
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK %s
+; RUN: %if ptxas-11.8 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %}
 
 ; CHECK-LABEL: test_isspacep
 define i1 @test_isspacep_shared_cluster(ptr %p) {
@@ -120,6 +120,19 @@
 ret i1 %x
 }
 
+; CHECK-LABEL: test_barrier_cluster(
+define void @test_barrier_cluster() {
+; CHECK: barrier.cluster.arrive;
+   call void @llvm.nvvm.barrier.cluster.arrive()
+; CHECK: barrier.cluster.arrive.relaxed;
+   call void @llvm.nvvm.barrier.cluster.arrive.relaxed()
+; CHECK: barrier.cluster.wait;
+   call void @llvm.nvvm.barrier.cluster.wait()
+; CHECK: fence.sc.cluster
+   call void @llvm.nvvm.fence.sc.cluster()
+   ret void
+}
+
 
 declare i1 @llvm.nvvm.isspacep.shared.cluster(ptr %p);
 declare ptr @llvm.nvvm.mapa(ptr %p, i32 %r);
@@ -137,3 +150,7 @@
 declare i32 @llvm.nvvm.read.ptx.sreg.cluster.ctarank()
 declare i32 @llvm.nvvm.read.ptx.sreg.cluster.nctarank()
 declare i1 @llvm.nvvm.is_explicit_cluster()
+declare void @llvm.nvvm.barrier.cluster.arrive()
+declare void @llvm.nvvm.barrier.cluster.arrive.relaxed()
+declare void @llvm.nvvm.barrier.cluster.wait()
+declare void @llvm.nvvm.fence.sc.cluster()
Index: llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
===
--- llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -132,6 +132,18 @@
  "barrier.sync \t$id, $cnt;",
  [(int_nvvm_barrier_sync_cnt imm:$id, imm:$cnt)]>,
 Requires<[hasPTX<60>, hasSM<30>]>;
+class INT_BARRIER_CLUSTER<string variant, Intrinsic Intr,
+                          list<Predicate> Preds = [hasPTX<78>, hasSM<90>]>:
+    NVPTXInst<(outs), (ins), "barrier.cluster."# variant #";", [(Intr)]>,
+    Requires<Preds>;
+
+def barrier_cluster_arrive:
+INT_BARRIER_CLUSTER<"arrive", int_nvvm_barrier_cluster_arrive>;
+def barrier_cluster_arrive_relaxed:
+INT_BARRIER_CLUSTER<"arrive.relaxed",
+int_nvvm_barrier_cluster_arrive_relaxed, [hasPTX<80>, hasSM<90>]>;
+def barrier_cluster_wait:
+INT_BARRIER_CLUSTER<"wait", int_nvvm_barrier_cluster_wait>;
 
 class SHFL_INSTR
@@ -303,6 +315,9 @@
 def INT_MEMBAR_GL  : MEMBAR<"membar.gl;",  int_nvvm_membar_gl>;
 def INT_MEMBAR_SYS : MEMBAR<"membar.sys;", int_nvvm_membar_sys>;
 
+def INT_FENCE_SC_CLUSTER:
+   MEMBAR<"fence.sc.cluster;", int_nvvm_fence_sc_cluster>,
+   Requires<[hasPTX<78>, hasSM<90>]>;
 
 //---
 // Async Copy Functions
Index: llvm/include/llvm/IR/IntrinsicsNVVM.td
===
--- llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1358,6 +1358,14 @@
   Intrinsic<[], [llvm_i32_ty, llvm_i32_ty], [IntrConvergent, IntrNoCallback]>,
   ClangBuiltin<"__nvvm_barrier_sync_cnt">;
 
+  // barrier.cluster.[wait, arrive, arrive.relaxed]
+  def int_nvvm_barrier_cluster_arrive :
+  Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
+  def int_nvvm_barrier_cluster_arrive_relaxed :
+  Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
+  def int_nvvm_barrier_cluster_wait :
+  Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
+
   // Membar
   def int_nvvm_membar_cta : ClangBuiltin<"__nvvm_membar_cta">,
   Intrinsic<[], [], [IntrNoCallback]>;
@@ -1365,6 +1373,8 @@
   Intrinsic<[], [], [IntrNoCallback]>;
   def int_nvvm_membar_sys : ClangBuiltin<"__nvvm_membar_sys">,
   Intrinsic<[], [], [IntrNoCallback]>;
+  def int_nvvm_fence_sc_cluster:
+  Intrinsic<[], [], [IntrNoCallback]>;
 
 // Async Copy
 def int_nvvm_cp_async_mbarrier_arrive :
Index: clang/test/CodeGenCUDA/builtins-sm90.cu
===
--- clang/test/CodeGenCUDA/builtins-sm90.cu
+++ clang/test/CodeGenCUDA/builtins-sm90.cu
@@ -1,4 +1,4 @@
-// RUN: %clang_cc1 "-triple" "nvptx64-nvidia-cuda" "-target-feature" 

[PATCH] D151168: [CUDA] plumb through new sm_90-specific builtins.

2023-05-25 Thread Artem Belevich via Phabricator via cfe-commits
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
Closed by commit rG0a0bae1e9f94: [CUDA] plumb through new sm_90-specific 
builtins. (authored by tra).

Changed prior to commit:
  https://reviews.llvm.org/D151168?vs=524516=525737#toc

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151168/new/

https://reviews.llvm.org/D151168

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/test/CodeGenCUDA/builtins-sm90.cu

Index: clang/test/CodeGenCUDA/builtins-sm90.cu
===
--- /dev/null
+++ clang/test/CodeGenCUDA/builtins-sm90.cu
@@ -0,0 +1,61 @@
+// RUN: %clang_cc1 "-triple" "nvptx64-nvidia-cuda" "-target-feature" "+ptx78" "-target-cpu" "sm_90" -emit-llvm -fcuda-is-device -o - %s | FileCheck %s
+
+// CHECK: define{{.*}} void @_Z6kernelPlPvj(
+__attribute__((global)) void kernel(long *out, void *ptr, unsigned u) {
+  int i = 0;
+  // CHECK: call i1 @llvm.nvvm.isspacep.shared.cluster
+  out[i++] = __nvvm_isspacep_shared_cluster(ptr);
+
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.clusterid.x()
+  out[i++] = __nvvm_read_ptx_sreg_clusterid_x();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.clusterid.y()
+  out[i++] = __nvvm_read_ptx_sreg_clusterid_y();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.clusterid.z()
+  out[i++] = __nvvm_read_ptx_sreg_clusterid_z();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.clusterid.w()
+  out[i++] = __nvvm_read_ptx_sreg_clusterid_w();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.nclusterid.x()
+  out[i++] = __nvvm_read_ptx_sreg_nclusterid_x();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.nclusterid.y()
+  out[i++] = __nvvm_read_ptx_sreg_nclusterid_y();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.nclusterid.z()
+  out[i++] = __nvvm_read_ptx_sreg_nclusterid_z();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.nclusterid.w()
+  out[i++] = __nvvm_read_ptx_sreg_nclusterid_w();
+
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.cluster.ctaid.x()
+  out[i++] = __nvvm_read_ptx_sreg_cluster_ctaid_x();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.cluster.ctaid.y()
+  out[i++] = __nvvm_read_ptx_sreg_cluster_ctaid_y();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.cluster.ctaid.z()
+  out[i++] = __nvvm_read_ptx_sreg_cluster_ctaid_z();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.cluster.ctaid.w()
+  out[i++] = __nvvm_read_ptx_sreg_cluster_ctaid_w();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.cluster.nctaid.x()
+  out[i++] = __nvvm_read_ptx_sreg_cluster_nctaid_x();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.cluster.nctaid.y()
+  out[i++] = __nvvm_read_ptx_sreg_cluster_nctaid_y();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.cluster.nctaid.z()
+  out[i++] = __nvvm_read_ptx_sreg_cluster_nctaid_z();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.cluster.nctaid.w()
+  out[i++] = __nvvm_read_ptx_sreg_cluster_nctaid_w();
+
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.cluster.ctarank()
+  out[i++] = __nvvm_read_ptx_sreg_cluster_ctarank();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.cluster.nctarank()
+  out[i++] = __nvvm_read_ptx_sreg_cluster_nctarank();
+  // CHECK: call i1 @llvm.nvvm.is_explicit_cluster()
+  out[i++] = __nvvm_is_explicit_cluster();
+
+  auto * sptr = (__attribute__((address_space(3))) void *)ptr;
+  // CHECK: call ptr @llvm.nvvm.mapa(ptr %{{.*}}, i32 %{{.*}})
+  out[i++] = (long) __nvvm_mapa(ptr, u);
+  // CHECK: call ptr addrspace(3) @llvm.nvvm.mapa.shared.cluster(ptr addrspace(3) %{{.*}}, i32 %{{.*}})
+  out[i++] = (long) __nvvm_mapa_shared_cluster(sptr, u);
+  // CHECK: call i32 @llvm.nvvm.getctarank(ptr {{.*}})
+  out[i++] = __nvvm_getctarank(ptr);
+  // CHECK: call i32 @llvm.nvvm.getctarank.shared.cluster(ptr addrspace(3) {{.*}})
+  out[i++] = __nvvm_getctarank_shared_cluster(sptr);
+
+  // CHECK: ret void
+}
Index: clang/lib/CodeGen/CGBuiltin.cpp
===
--- clang/lib/CodeGen/CGBuiltin.cpp
+++ clang/lib/CodeGen/CGBuiltin.cpp
@@ -18885,6 +18885,83 @@
 return MakeCpAsync(Intrinsic::nvvm_cp_async_cg_shared_global_16,
Intrinsic::nvvm_cp_async_cg_shared_global_16_s, *this, E,
16);
+  case NVPTX::BI__nvvm_read_ptx_sreg_clusterid_x:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_read_ptx_sreg_clusterid_x));
+  case NVPTX::BI__nvvm_read_ptx_sreg_clusterid_y:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_read_ptx_sreg_clusterid_y));
+  case NVPTX::BI__nvvm_read_ptx_sreg_clusterid_z:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_read_ptx_sreg_clusterid_z));
+  case NVPTX::BI__nvvm_read_ptx_sreg_clusterid_w:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_read_ptx_sreg_clusterid_w));
+  case 

[PATCH] D151361: [CUDA] bump supported CUDA version to 12.1/11.8

2023-05-25 Thread Artem Belevich via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rGffb635cb2d4e: [CUDA] bump supported CUDA version to 
12.1/11.8 (authored by tra).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151361/new/

https://reviews.llvm.org/D151361

Files:
  clang/docs/ReleaseNotes.rst
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/include/clang/Basic/Cuda.h
  clang/lib/Basic/Cuda.cpp
  clang/lib/Driver/ToolChains/Cuda.cpp
  llvm/lib/Target/NVPTX/NVPTX.td

Index: llvm/lib/Target/NVPTX/NVPTX.td
===
--- llvm/lib/Target/NVPTX/NVPTX.td
+++ llvm/lib/Target/NVPTX/NVPTX.td
@@ -24,89 +24,22 @@
 //   TableGen in NVPTXGenSubtarget.inc.
 //===--===//
 
-// SM Versions
-def SM20 : SubtargetFeature<"sm_20", "SmVersion", "20",
-"Target SM 2.0">;
-def SM21 : SubtargetFeature<"sm_21", "SmVersion", "21",
-"Target SM 2.1">;
-def SM30 : SubtargetFeature<"sm_30", "SmVersion", "30",
-"Target SM 3.0">;
-def SM32 : SubtargetFeature<"sm_32", "SmVersion", "32",
-"Target SM 3.2">;
-def SM35 : SubtargetFeature<"sm_35", "SmVersion", "35",
-"Target SM 3.5">;
-def SM37 : SubtargetFeature<"sm_37", "SmVersion", "37",
-"Target SM 3.7">;
-def SM50 : SubtargetFeature<"sm_50", "SmVersion", "50",
-"Target SM 5.0">;
-def SM52 : SubtargetFeature<"sm_52", "SmVersion", "52",
-"Target SM 5.2">;
-def SM53 : SubtargetFeature<"sm_53", "SmVersion", "53",
-"Target SM 5.3">;
-def SM60 : SubtargetFeature<"sm_60", "SmVersion", "60",
- "Target SM 6.0">;
-def SM61 : SubtargetFeature<"sm_61", "SmVersion", "61",
- "Target SM 6.1">;
-def SM62 : SubtargetFeature<"sm_62", "SmVersion", "62",
- "Target SM 6.2">;
-def SM70 : SubtargetFeature<"sm_70", "SmVersion", "70",
- "Target SM 7.0">;
-def SM72 : SubtargetFeature<"sm_72", "SmVersion", "72",
- "Target SM 7.2">;
-def SM75 : SubtargetFeature<"sm_75", "SmVersion", "75",
- "Target SM 7.5">;
-def SM80 : SubtargetFeature<"sm_80", "SmVersion", "80",
- "Target SM 8.0">;
-def SM86 : SubtargetFeature<"sm_86", "SmVersion", "86",
- "Target SM 8.6">;
-def SM87 : SubtargetFeature<"sm_87", "SmVersion", "87",
- "Target SM 8.7">;
-def SM89 : SubtargetFeature<"sm_89", "SmVersion", "89",
- "Target SM 8.9">;
-def SM90 : SubtargetFeature<"sm_90", "SmVersion", "90",
- "Target SM 9.0">;
+class FeatureSM<string version>:
+   SubtargetFeature<"sm_"# version, "SmVersion",
+"" # version,
+"Target SM " # version>;
+class FeaturePTX<string version>:
+   SubtargetFeature<"ptx"# version, "PTXVersion",
+"" # version,
+"Use PTX version " # version>;
 
-// PTX Versions
-def PTX32 : SubtargetFeature<"ptx32", "PTXVersion", "32",
- "Use PTX version 3.2">;
-def PTX40 : SubtargetFeature<"ptx40", "PTXVersion", "40",
- "Use PTX version 4.0">;
-def PTX41 : SubtargetFeature<"ptx41", "PTXVersion", "41",
- "Use PTX version 4.1">;
-def PTX42 : SubtargetFeature<"ptx42", "PTXVersion", "42",
- "Use PTX version 4.2">;
-def PTX43 : SubtargetFeature<"ptx43", "PTXVersion", "43",
- "Use PTX version 4.3">;
-def PTX50 : SubtargetFeature<"ptx50", "PTXVersion", "50",
- "Use PTX version 5.0">;
-def PTX60 : SubtargetFeature<"ptx60", "PTXVersion", "60",
- "Use PTX version 6.0">;
-def PTX61 : SubtargetFeature<"ptx61", "PTXVersion", "61",
- "Use PTX version 6.1">;
-def PTX63 : SubtargetFeature<"ptx63", "PTXVersion", "63",
- "Use PTX version 6.3">;
-def PTX64 : SubtargetFeature<"ptx64", "PTXVersion", "64",
- "Use PTX version 6.4">;
-def PTX65 : SubtargetFeature<"ptx65", "PTXVersion", "65",
- "Use PTX version 6.5">;
-def PTX70 : SubtargetFeature<"ptx70", "PTXVersion", "70",
- "Use PTX version 7.0">;
-def PTX71 : SubtargetFeature<"ptx71", "PTXVersion", "71",
- "Use PTX version 7.1">;
-def PTX72 : SubtargetFeature<"ptx72", "PTXVersion", "72",
- "Use PTX version 7.2">;
-def PTX73 : SubtargetFeature<"ptx73", "PTXVersion", "73",
- "Use PTX version 7.3">;
-def PTX74 : 

[PATCH] D151359: [CUDA] Relax restrictions on variadics in host-side compilation.

2023-05-25 Thread Artem Belevich via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rG0ad5d40fa19f: [CUDA] Relax restrictions on variadics in 
host-side compilation. (authored by tra).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151359/new/

https://reviews.llvm.org/D151359

Files:
  clang/lib/Driver/ToolChains/Clang.cpp


Index: clang/lib/Driver/ToolChains/Clang.cpp
===
--- clang/lib/Driver/ToolChains/Clang.cpp
+++ clang/lib/Driver/ToolChains/Clang.cpp
@@ -4677,6 +4677,13 @@
   CmdArgs.push_back(Args.MakeArgString(
   Twine("-target-sdk-version=") +
   CudaVersionToString(CTC->CudaInstallation.version())));
+// Unsized function arguments used for variadics were introduced in
+// CUDA-9.0. We still do not support generating code that actually uses
+// variadic arguments yet, but we do need to allow parsing them as
+// recent CUDA headers rely on that.
+// https://github.com/llvm/llvm-project/issues/58410
+if (CTC->CudaInstallation.version() >= CudaVersion::CUDA_90)
+  CmdArgs.push_back("-fcuda-allow-variadic-functions");
   }
 }
 CmdArgs.push_back("-aux-triple");




[PATCH] D151362: [CUDA] Add CUDA wrappers over clang builtins for sm_90.

2023-05-24 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
Herald added subscribers: mattd, bixia, yaxunl.
Herald added a project: All.
tra updated this revision to Diff 525338.
tra added a comment.
tra updated this revision to Diff 525340.
tra published this revision for review.
tra added a reviewer: jlebar.
Herald added a project: clang.
Herald added a subscriber: cfe-commits.

Added vectorized fp32 atomic add.


tra added a comment.

clang-format changes.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D151362

Files:
  clang/lib/Headers/__clang_cuda_intrinsics.h

Index: clang/lib/Headers/__clang_cuda_intrinsics.h
===
--- clang/lib/Headers/__clang_cuda_intrinsics.h
+++ clang/lib/Headers/__clang_cuda_intrinsics.h
@@ -577,6 +577,133 @@
 }
 #endif // !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800
 
+#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 900
+__device__ inline unsigned __isCtaShared(const void *ptr) {
+  return __isShared(ptr);
+}
+
+__device__ inline unsigned __isClusterShared(const void *__ptr) {
+  return __nvvm_isspacep_shared_cluster(__ptr);
+}
+
+__device__ inline void *__cluster_map_shared_rank(const void *__ptr,
+  unsigned __rank) {
+  return __nvvm_mapa((void *)__ptr, __rank);
+}
+
+__device__ inline unsigned __cluster_query_shared_rank(const void *__ptr) {
+  return __nvvm_getctarank((void *)__ptr);
+}
+
+__device__ inline uint2
+__cluster_map_shared_multicast(const void *__ptr,
+   unsigned int __cluster_cta_mask) {
+  return make_uint2((unsigned)__cvta_generic_to_shared(__ptr),
+__cluster_cta_mask);
+}
+
+__device__ inline unsigned __clusterDimIsSpecified() {
+  return __nvvm_is_explicit_cluster();
+}
+
+__device__ inline dim3 __clusterDim() {
+  return {__nvvm_read_ptx_sreg_cluster_nctaid_x(),
+  __nvvm_read_ptx_sreg_cluster_nctaid_y(),
+  __nvvm_read_ptx_sreg_cluster_nctaid_z()};
+}
+
+__device__ inline dim3 __clusterRelativeBlockIdx() {
+  return {__nvvm_read_ptx_sreg_cluster_ctaid_x(),
+  __nvvm_read_ptx_sreg_cluster_ctaid_y(),
+  __nvvm_read_ptx_sreg_cluster_ctaid_z()};
+}
+
+__device__ inline dim3 __clusterGridDimInClusters() {
+  return {__nvvm_read_ptx_sreg_nclusterid_x(),
+  __nvvm_read_ptx_sreg_nclusterid_y(),
+  __nvvm_read_ptx_sreg_nclusterid_z()};
+}
+
+__device__ inline dim3 __clusterIdx() {
+  return {__nvvm_read_ptx_sreg_clusterid_x(),
+  __nvvm_read_ptx_sreg_clusterid_y(),
+  __nvvm_read_ptx_sreg_clusterid_z()};
+}
+
+__device__ inline unsigned __clusterRelativeBlockRank() {
+  return __nvvm_read_ptx_sreg_cluster_ctarank();
+}
+
+__device__ inline unsigned __clusterSizeInBlocks() {
+  return __nvvm_read_ptx_sreg_cluster_nctarank();
+}
+
+__device__ inline void __cluster_barrier_arrive() {
+  __nvvm_barrier_cluster_arrive();
+}
+
+__device__ inline void __cluster_barrier_arrive_relaxed() {
+  __nvvm_barrier_cluster_arrive_relaxed();
+}
+
+__device__ inline void __cluster_barrier_wait() {
+  __nvvm_barrier_cluster_wait();
+}
+
+__device__ inline void __threadfence_cluster() { __nvvm_fence_sc_cluster(); }
+
+__device__ inline float2 atomicAdd(float2 *__ptr, float2 __val) {
+  float2 __ret;
+  __asm__("atom.add.v2.f32 {%0, %1}, [%2], {%3, %4};"
+  : "=f"(__ret.x), "=f"(__ret.y)
+  : "l"(__ptr), "f"(__val.x), "f"(__val.y));
+  return __ret;
+}
+
+__device__ inline float2 atomicAdd_block(float2 *__ptr, float2 __val) {
+  float2 __ret;
+  __asm__("atom.cta.add.v2.f32 {%0, %1}, [%2], {%3, %4};"
+  : "=f"(__ret.x), "=f"(__ret.y)
+  : "l"(__ptr), "f"(__val.x), "f"(__val.y));
+  return __ret;
+}
+
+__device__ inline float2 atomicAdd_system(float2 *__ptr, float2 __val) {
+  float2 __ret;
+  __asm__("atom.sys.add.v2.f32 {%0, %1}, [%2], {%3, %4};"
+  : "=f"(__ret.x), "=f"(__ret.y)
+  : "l"(__ptr), "f"(__val.x), "f"(__val.y));
+  return __ret;
+}
+
+__device__ inline float4 atomicAdd(float4 *__ptr, float4 __val) {
+  float4 __ret;
+  __asm__("atom.add.v4.f32 {%0, %1, %2, %3}, [%4], {%5, %6, %7, %8};"
+  : "=f"(__ret.x), "=f"(__ret.y), "=f"(__ret.z), "=f"(__ret.w)
+  : "l"(__ptr), "f"(__val.x), "f"(__val.y), "f"(__val.z), "f"(__val.w));
+  return __ret;
+}
+
+__device__ inline float4 atomicAdd_block(float4 *__ptr, float4 __val) {
+  float4 __ret;
+  __asm__(
+  "atom.cta.add.v4.f32 {%0, %1, %2, %3}, [%4], {%5, %6, %7, %8};"
+  : "=f"(__ret.x), "=f"(__ret.y), "=f"(__ret.z), "=f"(__ret.w)
+  : "l"(__ptr), "f"(__val.x), "f"(__val.y), "f"(__val.z), "f"(__val.w));
+  return __ret;
+}
+
+__device__ inline float4 atomicAdd_system(float4 *__ptr, float4 __val) {
+  float4 __ret;
+  __asm__(
+  "atom.sys.add.v4.f32 {%0, %1, %2, %3}, [%4], {%5, %6, %7, %8};"
+  : "=f"(__ret.x), "=f"(__ret.y), "=f"(__ret.z), "=f"(__ret.w)
+  : "l"(__ptr), "f"(__val.x), 
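For context, a hedged sketch of how these new sm_90 wrappers might be used from a kernel. This assumes an sm_90 target and the headers from this patch; the kernel and variable names are made up for illustration:

```cuda
#include <cuda_runtime.h>

// Sums an array of float2 values into *sum using the vectorized
// atomicAdd overload added above, then synchronizes the whole cluster.
__global__ void sum_pairs(float2 *sum, const float2 *vals, int n) {
  float2 local = {0.0f, 0.0f};
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    local.x += vals[i].x;
    local.y += vals[i].y;
  }
  // One vectorized atom.add.v2.f32 instead of two scalar atomics.
  atomicAdd(sum, local);

  // Split arrive/wait cluster barrier: every block in the cluster must
  // arrive before any block proceeds past the wait.
  __cluster_barrier_arrive();
  __cluster_barrier_wait();
}
```

The split arrive/wait form lets a block do independent work between the two calls, which is the main reason the barrier is exposed as two functions rather than one.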

[PATCH] D151363: [NVPTX, CUDA] barrier intrinsics and builtins for sm_90

2023-05-24 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 525309.
tra added a comment.

whitespace fix.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151363/new/

https://reviews.llvm.org/D151363

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/test/CodeGenCUDA/builtins-sm90.cu
  llvm/include/llvm/IR/IntrinsicsNVVM.td
  llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
  llvm/test/CodeGen/NVPTX/intrinsics-sm90.ll

Index: llvm/test/CodeGen/NVPTX/intrinsics-sm90.ll
===
--- llvm/test/CodeGen/NVPTX/intrinsics-sm90.ll
+++ llvm/test/CodeGen/NVPTX/intrinsics-sm90.ll
@@ -1,5 +1,5 @@
-; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx78| FileCheck --check-prefixes=CHECK %s
-; RUN: %if ptxas-11.8 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx78| %ptxas-verify -arch=sm_90 %}
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK %s
+; RUN: %if ptxas-11.8 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %}
 
 ; CHECK-LABEL: test_isspacep
 define i1 @test_isspacep_shared_cluster(ptr %p) {
@@ -120,6 +120,19 @@
 ret i1 %x
 }
 
+; CHECK-LABEL: test_barrier_cluster(
+define void @test_barrier_cluster() {
+; CHECK: barrier.cluster.arrive;
+   call void @llvm.nvvm.barrier.cluster.arrive()
+; CHECK: barrier.cluster.arrive.relaxed;
+   call void @llvm.nvvm.barrier.cluster.arrive.relaxed()
+; CHECK: barrier.cluster.wait;
+   call void @llvm.nvvm.barrier.cluster.wait()
+; CHECK: fence.sc.cluster
+   call void @llvm.nvvm.fence.sc.cluster()
+   ret void
+}
+
 
 declare i1 @llvm.nvvm.isspacep.shared.cluster(ptr %p);
 declare ptr @llvm.nvvm.mapa(ptr %p, i32 %r);
@@ -137,3 +150,7 @@
 declare i32 @llvm.nvvm.read.ptx.sreg.cluster.ctarank()
 declare i32 @llvm.nvvm.read.ptx.sreg.cluster.nctarank()
 declare i1 @llvm.nvvm.is_explicit_cluster()
+declare void @llvm.nvvm.barrier.cluster.arrive()
+declare void @llvm.nvvm.barrier.cluster.arrive.relaxed()
+declare void @llvm.nvvm.barrier.cluster.wait()
+declare void @llvm.nvvm.fence.sc.cluster()
Index: llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
===
--- llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -132,6 +132,18 @@
  "barrier.sync \t$id, $cnt;",
  [(int_nvvm_barrier_sync_cnt imm:$id, imm:$cnt)]>,
 Requires<[hasPTX<60>, hasSM<30>]>;
+class INT_BARRIER_CLUSTER<string variant, Intrinsic Intr,
+                          list<Predicate> Preds = [hasPTX<78>, hasSM<90>]>:
+NVPTXInst<(outs), (ins), "barrier.cluster."# variant #";", [(Intr)]>,
+Requires<Preds>;
+
+def barrier_cluster_arrive:
+INT_BARRIER_CLUSTER<"arrive", int_nvvm_barrier_cluster_arrive>;
+def barrier_cluster_arrive_relaxed:
+INT_BARRIER_CLUSTER<"arrive.relaxed",
+int_nvvm_barrier_cluster_arrive_relaxed, [hasPTX<80>, hasSM<90>]>;
+def barrier_cluster_wait:
+INT_BARRIER_CLUSTER<"wait", int_nvvm_barrier_cluster_wait>;
 
 class SHFL_INSTR
@@ -303,6 +315,9 @@
 def INT_MEMBAR_GL  : MEMBAR<"membar.gl;",  int_nvvm_membar_gl>;
 def INT_MEMBAR_SYS : MEMBAR<"membar.sys;", int_nvvm_membar_sys>;
 
+def INT_FENCE_SC_CLUSTER:
+   MEMBAR<"fence.sc.cluster;", int_nvvm_fence_sc_cluster>,
+   Requires<[hasPTX<78>, hasSM<90>]>;
 
 //---
 // Async Copy Functions
Index: llvm/include/llvm/IR/IntrinsicsNVVM.td
===
--- llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1358,6 +1358,14 @@
   Intrinsic<[], [llvm_i32_ty, llvm_i32_ty], [IntrConvergent, IntrNoCallback]>,
   ClangBuiltin<"__nvvm_barrier_sync_cnt">;
 
+  // barrier.cluster.[wait, arrive, arrive.relaxed]
+  def int_nvvm_barrier_cluster_arrive :
+  Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
+  def int_nvvm_barrier_cluster_arrive_relaxed :
+  Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
+  def int_nvvm_barrier_cluster_wait :
+  Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
+
   // Membar
   def int_nvvm_membar_cta : ClangBuiltin<"__nvvm_membar_cta">,
   Intrinsic<[], [], [IntrNoCallback]>;
@@ -1365,6 +1373,8 @@
   Intrinsic<[], [], [IntrNoCallback]>;
   def int_nvvm_membar_sys : ClangBuiltin<"__nvvm_membar_sys">,
   Intrinsic<[], [], [IntrNoCallback]>;
+  def int_nvvm_fence_sc_cluster:
+  Intrinsic<[], [], [IntrNoCallback]>;
 
 // Async Copy
 def int_nvvm_cp_async_mbarrier_arrive :
Index: clang/test/CodeGenCUDA/builtins-sm90.cu
===
--- clang/test/CodeGenCUDA/builtins-sm90.cu
+++ clang/test/CodeGenCUDA/builtins-sm90.cu
@@ -1,4 +1,4 @@
-// RUN: %clang_cc1 "-triple" "nvptx64-nvidia-cuda" "-target-feature" "+ptx78" "-target-cpu" "sm_90" -emit-llvm -fcuda-is-device -o - %s | FileCheck %s
+// RUN: %clang_cc1 

[PATCH] D151363: [NVPTX, CUDA] barrier intrinsics and builtins for sm_90

2023-05-24 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
Herald added subscribers: mattd, gchakrabarti, asavonic, bixia, hiraditya, 
yaxunl.
Herald added a project: All.
tra updated this revision to Diff 525307.
tra added a comment.
tra published this revision for review.
tra added a reviewer: jlebar.
Herald added subscribers: llvm-commits, cfe-commits, jdoerfert, jholewinski.
Herald added projects: clang, LLVM.

Re-enabled .relaxed test, now that ptx80 is available.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D151363

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/test/CodeGenCUDA/builtins-sm90.cu
  llvm/include/llvm/IR/IntrinsicsNVVM.td
  llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
  llvm/test/CodeGen/NVPTX/intrinsics-sm90.ll

Index: llvm/test/CodeGen/NVPTX/intrinsics-sm90.ll
===
--- llvm/test/CodeGen/NVPTX/intrinsics-sm90.ll
+++ llvm/test/CodeGen/NVPTX/intrinsics-sm90.ll
@@ -1,5 +1,5 @@
-; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx78| FileCheck --check-prefixes=CHECK %s
-; RUN: %if ptxas-11.8 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx78| %ptxas-verify -arch=sm_90 %}
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK %s
+; RUN: %if ptxas-11.8 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %}
 
 ; CHECK-LABEL: test_isspacep
 define i1 @test_isspacep_shared_cluster(ptr %p) {
@@ -120,6 +120,19 @@
 ret i1 %x
 }
 
+; CHECK-LABEL: test_barrier_cluster(
+define void @test_barrier_cluster() {
+; CHECK: barrier.cluster.arrive;
+   call void @llvm.nvvm.barrier.cluster.arrive()
+; CHECK: barrier.cluster.arrive.relaxed;
+   call void @llvm.nvvm.barrier.cluster.arrive.relaxed()
+; CHECK: barrier.cluster.wait;
+   call void @llvm.nvvm.barrier.cluster.wait()
+; CHECK: fence.sc.cluster
+   call void @llvm.nvvm.fence.sc.cluster()
+   ret void
+}
+
 
 declare i1 @llvm.nvvm.isspacep.shared.cluster(ptr %p);
 declare ptr @llvm.nvvm.mapa(ptr %p, i32 %r);
@@ -137,3 +150,7 @@
 declare i32 @llvm.nvvm.read.ptx.sreg.cluster.ctarank()
 declare i32 @llvm.nvvm.read.ptx.sreg.cluster.nctarank()
 declare i1 @llvm.nvvm.is_explicit_cluster()
+declare void @llvm.nvvm.barrier.cluster.arrive()
+declare void @llvm.nvvm.barrier.cluster.arrive.relaxed()
+declare void @llvm.nvvm.barrier.cluster.wait()
+declare void @llvm.nvvm.fence.sc.cluster()
Index: llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
===
--- llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -132,6 +132,18 @@
  "barrier.sync \t$id, $cnt;",
  [(int_nvvm_barrier_sync_cnt imm:$id, imm:$cnt)]>,
 Requires<[hasPTX<60>, hasSM<30>]>;
+class INT_BARRIER_CLUSTER<string variant, Intrinsic Intr,
+                          list<Predicate> Preds = [hasPTX<78>, hasSM<90>]>:
+NVPTXInst<(outs), (ins), "barrier.cluster."# variant #";", [(Intr)]>,
+Requires<Preds>;
+
+def barrier_cluster_arrive:
+INT_BARRIER_CLUSTER<"arrive", int_nvvm_barrier_cluster_arrive>;
+def barrier_cluster_arrive_relaxed:
+INT_BARRIER_CLUSTER<"arrive.relaxed",
+int_nvvm_barrier_cluster_arrive_relaxed, [hasPTX<80>, hasSM<90>]>;
+def barrier_cluster_wait:
+INT_BARRIER_CLUSTER<"wait", int_nvvm_barrier_cluster_wait>;
 
 class SHFL_INSTR
@@ -303,6 +315,9 @@
 def INT_MEMBAR_GL  : MEMBAR<"membar.gl;",  int_nvvm_membar_gl>;
 def INT_MEMBAR_SYS : MEMBAR<"membar.sys;", int_nvvm_membar_sys>;
 
+def INT_FENCE_SC_CLUSTER:
+   MEMBAR<"fence.sc.cluster;", int_nvvm_fence_sc_cluster>,
+   Requires<[hasPTX<78>, hasSM<90>]>;
 
 //---
 // Async Copy Functions
Index: llvm/include/llvm/IR/IntrinsicsNVVM.td
===
--- llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1358,6 +1358,14 @@
   Intrinsic<[], [llvm_i32_ty, llvm_i32_ty], [IntrConvergent, IntrNoCallback]>,
   ClangBuiltin<"__nvvm_barrier_sync_cnt">;
 
+  // barrier.cluster.[wait, arrive, arrive.relaxed]
+  def int_nvvm_barrier_cluster_arrive :
+  Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
+  def int_nvvm_barrier_cluster_arrive_relaxed :
+  Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
+  def int_nvvm_barrier_cluster_wait :
+  Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
+
   // Membar
   def int_nvvm_membar_cta : ClangBuiltin<"__nvvm_membar_cta">,
   Intrinsic<[], [], [IntrNoCallback]>;
@@ -1365,6 +1373,8 @@
   Intrinsic<[], [], [IntrNoCallback]>;
   def int_nvvm_membar_sys : ClangBuiltin<"__nvvm_membar_sys">,
   Intrinsic<[], [], [IntrNoCallback]>;
+  def int_nvvm_fence_sc_cluster:
+  Intrinsic<[], [], [IntrNoCallback]>;
 
 // Async Copy
 def int_nvvm_cp_async_mbarrier_arrive :
Index: clang/test/CodeGenCUDA/builtins-sm90.cu

[PATCH] D151361: [CUDA] bump supported CUDA version to 12.1/11.8

2023-05-24 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
Herald added subscribers: mattd, gchakrabarti, asavonic, bixia, hiraditya, 
yaxunl.
Herald added a project: All.
tra published this revision for review.
tra added a reviewer: jlebar.
Herald added subscribers: llvm-commits, cfe-commits, MaskRay, jholewinski.
Herald added projects: clang, LLVM.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D151361

Files:
  clang/docs/ReleaseNotes.rst
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/include/clang/Basic/Cuda.h
  clang/lib/Basic/Cuda.cpp
  clang/lib/Driver/ToolChains/Cuda.cpp
  llvm/lib/Target/NVPTX/NVPTX.td

Index: llvm/lib/Target/NVPTX/NVPTX.td
===
--- llvm/lib/Target/NVPTX/NVPTX.td
+++ llvm/lib/Target/NVPTX/NVPTX.td
@@ -24,89 +24,22 @@
 //   TableGen in NVPTXGenSubtarget.inc.
 //===--===//
 
-// SM Versions
-def SM20 : SubtargetFeature<"sm_20", "SmVersion", "20",
-"Target SM 2.0">;
-def SM21 : SubtargetFeature<"sm_21", "SmVersion", "21",
-"Target SM 2.1">;
-def SM30 : SubtargetFeature<"sm_30", "SmVersion", "30",
-"Target SM 3.0">;
-def SM32 : SubtargetFeature<"sm_32", "SmVersion", "32",
-"Target SM 3.2">;
-def SM35 : SubtargetFeature<"sm_35", "SmVersion", "35",
-"Target SM 3.5">;
-def SM37 : SubtargetFeature<"sm_37", "SmVersion", "37",
-"Target SM 3.7">;
-def SM50 : SubtargetFeature<"sm_50", "SmVersion", "50",
-"Target SM 5.0">;
-def SM52 : SubtargetFeature<"sm_52", "SmVersion", "52",
-"Target SM 5.2">;
-def SM53 : SubtargetFeature<"sm_53", "SmVersion", "53",
-"Target SM 5.3">;
-def SM60 : SubtargetFeature<"sm_60", "SmVersion", "60",
- "Target SM 6.0">;
-def SM61 : SubtargetFeature<"sm_61", "SmVersion", "61",
- "Target SM 6.1">;
-def SM62 : SubtargetFeature<"sm_62", "SmVersion", "62",
- "Target SM 6.2">;
-def SM70 : SubtargetFeature<"sm_70", "SmVersion", "70",
- "Target SM 7.0">;
-def SM72 : SubtargetFeature<"sm_72", "SmVersion", "72",
- "Target SM 7.2">;
-def SM75 : SubtargetFeature<"sm_75", "SmVersion", "75",
- "Target SM 7.5">;
-def SM80 : SubtargetFeature<"sm_80", "SmVersion", "80",
- "Target SM 8.0">;
-def SM86 : SubtargetFeature<"sm_86", "SmVersion", "86",
- "Target SM 8.6">;
-def SM87 : SubtargetFeature<"sm_87", "SmVersion", "87",
- "Target SM 8.7">;
-def SM89 : SubtargetFeature<"sm_89", "SmVersion", "89",
- "Target SM 8.9">;
-def SM90 : SubtargetFeature<"sm_90", "SmVersion", "90",
- "Target SM 9.0">;
+class FeatureSM<int version>:
+   SubtargetFeature<"sm_"# version, "SmVersion",
+"" # version,
+"Target SM " # version>;
+class FeaturePTX<int version>:
+   SubtargetFeature<"ptx"# version, "PTXVersion",
+"" # version,
+"Use PTX version " # version>;
 
-// PTX Versions
-def PTX32 : SubtargetFeature<"ptx32", "PTXVersion", "32",
- "Use PTX version 3.2">;
-def PTX40 : SubtargetFeature<"ptx40", "PTXVersion", "40",
- "Use PTX version 4.0">;
-def PTX41 : SubtargetFeature<"ptx41", "PTXVersion", "41",
- "Use PTX version 4.1">;
-def PTX42 : SubtargetFeature<"ptx42", "PTXVersion", "42",
- "Use PTX version 4.2">;
-def PTX43 : SubtargetFeature<"ptx43", "PTXVersion", "43",
- "Use PTX version 4.3">;
-def PTX50 : SubtargetFeature<"ptx50", "PTXVersion", "50",
- "Use PTX version 5.0">;
-def PTX60 : SubtargetFeature<"ptx60", "PTXVersion", "60",
- "Use PTX version 6.0">;
-def PTX61 : SubtargetFeature<"ptx61", "PTXVersion", "61",
- "Use PTX version 6.1">;
-def PTX63 : SubtargetFeature<"ptx63", "PTXVersion", "63",
- "Use PTX version 6.3">;
-def PTX64 : SubtargetFeature<"ptx64", "PTXVersion", "64",
- "Use PTX version 6.4">;
-def PTX65 : SubtargetFeature<"ptx65", "PTXVersion", "65",
- "Use PTX version 6.5">;
-def PTX70 : SubtargetFeature<"ptx70", "PTXVersion", "70",
- "Use PTX version 7.0">;
-def PTX71 : SubtargetFeature<"ptx71", "PTXVersion", "71",
- "Use PTX version 7.1">;
-def PTX72 : SubtargetFeature<"ptx72", "PTXVersion", "72",
- "Use PTX version 7.2">;
-def PTX73 : SubtargetFeature<"ptx73", 

[PATCH] D151359: [CUDA] Relax restrictions on variadics in host-side compilation.

2023-05-24 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
Herald added subscribers: mattd, bixia, yaxunl.
Herald added a project: All.
tra published this revision for review.
tra added a reviewer: jlebar.
Herald added subscribers: cfe-commits, MaskRay.
Herald added a project: clang.

D150718 allows variadics during GPU compilation, but we also need to do the
same for the host compilation, as it will see the same code.
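A minimal illustration of what this flag permits (the function name below is hypothetical, not from the CUDA headers):

```cuda
// Recent CUDA headers declare variadic __device__ functions like this.
// With -fcuda-allow-variadic-functions the declaration parses in both
// host- and device-side compilation; generating device code that
// actually consumes the '...' arguments remains unsupported.
__device__ void __nv_trace(const char *tag, ...);

__global__ void kernel() {
  // The headers only need the declaration to parse; no call is required.
}
```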


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D151359

Files:
  clang/lib/Driver/ToolChains/Clang.cpp


Index: clang/lib/Driver/ToolChains/Clang.cpp
===
--- clang/lib/Driver/ToolChains/Clang.cpp
+++ clang/lib/Driver/ToolChains/Clang.cpp
@@ -4677,6 +4677,13 @@
   CmdArgs.push_back(Args.MakeArgString(
   Twine("-target-sdk-version=") +
    CudaVersionToString(CTC->CudaInstallation.version())));
+// Unsized function arguments used for variadics were introduced in
+// CUDA-9.0. We still do not support generating code that actually uses
+// variadic arguments yet, but we do need to allow parsing them as
+// recent CUDA headers rely on that.
+// https://github.com/llvm/llvm-project/issues/58410
+if (CTC->CudaInstallation.version() >= CudaVersion::CUDA_90)
+  CmdArgs.push_back("-fcuda-allow-variadic-functions");
   }
 }
 CmdArgs.push_back("-aux-triple");




[PATCH] D151243: [CUDA] Fix wrappers for sm_80 functions

2023-05-24 Thread Artem Belevich via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rG29cb080c363d: [CUDA] Fix wrappers for sm_80 functions 
(authored by tra).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D151243/new/

https://reviews.llvm.org/D151243

Files:
  clang/lib/Headers/__clang_cuda_intrinsics.h

Index: clang/lib/Headers/__clang_cuda_intrinsics.h
===
--- clang/lib/Headers/__clang_cuda_intrinsics.h
+++ clang/lib/Headers/__clang_cuda_intrinsics.h
@@ -512,70 +512,63 @@
 __device__ inline cuuint32_t __nvvm_get_smem_pointer(void *__ptr) {
   return __nv_cvta_generic_to_shared_impl(__ptr);
 }
+} // extern "C"
 
 #if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800
-__device__ inline unsigned __reduce_add_sync_unsigned_impl(unsigned __mask,
-   unsigned __value) {
-  return __nvvm_redux_sync_add(__mask, __value);
-}
-__device__ inline int __reduce_add_sync_signed_impl(unsigned __mask,
-int __value) {
+__device__ inline unsigned __reduce_add_sync(unsigned __mask,
+ unsigned __value) {
   return __nvvm_redux_sync_add(__mask, __value);
 }
-__device__ inline unsigned __reduce_min_sync_unsigned_impl(unsigned __mask,
-   unsigned __value) {
+__device__ inline unsigned __reduce_min_sync(unsigned __mask,
+ unsigned __value) {
   return __nvvm_redux_sync_umin(__mask, __value);
 }
-__device__ inline unsigned __reduce_max_sync_unsigned_impl(unsigned __mask,
-   unsigned __value) {
+__device__ inline unsigned __reduce_max_sync(unsigned __mask,
+ unsigned __value) {
   return __nvvm_redux_sync_umax(__mask, __value);
 }
-__device__ inline int __reduce_min_sync_signed_impl(unsigned __mask,
-int __value) {
+__device__ inline int __reduce_min_sync(unsigned __mask, int __value) {
   return __nvvm_redux_sync_min(__mask, __value);
 }
-__device__ inline int __reduce_max_sync_signed_impl(unsigned __mask,
-int __value) {
+__device__ inline int __reduce_max_sync(unsigned __mask, int __value) {
   return __nvvm_redux_sync_max(__mask, __value);
 }
-__device__ inline unsigned __reduce_or_sync_unsigned_impl(unsigned __mask,
-  unsigned __value) {
+__device__ inline unsigned __reduce_or_sync(unsigned __mask, unsigned __value) {
   return __nvvm_redux_sync_or(__mask, __value);
 }
-__device__ inline unsigned __reduce_and_sync_unsigned_impl(unsigned __mask,
-   unsigned __value) {
+__device__ inline unsigned __reduce_and_sync(unsigned __mask,
+ unsigned __value) {
   return __nvvm_redux_sync_and(__mask, __value);
 }
-__device__ inline unsigned __reduce_xor_sync_unsigned_impl(unsigned __mask,
-   unsigned __value) {
+__device__ inline unsigned __reduce_xor_sync(unsigned __mask,
+ unsigned __value) {
   return __nvvm_redux_sync_xor(__mask, __value);
 }
 
-__device__ inline void
-__nv_memcpy_async_shared_global_4_impl(void *__dst, const void *__src,
-   unsigned __src_size) {
+__device__ inline void __nv_memcpy_async_shared_global_4(void *__dst,
+ const void *__src,
+ unsigned __src_size) {
   __nvvm_cp_async_ca_shared_global_4(
   (void __attribute__((address_space(3))) *)__dst,
   (const void __attribute__((address_space(1))) *)__src, __src_size);
 }
-__device__ inline void
-__nv_memcpy_async_shared_global_8_impl(void *__dst, const void *__src,
-   unsigned __src_size) {
+__device__ inline void __nv_memcpy_async_shared_global_8(void *__dst,
+ const void *__src,
+ unsigned __src_size) {
   __nvvm_cp_async_ca_shared_global_8(
   (void __attribute__((address_space(3))) *)__dst,
   (const void __attribute__((address_space(1))) *)__src, __src_size);
 }
-__device__ inline void
-__nv_memcpy_async_shared_global_16_impl(void *__dst, const void *__src,
-unsigned __src_size) {
+__device__ inline void __nv_memcpy_async_shared_global_16(void *__dst,
+  const void *__src,
+  

[PATCH] D150985: [clang] Allow fp in atomic fetch max/min builtins

2023-05-24 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

As I said, I'm OK with the patch in principle; I just don't know what other
factors I may be missing.

Tests seem to be missing for c11 variants of the builtins.




Comment at: clang/test/Sema/atomic-ops.c:209
+  __atomic_fetch_min(D, 3, memory_order_seq_cst);
+  __atomic_fetch_max(P, 3, memory_order_seq_cst);
   __atomic_fetch_max(p, 3);   // expected-error {{too few 
arguments to function call, expected 3, have 2}}

Is it intentional that we now allow atomic max on an `int **P`? My 
understanding is that we were supposed to allow additional FP types only.



Comment at: clang/test/SemaOpenCL/atomic-ops.cl:65
+  __opencl_atomic_fetch_min(f, 1, memory_order_seq_cst, 
memory_scope_work_group);
+  __opencl_atomic_fetch_max(f, 1, memory_order_seq_cst, 
memory_scope_work_group);
 

We probably want to add tests for `double`, too.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150985/new/

https://reviews.llvm.org/D150985



[PATCH] D151243: [CUDA] Fix wrappers for sm_80 functions

2023-05-23 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
Herald added subscribers: mattd, carlosgalvezp, bixia, yaxunl.
Herald added a project: All.
tra published this revision for review.
tra added a reviewer: jlebar.
Herald added a project: clang.
Herald added a subscriber: cfe-commits.

The previous implementation provided wrappers for the internal implementations
used by the CUDA headers. However, clang does not include those, so we need to
provide the public functions instead.
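As a rough illustration (assuming an sm_80+ target and this patch), user code can now call the public names directly instead of the `*_impl` helpers:

```cuda
// Warp-level reduction using the public __reduce_add_sync wrapper.
__global__ void warp_sums(unsigned *out, const unsigned *in) {
  unsigned v = in[threadIdx.x];
  // All 32 lanes participate (full mask); every lane receives the sum.
  unsigned sum = __reduce_add_sync(0xffffffffu, v);
  if ((threadIdx.x & 31) == 0)
    out[threadIdx.x >> 5] = sum;
}
```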


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D151243

Files:
  clang/lib/Headers/__clang_cuda_intrinsics.h

Index: clang/lib/Headers/__clang_cuda_intrinsics.h
===
--- clang/lib/Headers/__clang_cuda_intrinsics.h
+++ clang/lib/Headers/__clang_cuda_intrinsics.h
@@ -512,70 +512,63 @@
 __device__ inline cuuint32_t __nvvm_get_smem_pointer(void *__ptr) {
   return __nv_cvta_generic_to_shared_impl(__ptr);
 }
+} // extern "C"
 
 #if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800
-__device__ inline unsigned __reduce_add_sync_unsigned_impl(unsigned __mask,
-   unsigned __value) {
+__device__ inline unsigned __reduce_add_sync(unsigned __mask,
+ unsigned __value) {
   return __nvvm_redux_sync_add(__mask, __value);
 }
-__device__ inline int __reduce_add_sync_signed_impl(unsigned __mask,
-int __value) {
-  return __nvvm_redux_sync_add(__mask, __value);
-}
-__device__ inline unsigned __reduce_min_sync_unsigned_impl(unsigned __mask,
-   unsigned __value) {
+__device__ inline unsigned __reduce_min_sync(unsigned __mask,
+ unsigned __value) {
   return __nvvm_redux_sync_umin(__mask, __value);
 }
-__device__ inline unsigned __reduce_max_sync_unsigned_impl(unsigned __mask,
-   unsigned __value) {
+__device__ inline unsigned __reduce_max_sync(unsigned __mask,
+ unsigned __value) {
   return __nvvm_redux_sync_umax(__mask, __value);
 }
-__device__ inline int __reduce_min_sync_signed_impl(unsigned __mask,
-int __value) {
+__device__ inline int __reduce_min_sync(unsigned __mask, int __value) {
   return __nvvm_redux_sync_min(__mask, __value);
 }
-__device__ inline int __reduce_max_sync_signed_impl(unsigned __mask,
-int __value) {
+__device__ inline int __reduce_max_sync(unsigned __mask, int __value) {
   return __nvvm_redux_sync_max(__mask, __value);
 }
-__device__ inline unsigned __reduce_or_sync_unsigned_impl(unsigned __mask,
-  unsigned __value) {
+__device__ inline unsigned __reduce_or_sync(unsigned __mask, unsigned __value) {
   return __nvvm_redux_sync_or(__mask, __value);
 }
-__device__ inline unsigned __reduce_and_sync_unsigned_impl(unsigned __mask,
-   unsigned __value) {
+__device__ inline unsigned __reduce_and_sync(unsigned __mask,
+ unsigned __value) {
   return __nvvm_redux_sync_and(__mask, __value);
 }
-__device__ inline unsigned __reduce_xor_sync_unsigned_impl(unsigned __mask,
-   unsigned __value) {
+__device__ inline unsigned __reduce_xor_sync(unsigned __mask,
+ unsigned __value) {
   return __nvvm_redux_sync_xor(__mask, __value);
 }
 
-__device__ inline void
-__nv_memcpy_async_shared_global_4_impl(void *__dst, const void *__src,
-   unsigned __src_size) {
+__device__ inline void __nv_memcpy_async_shared_global_4(void *__dst,
+ const void *__src,
+ unsigned __src_size) {
   __nvvm_cp_async_ca_shared_global_4(
   (void __attribute__((address_space(3))) *)__dst,
   (const void __attribute__((address_space(1))) *)__src, __src_size);
 }
-__device__ inline void
-__nv_memcpy_async_shared_global_8_impl(void *__dst, const void *__src,
-   unsigned __src_size) {
+__device__ inline void __nv_memcpy_async_shared_global_8(void *__dst,
+ const void *__src,
+ unsigned __src_size) {
   __nvvm_cp_async_ca_shared_global_8(
   (void __attribute__((address_space(3))) *)__dst,
   (const void __attribute__((address_space(1))) *)__src, __src_size);
 }
-__device__ inline void
-__nv_memcpy_async_shared_global_16_impl(void *__dst, const void *__src,
-unsigned __src_size) 

[PATCH] D151168: [CUDA] plumb through new sm_90-specific builtins.

2023-05-22 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
Herald added subscribers: mattd, gchakrabarti, asavonic, bixia, yaxunl.
Herald added a project: All.
tra added a reviewer: jlebar.
tra published this revision for review.
Herald added subscribers: cfe-commits, jholewinski.
Herald added a project: clang.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D151168

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/test/CodeGenCUDA/builtins-sm90.cu

Index: clang/test/CodeGenCUDA/builtins-sm90.cu
===
--- /dev/null
+++ clang/test/CodeGenCUDA/builtins-sm90.cu
@@ -0,0 +1,41 @@
+// RUN: %clang_cc1 "-triple" "nvptx64-nvidia-cuda" "-target-feature" "+ptx78" "-target-cpu" "sm_90" -emit-llvm -fcuda-is-device -o - %s | FileCheck %s
+
+// CHECK: define{{.*}} void @_Z6kernelPlPvj(
+__attribute__((global)) void kernel(long *out, void *ptr, unsigned u) {
+  int i = 0;
+  out[i++] = __nvvm_isspacep_shared_cluster(ptr);
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.clusterid.x()
+  out[i++] = __nvvm_read_ptx_sreg_clusterid_x();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.clusterid.y()
+  out[i++] = __nvvm_read_ptx_sreg_clusterid_y();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.clusterid.z()
+  out[i++] = __nvvm_read_ptx_sreg_clusterid_z();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.clusterid.w()
+  out[i++] = __nvvm_read_ptx_sreg_clusterid_w();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.nclusterid.x()
+  out[i++] = __nvvm_read_ptx_sreg_nclusterid_x();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.nclusterid.y()
+  out[i++] = __nvvm_read_ptx_sreg_nclusterid_y();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.nclusterid.z()
+  out[i++] = __nvvm_read_ptx_sreg_nclusterid_z();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.nclusterid.w()
+  out[i++] = __nvvm_read_ptx_sreg_nclusterid_w();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.cluster.ctarank()
+  out[i++] = __nvvm_read_ptx_sreg_cluster_ctarank();
+  // CHECK: call i32 @llvm.nvvm.read.ptx.sreg.cluster.nctarank()
+  out[i++] = __nvvm_read_ptx_sreg_cluster_nctarank();
+  // CHECK: call i1 @llvm.nvvm.is_explicit_cluster()
+  out[i++] = __nvvm_is_explicit_cluster();
+
+  auto * sptr = (__attribute__((address_space(3))) void *)ptr;
+  // CHECK: call ptr @llvm.nvvm.mapa(ptr %{{.*}}, i32 %{{.*}})
+  out[i++] = (long) __nvvm_mapa(ptr, u);
+  // CHECK: call ptr addrspace(3) @llvm.nvvm.mapa.shared.cluster(ptr addrspace(3) %{{.*}}, i32 %{{.*}})
+  out[i++] = (long) __nvvm_mapa_shared_cluster(sptr, u);
+  // CHECK: call i32 @llvm.nvvm.getctarank(ptr {{.*}})
+  out[i++] = __nvvm_getctarank(ptr);
+  // CHECK: call i32 @llvm.nvvm.getctarank.shared.cluster(ptr addrspace(3) {{.*}})
+  out[i++] = __nvvm_getctarank_shared_cluster(sptr);
+
+  // CHECK: ret void
+}
Index: clang/lib/CodeGen/CGBuiltin.cpp
===
--- clang/lib/CodeGen/CGBuiltin.cpp
+++ clang/lib/CodeGen/CGBuiltin.cpp
@@ -18869,6 +18869,59 @@
 return MakeCpAsync(Intrinsic::nvvm_cp_async_cg_shared_global_16,
Intrinsic::nvvm_cp_async_cg_shared_global_16_s, *this, E,
16);
+  case NVPTX::BI__nvvm_read_ptx_sreg_clusterid_x:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_read_ptx_sreg_clusterid_x));
+  case NVPTX::BI__nvvm_read_ptx_sreg_clusterid_y:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_read_ptx_sreg_clusterid_y));
+  case NVPTX::BI__nvvm_read_ptx_sreg_clusterid_z:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_read_ptx_sreg_clusterid_z));
+  case NVPTX::BI__nvvm_read_ptx_sreg_clusterid_w:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_read_ptx_sreg_clusterid_w));
+  case NVPTX::BI__nvvm_read_ptx_sreg_nclusterid_x:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_read_ptx_sreg_nclusterid_x));
+  case NVPTX::BI__nvvm_read_ptx_sreg_nclusterid_y:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_read_ptx_sreg_nclusterid_y));
+  case NVPTX::BI__nvvm_read_ptx_sreg_nclusterid_z:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_read_ptx_sreg_nclusterid_z));
+  case NVPTX::BI__nvvm_read_ptx_sreg_nclusterid_w:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_read_ptx_sreg_nclusterid_w));
+  case NVPTX::BI__nvvm_read_ptx_sreg_cluster_ctarank:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_read_ptx_sreg_cluster_ctarank));
+  case NVPTX::BI__nvvm_read_ptx_sreg_cluster_nctarank:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_read_ptx_sreg_cluster_nctarank));
+  case NVPTX::BI__nvvm_is_explicit_cluster:
+return Builder.CreateCall(
+CGM.getIntrinsic(Intrinsic::nvvm_is_explicit_cluster));
+  case NVPTX::BI__nvvm_isspacep_shared_cluster:
+return 

[PATCH] D144911: adding bf16 support to NVPTX

2023-05-22 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp:315-318
-} else if (RC == ::BFloat16RegsRegClass) {
-  Ret = (9 << 28);
-} else if (RC == ::BFloat16x2RegsRegClass) {
-  Ret = (10 << 28);

There's still something odd with the patch. It appears that it's a diff vs a 
previous set of the changes which did introduce BFloat16RegsRegClass.

Can you please update the diff to be vs the recent LLVM tree HEAD?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144911/new/

https://reviews.llvm.org/D144911

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D150894: [CUDA] provide wrapper functions for new NVCC builtins.

2023-05-19 Thread Artem Belevich via Phabricator via cfe-commits
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
Closed by commit rG4450285bd740: [CUDA] provide wrapper functions for new NVCC 
builtins. (authored by tra).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150894/new/

https://reviews.llvm.org/D150894

Files:
  clang/lib/Headers/__clang_cuda_intrinsics.h


Index: clang/lib/Headers/__clang_cuda_intrinsics.h
===
--- clang/lib/Headers/__clang_cuda_intrinsics.h
+++ clang/lib/Headers/__clang_cuda_intrinsics.h
@@ -512,6 +512,78 @@
 __device__ inline cuuint32_t __nvvm_get_smem_pointer(void *__ptr) {
   return __nv_cvta_generic_to_shared_impl(__ptr);
 }
+
+#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800
+__device__ inline unsigned __reduce_add_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_add(__mask, __value);
+}
+__device__ inline int __reduce_add_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_add(__mask, __value);
+}
+__device__ inline unsigned __reduce_min_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_umin(__mask, __value);
+}
+__device__ inline unsigned __reduce_max_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_umax(__mask, __value);
+}
+__device__ inline int __reduce_min_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_min(__mask, __value);
+}
+__device__ inline int __reduce_max_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_max(__mask, __value);
+}
+__device__ inline unsigned __reduce_or_sync_unsigned_impl(unsigned __mask,
+  unsigned __value) {
+  return __nvvm_redux_sync_or(__mask, __value);
+}
+__device__ inline unsigned __reduce_and_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_and(__mask, __value);
+}
+__device__ inline unsigned __reduce_xor_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_xor(__mask, __value);
+}
+
+__device__ inline void
+__nv_memcpy_async_shared_global_4_impl(void *__dst, const void *__src,
+   unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_4(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+__device__ inline void
+__nv_memcpy_async_shared_global_8_impl(void *__dst, const void *__src,
+   unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_8(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+__device__ inline void
+__nv_memcpy_async_shared_global_16_impl(void *__dst, const void *__src,
+unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_16(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+
+__device__ inline void *
+__nv_associate_access_property_impl(const void *__ptr,
+unsigned long long __prop) {
+  // TODO: it appears to provide compiler with some sort of a hint. We do not
+  // know what exactly it is supposed to do. However, CUDA headers suggest that
+  // just passing through __ptr should not affect correctness. They do so on
+  // pre-sm80 GPUs where this builtin is not available.
+  return (void*)__ptr;
+}
+#endif // !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800
+
 } // extern "C"
 #endif // CUDA_VERSION >= 11000
 



[PATCH] D150894: [CUDA] provide wrapper functions for new NVCC builtins.

2023-05-19 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 523881.
tra added a comment.

typo fix.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150894/new/

https://reviews.llvm.org/D150894

Files:
  clang/lib/Headers/__clang_cuda_intrinsics.h


Index: clang/lib/Headers/__clang_cuda_intrinsics.h
===
--- clang/lib/Headers/__clang_cuda_intrinsics.h
+++ clang/lib/Headers/__clang_cuda_intrinsics.h
@@ -512,6 +512,78 @@
 __device__ inline cuuint32_t __nvvm_get_smem_pointer(void *__ptr) {
   return __nv_cvta_generic_to_shared_impl(__ptr);
 }
+
+#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800
+__device__ inline unsigned __reduce_add_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_add(__mask, __value);
+}
+__device__ inline int __reduce_add_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_add(__mask, __value);
+}
+__device__ inline unsigned __reduce_min_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_umin(__mask, __value);
+}
+__device__ inline unsigned __reduce_max_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_umax(__mask, __value);
+}
+__device__ inline int __reduce_min_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_min(__mask, __value);
+}
+__device__ inline int __reduce_max_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_max(__mask, __value);
+}
+__device__ inline unsigned __reduce_or_sync_unsigned_impl(unsigned __mask,
+  unsigned __value) {
+  return __nvvm_redux_sync_or(__mask, __value);
+}
+__device__ inline unsigned __reduce_and_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_and(__mask, __value);
+}
+__device__ inline unsigned __reduce_xor_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_xor(__mask, __value);
+}
+
+__device__ inline void
+__nv_memcpy_async_shared_global_4_impl(void *__dst, const void *__src,
+   unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_4(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+__device__ inline void
+__nv_memcpy_async_shared_global_8_impl(void *__dst, const void *__src,
+   unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_8(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+__device__ inline void
+__nv_memcpy_async_shared_global_16_impl(void *__dst, const void *__src,
+unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_16(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+
+__device__ inline void *
+__nv_associate_access_property_impl(const void *__ptr,
+unsigned long long __prop) {
+  // TODO: it appears to provide compiler with some sort of a hint. We do not
+  // know what exactly it is supposed to do. However, CUDA headers suggest that
+  // just passing through __ptr should not affect correctness. They do so on
+  // pre-sm80 GPUs where this builtin is not available.
+  return (void*)__ptr;
+}
+#endif // !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800
+
 } // extern "C"
 #endif // CUDA_VERSION >= 11000
 



[PATCH] D150894: [CUDA] provide wrapper functions for new NVCC builtins.

2023-05-19 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 523879.
tra added a comment.

Added __nv_associate_access_property_impl() stub.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150894/new/

https://reviews.llvm.org/D150894

Files:
  clang/lib/Headers/__clang_cuda_intrinsics.h


Index: clang/lib/Headers/__clang_cuda_intrinsics.h
===
--- clang/lib/Headers/__clang_cuda_intrinsics.h
+++ clang/lib/Headers/__clang_cuda_intrinsics.h
@@ -512,6 +512,78 @@
 __device__ inline cuuint32_t __nvvm_get_smem_pointer(void *__ptr) {
   return __nv_cvta_generic_to_shared_impl(__ptr);
 }
+
+#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800
+__device__ inline unsigned __reduce_add_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_add(__mask, __value);
+}
+__device__ inline int __reduce_add_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_add(__mask, __value);
+}
+__device__ inline unsigned __reduce_min_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_umin(__mask, __value);
+}
+__device__ inline unsigned __reduce_max_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_umax(__mask, __value);
+}
+__device__ inline int __reduce_min_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_min(__mask, __value);
+}
+__device__ inline int __reduce_max_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_max(__mask, __value);
+}
+__device__ inline unsigned __reduce_or_sync_unsigned_impl(unsigned __mask,
+  unsigned __value) {
+  return __nvvm_redux_sync_or(__mask, __value);
+}
+__device__ inline unsigned __reduce_and_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_and(__mask, __value);
+}
+__device__ inline unsigned __reduce_xor_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_xor(__mask, __value);
+}
+
+__device__ inline void
+__nv_memcpy_async_shared_global_4_impl(void *__dst, const void *__src,
+   unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_4(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+__device__ inline void
+__nv_memcpy_async_shared_global_8_impl(void *__dst, const void *__src,
+   unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_8(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+__device__ inline void
+__nv_memcpy_async_shared_global_16_impl(void *__dst, const void *__src,
+unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_16(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+
+__device__ inline void *
+__nv_associate_access_property_impl(const void *__ptr,
+unsigned long long __prop) {
+  // TODO: it appears to provide compiler with some sort of a hint. We do not
+  // know what exactly it is supposed to do. However, CUDA headers suggest that
+  // just passing through __ptr should not affect correctness. They do so on
+  // pre-sm80 GPUs where this builtin is not available.
+  return (void*)__ptr;
+}
+#endif // !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800
+
 } // extern "C"
 #endif // CUDA_VERSION >= 11000
 



[PATCH] D150965: [HIP] Allow std::malloc in device function

2023-05-19 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/test/Headers/Inputs/include/math.h:108-109
 long lroundf(float __a);
-int max(int __a, int __b);
-int min(int __a, int __b);
 double modf(double __a, double *__b);

yaxunl wrote:
> tra wrote:
> > Why were these functions removed? It does not seem related to the changes 
> > in the patch?
> These functions caused failure in the added lit test.
> 
> For C++, max/min are defined as templates in `<algorithm>`. There is no max/min
> in either standard C or C++ math.h. Their existence causes false alarms in lit
> tests. Removing them to be consistent with standard C/C++ headers.
I suspect those may have been used for some CUDA tests. CUDA headers used to 
define `::min()` and `::max()`.

As long as it does not affect the tests, removing them is fine.




CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150965/new/

https://reviews.llvm.org/D150965



[PATCH] D150985: [clang] Allow fp in atomic fetch max/min builtins

2023-05-19 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

The code changes look OK to me.

Whether allowing FP for clang builtins is OK -- I have no idea, especially for 
the c11 ones.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150985/new/

https://reviews.llvm.org/D150985



[PATCH] D150820: [NVPTX, CUDA] added optional src_size argument to __nvvm_cp_async*

2023-05-19 Thread Artem Belevich via Phabricator via cfe-commits
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
Closed by commit rG6963c61f0f6e: [NVPTX/CUDA] added an optional src_size 
argument to __nvvm_cp_async* (authored by tra).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150820/new/

https://reviews.llvm.org/D150820

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/include/clang/Sema/Sema.h
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/lib/Sema/SemaChecking.cpp
  clang/test/CodeGen/builtins-nvptx.c
  clang/test/SemaCUDA/builtins.cu
  llvm/include/llvm/IR/IntrinsicsNVVM.td
  llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
  llvm/test/CodeGen/NVPTX/async-copy.ll

Index: llvm/test/CodeGen/NVPTX/async-copy.ll
===
--- llvm/test/CodeGen/NVPTX/async-copy.ll
+++ llvm/test/CodeGen/NVPTX/async-copy.ll
@@ -1,35 +1,35 @@
-; RUN: llc < %s -march=nvptx -mcpu=sm_80 -mattr=+ptx70 | FileCheck -check-prefixes=ALL,CHECK_PTX32 %s
-; RUN: llc < %s -march=nvptx64 -mcpu=sm_80 -mattr=+ptx70 | FileCheck -check-prefixes=ALL,CHECK_PTX64 %s
+; RUN: llc < %s -march=nvptx -mcpu=sm_80 -mattr=+ptx70 | FileCheck -check-prefixes=CHECK,CHECK_PTX32 %s
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_80 -mattr=+ptx70 | FileCheck -check-prefixes=CHECK,CHECK_PTX64 %s
 ; RUN: %if ptxas-11.0 %{ llc < %s -march=nvptx -mcpu=sm_80 -mattr=+ptx70 | %ptxas-verify -arch=sm_80 %}
 ; RUN: %if ptxas-11.0 %{ llc < %s -march=nvptx64 -mcpu=sm_80 -mattr=+ptx70 | %ptxas-verify -arch=sm_80 %}
 
 declare void @llvm.nvvm.cp.async.wait.group(i32)
 
-; ALL-LABEL: asyncwaitgroup
+; CHECK-LABEL: asyncwaitgroup
 define void @asyncwaitgroup() {
-  ; ALL: cp.async.wait_group 8;
+  ; CHECK: cp.async.wait_group 8;
   tail call void @llvm.nvvm.cp.async.wait.group(i32 8)
-  ; ALL: cp.async.wait_group 0;
+  ; CHECK: cp.async.wait_group 0;
   tail call void @llvm.nvvm.cp.async.wait.group(i32 0)
-  ; ALL: cp.async.wait_group 16;
+  ; CHECK: cp.async.wait_group 16;
   tail call void @llvm.nvvm.cp.async.wait.group(i32 16)
   ret void
 }
 
 declare void @llvm.nvvm.cp.async.wait.all()
 
-; ALL-LABEL: asyncwaitall
+; CHECK-LABEL: asyncwaitall
 define void @asyncwaitall() {
-; ALL: cp.async.wait_all
+; CHECK: cp.async.wait_all
   tail call void @llvm.nvvm.cp.async.wait.all()
   ret void
 }
 
 declare void @llvm.nvvm.cp.async.commit.group()
 
-; ALL-LABEL: asynccommitgroup
+; CHECK-LABEL: asynccommitgroup
 define void @asynccommitgroup() {
-; ALL: cp.async.commit_group
+; CHECK: cp.async.commit_group
   tail call void @llvm.nvvm.cp.async.commit.group()
   ret void
 }
@@ -41,72 +41,87 @@
 
 ; CHECK-LABEL: asyncmbarrier
 define void @asyncmbarrier(ptr %a) {
-; CHECK_PTX32: cp.async.mbarrier.arrive.b64 [%r{{[0-9]+}}];
-; CHECK_PTX64: cp.async.mbarrier.arrive.b64 [%rd{{[0-9]+}}];
+; The distinction between PTX32/PTX64 here is only to capture pointer register type
+; in R to be used in subsequent tests.
+; CHECK_PTX32: cp.async.mbarrier.arrive.b64 [%[[R:r]]{{[0-9]+}}];
+; CHECK_PTX64: cp.async.mbarrier.arrive.b64 [%[[R:rd]]{{[0-9]+}}];
   tail call void @llvm.nvvm.cp.async.mbarrier.arrive(ptr %a)
   ret void
 }
 
 ; CHECK-LABEL: asyncmbarriershared
 define void @asyncmbarriershared(ptr addrspace(3) %a) {
-; CHECK_PTX32: cp.async.mbarrier.arrive.shared.b64 [%r{{[0-9]+}}];
-; CHECK_PTX64: cp.async.mbarrier.arrive.shared.b64 [%rd{{[0-9]+}}];
+; CHECK: cp.async.mbarrier.arrive.shared.b64 [%[[R]]{{[0-9]+}}];
   tail call void @llvm.nvvm.cp.async.mbarrier.arrive.shared(ptr addrspace(3) %a)
   ret void
 }
 
 ; CHECK-LABEL: asyncmbarriernoinc
 define void @asyncmbarriernoinc(ptr %a) {
-; CHECK_PTX32: cp.async.mbarrier.arrive.noinc.b64 [%r{{[0-9]+}}];
-; CHECK_PTX64: cp.async.mbarrier.arrive.noinc.b64 [%rd{{[0-9]+}}];
+; CHECK_PTX64: cp.async.mbarrier.arrive.noinc.b64 [%[[R]]{{[0-9]+}}];
   tail call void @llvm.nvvm.cp.async.mbarrier.arrive.noinc(ptr %a)
   ret void
 }
 
 ; CHECK-LABEL: asyncmbarriernoincshared
 define void @asyncmbarriernoincshared(ptr addrspace(3) %a) {
-; CHECK_PTX32: cp.async.mbarrier.arrive.noinc.shared.b64 [%r{{[0-9]+}}];
-; CHECK_PTX64: cp.async.mbarrier.arrive.noinc.shared.b64 [%rd{{[0-9]+}}];
+; CHECK: cp.async.mbarrier.arrive.noinc.shared.b64 [%[[R]]{{[0-9]+}}];
   tail call void @llvm.nvvm.cp.async.mbarrier.arrive.noinc.shared(ptr addrspace(3) %a)
   ret void
 }
 
 declare void @llvm.nvvm.cp.async.ca.shared.global.4(ptr addrspace(3) %a, ptr addrspace(1) %b)
+declare void @llvm.nvvm.cp.async.ca.shared.global.4.s(ptr addrspace(3) %a, ptr addrspace(1) %b, i32 %c)
 
 ; CHECK-LABEL: asynccasharedglobal4i8
-define void @asynccasharedglobal4i8(ptr addrspace(3) %a, ptr addrspace(1) %b) {
-; CHECK_PTX32: cp.async.ca.shared.global [%r{{[0-9]+}}], [%r{{[0-9]+}}], 4;
-; CHECK_PTX64: cp.async.ca.shared.global [%rd{{[0-9]+}}], [%rd{{[0-9]+}}], 4;
+define void @asynccasharedglobal4i8(ptr addrspace(3) %a, ptr addrspace(1) %b, i32 %c) {
+; CHECK: 

[PATCH] D150965: [HIP] Allow std::malloc in device function

2023-05-19 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

LGTM.




Comment at: clang/test/Headers/Inputs/include/math.h:108-109
 long lroundf(float __a);
-int max(int __a, int __b);
-int min(int __a, int __b);
 double modf(double __a, double *__b);

Why were these functions removed? It does not seem related to the changes in 
the patch?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150965/new/

https://reviews.llvm.org/D150965



[PATCH] D150820: [NVPTX, CUDA] added optional src_size argument to __nvvm_cp_async*

2023-05-18 Thread Artem Belevich via Phabricator via cfe-commits
tra requested review of this revision.
tra added a comment.

PTAL.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150820/new/

https://reviews.llvm.org/D150820



[PATCH] D150820: [NVPTX, CUDA] added optional src_size argument to __nvvm_cp_async*

2023-05-18 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 523566.
tra added a comment.

Instead of changing the existing intrinsics, introduce a new set which takes an
additional src_size argument. This should keep existing users working.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150820/new/

https://reviews.llvm.org/D150820

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/include/clang/Sema/Sema.h
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/lib/Sema/SemaChecking.cpp
  clang/test/CodeGen/builtins-nvptx.c
  clang/test/SemaCUDA/builtins.cu
  llvm/include/llvm/IR/IntrinsicsNVVM.td
  llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
  llvm/test/CodeGen/NVPTX/async-copy.ll

Index: llvm/test/CodeGen/NVPTX/async-copy.ll
===
--- llvm/test/CodeGen/NVPTX/async-copy.ll
+++ llvm/test/CodeGen/NVPTX/async-copy.ll
@@ -1,35 +1,35 @@
-; RUN: llc < %s -march=nvptx -mcpu=sm_80 -mattr=+ptx70 | FileCheck -check-prefixes=ALL,CHECK_PTX32 %s
-; RUN: llc < %s -march=nvptx64 -mcpu=sm_80 -mattr=+ptx70 | FileCheck -check-prefixes=ALL,CHECK_PTX64 %s
+; RUN: llc < %s -march=nvptx -mcpu=sm_80 -mattr=+ptx70 | FileCheck -check-prefixes=CHECK,CHECK_PTX32 %s
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_80 -mattr=+ptx70 | FileCheck -check-prefixes=CHECK,CHECK_PTX64 %s
 ; RUN: %if ptxas-11.0 %{ llc < %s -march=nvptx -mcpu=sm_80 -mattr=+ptx70 | %ptxas-verify -arch=sm_80 %}
 ; RUN: %if ptxas-11.0 %{ llc < %s -march=nvptx64 -mcpu=sm_80 -mattr=+ptx70 | %ptxas-verify -arch=sm_80 %}
 
 declare void @llvm.nvvm.cp.async.wait.group(i32)
 
-; ALL-LABEL: asyncwaitgroup
+; CHECK-LABEL: asyncwaitgroup
 define void @asyncwaitgroup() {
-  ; ALL: cp.async.wait_group 8;
+  ; CHECK: cp.async.wait_group 8;
   tail call void @llvm.nvvm.cp.async.wait.group(i32 8)
-  ; ALL: cp.async.wait_group 0;
+  ; CHECK: cp.async.wait_group 0;
   tail call void @llvm.nvvm.cp.async.wait.group(i32 0)
-  ; ALL: cp.async.wait_group 16;
+  ; CHECK: cp.async.wait_group 16;
   tail call void @llvm.nvvm.cp.async.wait.group(i32 16)
   ret void
 }
 
 declare void @llvm.nvvm.cp.async.wait.all()
 
-; ALL-LABEL: asyncwaitall
+; CHECK-LABEL: asyncwaitall
 define void @asyncwaitall() {
-; ALL: cp.async.wait_all
+; CHECK: cp.async.wait_all
   tail call void @llvm.nvvm.cp.async.wait.all()
   ret void
 }
 
 declare void @llvm.nvvm.cp.async.commit.group()
 
-; ALL-LABEL: asynccommitgroup
+; CHECK-LABEL: asynccommitgroup
 define void @asynccommitgroup() {
-; ALL: cp.async.commit_group
+; CHECK: cp.async.commit_group
   tail call void @llvm.nvvm.cp.async.commit.group()
   ret void
 }
@@ -41,72 +41,87 @@
 
 ; CHECK-LABEL: asyncmbarrier
 define void @asyncmbarrier(ptr %a) {
-; CHECK_PTX32: cp.async.mbarrier.arrive.b64 [%r{{[0-9]+}}];
-; CHECK_PTX64: cp.async.mbarrier.arrive.b64 [%rd{{[0-9]+}}];
+; The distinction between PTX32/PTX64 here is only to capture pointer register type
+; in R to be used in subsequent tests.
+; CHECK_PTX32: cp.async.mbarrier.arrive.b64 [%[[R:r]]{{[0-9]+}}];
+; CHECK_PTX64: cp.async.mbarrier.arrive.b64 [%[[R:rd]]{{[0-9]+}}];
   tail call void @llvm.nvvm.cp.async.mbarrier.arrive(ptr %a)
   ret void
 }
 
 ; CHECK-LABEL: asyncmbarriershared
 define void @asyncmbarriershared(ptr addrspace(3) %a) {
-; CHECK_PTX32: cp.async.mbarrier.arrive.shared.b64 [%r{{[0-9]+}}];
-; CHECK_PTX64: cp.async.mbarrier.arrive.shared.b64 [%rd{{[0-9]+}}];
+; CHECK: cp.async.mbarrier.arrive.shared.b64 [%[[R]]{{[0-9]+}}];
   tail call void @llvm.nvvm.cp.async.mbarrier.arrive.shared(ptr addrspace(3) %a)
   ret void
 }
 
 ; CHECK-LABEL: asyncmbarriernoinc
 define void @asyncmbarriernoinc(ptr %a) {
-; CHECK_PTX32: cp.async.mbarrier.arrive.noinc.b64 [%r{{[0-9]+}}];
-; CHECK_PTX64: cp.async.mbarrier.arrive.noinc.b64 [%rd{{[0-9]+}}];
+; CHECK_PTX64: cp.async.mbarrier.arrive.noinc.b64 [%[[R]]{{[0-9]+}}];
   tail call void @llvm.nvvm.cp.async.mbarrier.arrive.noinc(ptr %a)
   ret void
 }
 
 ; CHECK-LABEL: asyncmbarriernoincshared
 define void @asyncmbarriernoincshared(ptr addrspace(3) %a) {
-; CHECK_PTX32: cp.async.mbarrier.arrive.noinc.shared.b64 [%r{{[0-9]+}}];
-; CHECK_PTX64: cp.async.mbarrier.arrive.noinc.shared.b64 [%rd{{[0-9]+}}];
+; CHECK: cp.async.mbarrier.arrive.noinc.shared.b64 [%[[R]]{{[0-9]+}}];
   tail call void @llvm.nvvm.cp.async.mbarrier.arrive.noinc.shared(ptr addrspace(3) %a)
   ret void
 }
 
 declare void @llvm.nvvm.cp.async.ca.shared.global.4(ptr addrspace(3) %a, ptr addrspace(1) %b)
+declare void @llvm.nvvm.cp.async.ca.shared.global.4.s(ptr addrspace(3) %a, ptr addrspace(1) %b, i32 %c)
 
 ; CHECK-LABEL: asynccasharedglobal4i8
-define void @asynccasharedglobal4i8(ptr addrspace(3) %a, ptr addrspace(1) %b) {
-; CHECK_PTX32: cp.async.ca.shared.global [%r{{[0-9]+}}], [%r{{[0-9]+}}], 4;
-; CHECK_PTX64: cp.async.ca.shared.global [%rd{{[0-9]+}}], [%rd{{[0-9]+}}], 4;
+define void @asynccasharedglobal4i8(ptr addrspace(3) %a, ptr addrspace(1) %b, i32 %c) {
+; CHECK: cp.async.ca.shared.global [%[[R]]{{[0-9]+}}], 

[PATCH] D150820: [NVPTX, CUDA] added optional src_size argument to __nvvm_cp_async*

2023-05-18 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Looks like the extra intrinsic argument broke MLIR. I'll need to figure out how 
to deal with that.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150820/new/

https://reviews.llvm.org/D150820



[PATCH] D150894: [CUDA] provide wrapper functions for new NVCC builtins.

2023-05-18 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 523472.
tra added a comment.

Put the wrappers behind __CUDA_ARCH__ >= 800, as these clang builtins are not
available on older GPUs.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150894/new/

https://reviews.llvm.org/D150894

Files:
  clang/lib/Headers/__clang_cuda_intrinsics.h


Index: clang/lib/Headers/__clang_cuda_intrinsics.h
===
--- clang/lib/Headers/__clang_cuda_intrinsics.h
+++ clang/lib/Headers/__clang_cuda_intrinsics.h
@@ -512,6 +512,68 @@
 __device__ inline cuuint32_t __nvvm_get_smem_pointer(void *__ptr) {
   return __nv_cvta_generic_to_shared_impl(__ptr);
 }
+
+#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800
+__device__ inline unsigned __reduce_add_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_add(__mask, __value);
+}
+__device__ inline int __reduce_add_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_add(__mask, __value);
+}
+__device__ inline unsigned __reduce_min_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_umin(__mask, __value);
+}
+__device__ inline unsigned __reduce_max_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_umax(__mask, __value);
+}
+__device__ inline int __reduce_min_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_min(__mask, __value);
+}
+__device__ inline int __reduce_max_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_max(__mask, __value);
+}
+__device__ inline unsigned __reduce_or_sync_unsigned_impl(unsigned __mask,
+  unsigned __value) {
+  return __nvvm_redux_sync_or(__mask, __value);
+}
+__device__ inline unsigned __reduce_and_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_and(__mask, __value);
+}
+__device__ inline unsigned __reduce_xor_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_xor(__mask, __value);
+}
+
+__device__ inline void
+__nv_memcpy_async_shared_global_4_impl(void *__dst, const void *__src,
+   unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_4(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+__device__ inline void
+__nv_memcpy_async_shared_global_8_impl(void *__dst, const void *__src,
+   unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_8(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+__device__ inline void
+__nv_memcpy_async_shared_global_16_impl(void *__dst, const void *__src,
+unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_16(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+#endif // !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800
+
 } // extern "C"
 #endif // CUDA_VERSION >= 11000
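[Editorial note, not part of the patch] A minimal usage sketch of the reduction wrappers above. It assumes the `__reduce_*_sync_*_impl` functions back CUDA's documented `__reduce_add_sync()`-family warp reduction builtins, which are available on sm_80 and newer; compile with something like `nvcc -arch=sm_80`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_lanes(const unsigned *in, unsigned *out) {
  unsigned v = in[threadIdx.x];
  // All 32 lanes of the full warp participate; every lane receives the sum.
  unsigned total = __reduce_add_sync(0xffffffffu, v);
  if (threadIdx.x == 0)
    *out = total;
}

int main() {
  unsigned *in, *out;
  cudaMallocManaged(&in, 32 * sizeof(unsigned));
  cudaMallocManaged(&out, sizeof(unsigned));
  for (int i = 0; i < 32; ++i)
    in[i] = 1;
  sum_lanes<<<1, 32>>>(in, out); // launch exactly one warp
  cudaDeviceSynchronize();
  printf("%u\n", *out); // expect 32 on sm_80+ hardware
  cudaFree(in);
  cudaFree(out);
}
```

The signed/unsigned `_impl` pairs in the patch mirror the overloads NVCC exposes, so clang can dispatch to `__nvvm_redux_sync_{add,min,max,umin,umax,...}` with the matching operand type.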
 


[PATCH] D150894: [CUDA] provide wrapper functions for new NVCC builtins.

2023-05-18 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 523466.
tra added a comment.

Prefix function args with `__`.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150894/new/

https://reviews.llvm.org/D150894

Files:
  clang/lib/Headers/__clang_cuda_intrinsics.h


Index: clang/lib/Headers/__clang_cuda_intrinsics.h
===
--- clang/lib/Headers/__clang_cuda_intrinsics.h
+++ clang/lib/Headers/__clang_cuda_intrinsics.h
@@ -512,6 +512,66 @@
 __device__ inline cuuint32_t __nvvm_get_smem_pointer(void *__ptr) {
   return __nv_cvta_generic_to_shared_impl(__ptr);
 }
+
+__device__ inline unsigned __reduce_add_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_add(__mask, __value);
+}
+__device__ inline int __reduce_add_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_add(__mask, __value);
+}
+__device__ inline unsigned __reduce_min_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_umin(__mask, __value);
+}
+__device__ inline unsigned __reduce_max_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_umax(__mask, __value);
+}
+__device__ inline int __reduce_min_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_min(__mask, __value);
+}
+__device__ inline int __reduce_max_sync_signed_impl(unsigned __mask,
+int __value) {
+  return __nvvm_redux_sync_max(__mask, __value);
+}
+__device__ inline unsigned __reduce_or_sync_unsigned_impl(unsigned __mask,
+  unsigned __value) {
+  return __nvvm_redux_sync_or(__mask, __value);
+}
+__device__ inline unsigned __reduce_and_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_and(__mask, __value);
+}
+__device__ inline unsigned __reduce_xor_sync_unsigned_impl(unsigned __mask,
+   unsigned __value) {
+  return __nvvm_redux_sync_xor(__mask, __value);
+}
+
+__device__ inline void
+__nv_memcpy_async_shared_global_4_impl(void *__dst, const void *__src,
+   unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_4(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+__device__ inline void
+__nv_memcpy_async_shared_global_8_impl(void *__dst, const void *__src,
+   unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_8(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+__device__ inline void
+__nv_memcpy_async_shared_global_16_impl(void *__dst, const void *__src,
+unsigned __src_size) {
+  __nvvm_cp_async_ca_shared_global_16(
+  (void __attribute__((address_space(3))) *)__dst,
+  (const void __attribute__((address_space(1))) *)__src, __src_size);
+}
+
 } // extern "C"
 #endif // CUDA_VERSION >= 11000
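[Editorial note, not part of the patch] A sketch of how the `__nv_memcpy_async_shared_global_*_impl` wrappers might be exercised. It assumes they back the cp.async pipeline primitives that CUDA ships in `cuda_pipeline_primitives.h` (`__pipeline_memcpy_async` and friends), which issue asynchronous global-to-shared copies on sm_80+.

```cuda
#include <cuda_pipeline_primitives.h>

__global__ void stage(const float *gsrc, float *gdst) {
  __shared__ float tile[128];
  // Each thread asynchronously copies one 4-byte element into shared memory;
  // under the hood this lowers to a cp.async.ca.shared.global instruction.
  __pipeline_memcpy_async(&tile[threadIdx.x], &gsrc[threadIdx.x],
                          sizeof(float));
  __pipeline_commit();
  __pipeline_wait_prior(0); // wait until the staged copy has landed
  gdst[threadIdx.x] = tile[threadIdx.x] * 2.0f;
}
```

Note the address-space casts in the wrappers: the `__nvvm_cp_async_ca_shared_global_*` builtins take `addrspace(3)` (shared) and `addrspace(1)` (global) pointers, so the generic-pointer arguments are cast before the call.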
 

