It looks like strings on s390x are emitted with 'align 2'. I'll remove the
alignment from the CHECK constraint, which should fix the test failure on
s390x. I'll commit the fix shortly.
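
Concretely, the fix would presumably relax the first check from

  // CHECK: private unnamed_addr constant{{.*}}kernelfunc{{.*}}\00", align 1

to something alignment-agnostic along the lines of

  // CHECK: private unnamed_addr constant{{.*}}kernelfunc{{.*}}\00"

so the test passes whether the target emits 'align 1' or 'align 2' for the
string constant.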

--Artem

On Mon, May 11, 2015 at 10:43 AM, Artem Belevich <[email protected]> wrote:

> Could you send me output of the CC1 executed by the test before it's piped
> into FileCheck?
>
> /scratch/hstong/workdir/Release+Asserts/bin/clang -cc1 -internal-isystem
> /scratch/hstong/workdir/Release+Asserts/bin/../lib/clang/3.7.0/include
> -nostdsysteminc -emit-llvm /gsa/tlbgsa-h1/08/hstong/pub/
> cfe_trunk/clang/test/CodeGenCUDA/device-stub.cu -fcuda-include-gpubinary
> /gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/
> device-stub.cu -o -
>
> Oh, and I see a typo in the script -- "CHEKC: call{{.*}}kernelfunc",
> though it's probably not what breaks the test in your case.
>
> --Artem
>
> On Sat, May 9, 2015 at 11:10 AM, Hubert Tong <
> [email protected]> wrote:
>
>> Hi Artem,
>>
>> I am encountering a failure with device-stub.cu on s390x-suse-linux. Can
>> you take a look?
>>
>> *Output:*
>> FAIL: Clang :: CodeGenCUDA/device-stub.cu (1986 of 21893)
>> ******************** TEST 'Clang :: CodeGenCUDA/device-stub.cu' FAILED
>> ********************
>> Script:
>> --
>> /scratch/hstong/workdir/Release+Asserts/bin/clang -cc1 -internal-isystem
>> /scratch/hstong/workdir/Release+Asserts/bin/../lib/clang/3.7.0/include
>> -nostdsysteminc -emit-llvm
>> /gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/
>> device-stub.cu -fcuda-include-gpubinary
>> /gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/
>> device-stub.cu -o - |
>> /scratch/hstong/workdir/Release+Asserts/bin/FileCheck
>> /gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/
>> device-stub.cu
>> --
>> Exit Code: 1
>>
>> Command Output (stderr):
>> --
>>
>> /gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/device-stub.cu:7:11:
>> error: expected string not found in input
>> // CHECK: private unnamed_addr constant{{.*}}kernelfunc{{.*}}\00", align 1
>>           ^
>> <stdin>:1:1: note: scanning from here
>> ; ModuleID =
>> '/gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/
>> device-stub.cu'
>> ^
>> <stdin>:13:298: note: possible intended match here
>> @1 = private unnamed_addr constant [2259 x i8] c"// RUN: %clang_cc1
>> -emit-llvm %s -fcuda-include-gpubinary %s -o - | FileCheck %s\0A\0A#include
>> \22Inputs/cuda.h\22\0A\0A// Make sure that all parts of GPU code
>> init/cleanup are there:\0A// * constant unnamed string with the kernel
>> name\0A// CHECK: private unnamed_addr
>> constant{{.*}}kernelfunc{{.*}}\5C00\22, align 1\0A// * constant unnamed
>> string with GPU binary\0A// CHECK: private unnamed_addr
>> constant{{.*}}\5C00\22\0A// * constant struct that wraps GPU binary\0A//
>> CHECK: @__cuda_fatbin_wrapper = internal constant { i32, i32, i8*, i8* }
>> \0A// CHECK: { i32 1180844977, i32 1, {{.*}}, i8* null }\0A// * variable to
>> save GPU binary handle after initialization\0A// CHECK:
>> @__cuda_gpubin_handle = internal global i8** null\0A// * Make sure our
>> constructor/destructor was added to global ctor/dtor list.\0A// CHECK:
>> @llvm.global_ctors = appending global {{.*}}@__cuda_module_ctor\0A// CHECK:
>> @llvm.global_dtors = appending global {{.*}}@__cuda_module_dtor\0A\0A//
>> Test that we build the correct number of calls to cudaSetupArgument
>> followed\0A// by a call to cudaLaunch.\0A\0A// CHECK:
>> define{{.*}}kernelfunc\0A// CHECK: call{{.*}}cudaSetupArgument\0A// CHECK:
>> call{{.*}}cudaSetupArgument\0A// CHECK: call{{.*}}cudaSetupArgument\0A//
>> CHECK: call{{.*}}cudaLaunch\0A__global__ void kernelfunc(int i, int j, int
>> k) {}\0A\0A// Test that we've built correct kernel launch sequence.\0A//
>> CHECK: define{{.*}}hostfunc\0A// CHECK: call{{.*}}cudaConfigureCall\0A//
>> CHEKC: call{{.*}}kernelfunc\0Avoid hostfunc(void) { kernelfunc<<<1, 1>>>(1,
>> 1, 1); }\0A\0A// Test that we've built a function to register kernels\0A//
>> CHECK: define internal void @__cuda_register_kernels\0A// CHECK:
>> call{{.*}}cudaRegisterFunction(i8** %0, {{.*}}kernelfunc\0A\0A// Test that
>> we've built contructor..\0A// CHECK: define internal void
>> @__cuda_module_ctor\0A// .. that calls
>> __cudaRegisterFatBinary(&__cuda_fatbin_wrapper)\0A// CHECK:
>> call{{.*}}cudaRegisterFatBinary{{.*}}__cuda_fatbin_wrapper\0A// .. stores
>> return value in __cuda_gpubin_handle\0A// CHECK-NEXT:
>> store{{.*}}__cuda_gpubin_handle\0A// .. and then calls
>> __cuda_register_kernels\0A// CHECK-NEXT: call void
>> @__cuda_register_kernels\0A\0A// Test that we've created destructor.\0A//
>> CHECK: define internal void @__cuda_module_dtor\0A// CHECK:
>> load{{.*}}__cuda_gpubin_handle\0A// CHECK-NEXT: call void
>> @__cudaUnregisterFatBinary\0A\0A\00", align 2
>>
>> ^
>>
>> --
>>
>> ********************
>>
>> *Build environment info:*
>> > g++ -v
>> Using built-in specs.
>> COLLECT_GCC=g++
>> COLLECT_LTO_WRAPPER=/usr/lib64/gcc/s390x-suse-linux/4.8/lto-wrapper
>> Target: s390x-suse-linux
>> Configured with: ../configure --prefix=/usr --infodir=/usr/share/info
>> --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64
>> --enable-languages=c,c++,objc,fortran,obj-c++,java
>> --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.8
>> --enable-ssp --disable-libssp --disable-plugin --with-bugurl=
>> http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux'
>> --disable-libgcj --disable-libmudflap --with-slibdir=/lib64
>> --with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new
>> --disable-libstdcxx-pch --enable-version-specific-runtime-libs
>> --enable-linker-build-id --enable-linux-futex --program-suffix=-4.8
>> --without-system-libunwind --with-tune=zEC12 --with-arch=z196
>> --with-long-double-128 --enable-decimal-float --build=s390x-suse-linux
>> --host=s390x-suse-linux
>> Thread model: posix
>> gcc version 4.8.3 20140627 [gcc-4_8-branch revision 212064] (SUSE Linux)
>>
>> Thanks,
>>
>>
>> Hubert Tong
>>
>> On Thu, May 7, 2015 at 2:34 PM, Artem Belevich <[email protected]> wrote:
>>
>>> Author: tra
>>> Date: Thu May  7 14:34:16 2015
>>> New Revision: 236765
>>>
>>> URL: http://llvm.org/viewvc/llvm-project?rev=236765&view=rev
>>> Log:
>>> [cuda] Include GPU binary into host object file and generate init/deinit
>>> code.
>>>
>>> - added -fcuda-include-gpubinary option to incorporate results of
>>>   device-side compilation into host-side one.
>>> - generate code to register GPU binaries and associated kernels
>>>   with CUDA runtime and clean-up on exit.
>>> - added test case for init/deinit code generation.
>>>
>>> Differential Revision: http://reviews.llvm.org/D9507
>>>
>>> Modified:
>>>     cfe/trunk/include/clang/Driver/CC1Options.td
>>>     cfe/trunk/include/clang/Frontend/CodeGenOptions.h
>>>     cfe/trunk/lib/CodeGen/CGCUDANV.cpp
>>>     cfe/trunk/lib/CodeGen/CGCUDARuntime.h
>>>     cfe/trunk/lib/CodeGen/CodeGenFunction.cpp
>>>     cfe/trunk/lib/CodeGen/CodeGenModule.cpp
>>>     cfe/trunk/lib/Frontend/CompilerInvocation.cpp
>>>     cfe/trunk/test/CodeGenCUDA/device-stub.cu
>>>
>>> Modified: cfe/trunk/include/clang/Driver/CC1Options.td
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Driver/CC1Options.td?rev=236765&r1=236764&r2=236765&view=diff
>>>
>>> ==============================================================================
>>> --- cfe/trunk/include/clang/Driver/CC1Options.td (original)
>>> +++ cfe/trunk/include/clang/Driver/CC1Options.td Thu May  7 14:34:16 2015
>>> @@ -631,6 +631,8 @@ def fcuda_allow_host_calls_from_host_dev
>>>  def fcuda_disable_target_call_checks : Flag<["-"],
>>>      "fcuda-disable-target-call-checks">,
>>>    HelpText<"Disable all cross-target (host, device, etc.) call checks
>>> in CUDA">;
>>> +def fcuda_include_gpubinary : Separate<["-"],
>>> "fcuda-include-gpubinary">,
>>> +  HelpText<"Incorporate CUDA device-side binary into host object
>>> file.">;
>>>
>>>  } // let Flags = [CC1Option]
>>>
>>>
>>> Modified: cfe/trunk/include/clang/Frontend/CodeGenOptions.h
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Frontend/CodeGenOptions.h?rev=236765&r1=236764&r2=236765&view=diff
>>>
>>> ==============================================================================
>>> --- cfe/trunk/include/clang/Frontend/CodeGenOptions.h (original)
>>> +++ cfe/trunk/include/clang/Frontend/CodeGenOptions.h Thu May  7
>>> 14:34:16 2015
>>> @@ -163,6 +163,11 @@ public:
>>>    /// Name of the profile file to use as input for -fprofile-instr-use
>>>    std::string InstrProfileInput;
>>>
>>> +  /// A list of file names passed with -fcuda-include-gpubinary options
>>> to
>>> +  /// forward to CUDA runtime back-end for incorporating them into
>>> host-side
>>> +  /// object file.
>>> +  std::vector<std::string> CudaGpuBinaryFileNames;
>>> +
>>>    /// Regular expression to select optimizations for which we should
>>> enable
>>>    /// optimization remarks. Transformation passes whose name matches
>>> this
>>>    /// expression (and support this feature), will emit a diagnostic
>>>
>>> Modified: cfe/trunk/lib/CodeGen/CGCUDANV.cpp
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CGCUDANV.cpp?rev=236765&r1=236764&r2=236765&view=diff
>>>
>>> ==============================================================================
>>> --- cfe/trunk/lib/CodeGen/CGCUDANV.cpp (original)
>>> +++ cfe/trunk/lib/CodeGen/CGCUDANV.cpp Thu May  7 14:34:16 2015
>>> @@ -20,7 +20,6 @@
>>>  #include "llvm/IR/CallSite.h"
>>>  #include "llvm/IR/Constants.h"
>>>  #include "llvm/IR/DerivedTypes.h"
>>> -#include <vector>
>>>
>>>  using namespace clang;
>>>  using namespace CodeGen;
>>> @@ -30,29 +29,66 @@ namespace {
>>>  class CGNVCUDARuntime : public CGCUDARuntime {
>>>
>>>  private:
>>> -  llvm::Type *IntTy, *SizeTy;
>>> -  llvm::PointerType *CharPtrTy, *VoidPtrTy;
>>> +  llvm::Type *IntTy, *SizeTy, *VoidTy;
>>> +  llvm::PointerType *CharPtrTy, *VoidPtrTy, *VoidPtrPtrTy;
>>> +
>>> +  /// Convenience reference to LLVM Context
>>> +  llvm::LLVMContext &Context;
>>> +  /// Convenience reference to the current module
>>> +  llvm::Module &TheModule;
>>> +  /// Keeps track of kernel launch stubs emitted in this module
>>> +  llvm::SmallVector<llvm::Function *, 16> EmittedKernels;
>>> +  /// Keeps track of variables containing handles of GPU binaries.
>>> Populated by
>>> +  /// ModuleCtorFunction() and used to create corresponding cleanup
>>> calls in
>>> +  /// ModuleDtorFunction()
>>> +  llvm::SmallVector<llvm::GlobalVariable *, 16> GpuBinaryHandles;
>>>
>>>    llvm::Constant *getSetupArgumentFn() const;
>>>    llvm::Constant *getLaunchFn() const;
>>>
>>> +  /// Creates a function to register all kernel stubs generated in this
>>> module.
>>> +  llvm::Function *makeRegisterKernelsFn();
>>> +
>>> +  /// Helper function that generates a constant string and returns a
>>> pointer to
>>> +  /// the start of the string.  The result of this function can be used
>>> anywhere
>>> +  /// where the C code specifies const char*.
>>> +  llvm::Constant *makeConstantString(const std::string &Str,
>>> +                                     const std::string &Name = "",
>>> +                                     unsigned Alignment = 0) {
>>> +    llvm::Constant *Zeros[] = {llvm::ConstantInt::get(SizeTy, 0),
>>> +                               llvm::ConstantInt::get(SizeTy, 0)};
>>> +    auto *ConstStr = CGM.GetAddrOfConstantCString(Str, Name.c_str());
>>> +    return
>>> llvm::ConstantExpr::getGetElementPtr(ConstStr->getValueType(),
>>> +                                                ConstStr, Zeros);
>>> + }
>>> +
>>> +  void emitDeviceStubBody(CodeGenFunction &CGF, FunctionArgList &Args);
>>> +
>>>  public:
>>>    CGNVCUDARuntime(CodeGenModule &CGM);
>>>
>>> -  void EmitDeviceStubBody(CodeGenFunction &CGF, FunctionArgList &Args)
>>> override;
>>> +  void emitDeviceStub(CodeGenFunction &CGF, FunctionArgList &Args)
>>> override;
>>> +  /// Creates module constructor function
>>> +  llvm::Function *makeModuleCtorFunction() override;
>>> +  /// Creates module destructor function
>>> +  llvm::Function *makeModuleDtorFunction() override;
>>>  };
>>>
>>>  }
>>>
>>> -CGNVCUDARuntime::CGNVCUDARuntime(CodeGenModule &CGM) :
>>> CGCUDARuntime(CGM) {
>>> +CGNVCUDARuntime::CGNVCUDARuntime(CodeGenModule &CGM)
>>> +    : CGCUDARuntime(CGM), Context(CGM.getLLVMContext()),
>>> +      TheModule(CGM.getModule()) {
>>>    CodeGen::CodeGenTypes &Types = CGM.getTypes();
>>>    ASTContext &Ctx = CGM.getContext();
>>>
>>>    IntTy = Types.ConvertType(Ctx.IntTy);
>>>    SizeTy = Types.ConvertType(Ctx.getSizeType());
>>> +  VoidTy = llvm::Type::getVoidTy(Context);
>>>
>>>    CharPtrTy =
>>> llvm::PointerType::getUnqual(Types.ConvertType(Ctx.CharTy));
>>>    VoidPtrTy = cast<llvm::PointerType>(Types.ConvertType(Ctx.VoidPtrTy));
>>> +  VoidPtrPtrTy = VoidPtrTy->getPointerTo();
>>>  }
>>>
>>>  llvm::Constant *CGNVCUDARuntime::getSetupArgumentFn() const {
>>> @@ -68,14 +104,17 @@ llvm::Constant *CGNVCUDARuntime::getSetu
>>>
>>>  llvm::Constant *CGNVCUDARuntime::getLaunchFn() const {
>>>    // cudaError_t cudaLaunch(char *)
>>> -  std::vector<llvm::Type*> Params;
>>> -  Params.push_back(CharPtrTy);
>>> -  return CGM.CreateRuntimeFunction(llvm::FunctionType::get(IntTy,
>>> -                                                           Params,
>>> false),
>>> -                                   "cudaLaunch");
>>> +  return CGM.CreateRuntimeFunction(
>>> +      llvm::FunctionType::get(IntTy, CharPtrTy, false), "cudaLaunch");
>>> +}
>>> +
>>> +void CGNVCUDARuntime::emitDeviceStub(CodeGenFunction &CGF,
>>> +                                     FunctionArgList &Args) {
>>> +  EmittedKernels.push_back(CGF.CurFn);
>>> +  emitDeviceStubBody(CGF, Args);
>>>  }
>>>
>>> -void CGNVCUDARuntime::EmitDeviceStubBody(CodeGenFunction &CGF,
>>> +void CGNVCUDARuntime::emitDeviceStubBody(CodeGenFunction &CGF,
>>>                                           FunctionArgList &Args) {
>>>    // Build the argument value list and the argument stack struct type.
>>>    SmallVector<llvm::Value *, 16> ArgValues;
>>> @@ -87,8 +126,7 @@ void CGNVCUDARuntime::EmitDeviceStubBody
>>>      assert(isa<llvm::PointerType>(V->getType()) && "Arg type not
>>> PointerType");
>>>
>>>  
>>> ArgTypes.push_back(cast<llvm::PointerType>(V->getType())->getElementType());
>>>    }
>>> -  llvm::StructType *ArgStackTy = llvm::StructType::get(
>>> -      CGF.getLLVMContext(), ArgTypes);
>>> +  llvm::StructType *ArgStackTy = llvm::StructType::get(Context,
>>> ArgTypes);
>>>
>>>    llvm::BasicBlock *EndBlock = CGF.createBasicBlock("setup.end");
>>>
>>> @@ -120,6 +158,160 @@ void CGNVCUDARuntime::EmitDeviceStubBody
>>>    CGF.EmitBlock(EndBlock);
>>>  }
>>>
>>> +/// Creates internal function to register all kernel stubs generated in
>>> this
>>> +/// module with the CUDA runtime.
>>> +/// \code
>>> +/// void __cuda_register_kernels(void** GpuBinaryHandle) {
>>> +///    __cudaRegisterFunction(GpuBinaryHandle,Kernel0,...);
>>> +///    ...
>>> +///    __cudaRegisterFunction(GpuBinaryHandle,KernelM,...);
>>> +/// }
>>> +/// \endcode
>>> +llvm::Function *CGNVCUDARuntime::makeRegisterKernelsFn() {
>>> +  llvm::Function *RegisterKernelsFunc = llvm::Function::Create(
>>> +      llvm::FunctionType::get(VoidTy, VoidPtrPtrTy, false),
>>> +      llvm::GlobalValue::InternalLinkage, "__cuda_register_kernels",
>>> &TheModule);
>>> +  llvm::BasicBlock *EntryBB =
>>> +      llvm::BasicBlock::Create(Context, "entry", RegisterKernelsFunc);
>>> +  CGBuilderTy Builder(Context);
>>> +  Builder.SetInsertPoint(EntryBB);
>>> +
>>> +  // void __cudaRegisterFunction(void **, const char *, char *, const
>>> char *,
>>> +  //                             int, uint3*, uint3*, dim3*, dim3*,
>>> int*)
>>> +  std::vector<llvm::Type *> RegisterFuncParams = {
>>> +      VoidPtrPtrTy, CharPtrTy, CharPtrTy, CharPtrTy, IntTy,
>>> +      VoidPtrTy,    VoidPtrTy, VoidPtrTy, VoidPtrTy,
>>> IntTy->getPointerTo()};
>>> +  llvm::Constant *RegisterFunc = CGM.CreateRuntimeFunction(
>>> +      llvm::FunctionType::get(IntTy, RegisterFuncParams, false),
>>> +      "__cudaRegisterFunction");
>>> +
>>> +  // Extract GpuBinaryHandle passed as the first argument passed to
>>> +  // __cuda_register_kernels() and generate __cudaRegisterFunction()
>>> call for
>>> +  // each emitted kernel.
>>> +  llvm::Argument &GpuBinaryHandlePtr =
>>> *RegisterKernelsFunc->arg_begin();
>>> +  for (llvm::Function *Kernel : EmittedKernels) {
>>> +    llvm::Constant *KernelName = makeConstantString(Kernel->getName());
>>> +    llvm::Constant *NullPtr = llvm::ConstantPointerNull::get(VoidPtrTy);
>>> +    llvm::Value *args[] = {
>>> +        &GpuBinaryHandlePtr, Builder.CreateBitCast(Kernel, VoidPtrTy),
>>> +        KernelName, KernelName, llvm::ConstantInt::get(IntTy, -1),
>>> NullPtr,
>>> +        NullPtr, NullPtr, NullPtr,
>>> +        llvm::ConstantPointerNull::get(IntTy->getPointerTo())};
>>> +    Builder.CreateCall(RegisterFunc, args);
>>> +  }
>>> +
>>> +  Builder.CreateRetVoid();
>>> +  return RegisterKernelsFunc;
>>> +}
>>> +
>>> +/// Creates a global constructor function for the module:
>>> +/// \code
>>> +/// void __cuda_module_ctor(void*) {
>>> +///     Handle0 = __cudaRegisterFatBinary(GpuBinaryBlob0);
>>> +///     __cuda_register_kernels(Handle0);
>>> +///     ...
>>> +///     HandleN = __cudaRegisterFatBinary(GpuBinaryBlobN);
>>> +///     __cuda_register_kernels(HandleN);
>>> +/// }
>>> +/// \endcode
>>> +llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
>>> +  // void __cuda_register_kernels(void* handle);
>>> +  llvm::Function *RegisterKernelsFunc = makeRegisterKernelsFn();
>>> +  // void ** __cudaRegisterFatBinary(void *);
>>> +  llvm::Constant *RegisterFatbinFunc = CGM.CreateRuntimeFunction(
>>> +      llvm::FunctionType::get(VoidPtrPtrTy, VoidPtrTy, false),
>>> +      "__cudaRegisterFatBinary");
>>> +  // struct { int magic, int version, void * gpu_binary, void *
>>> dont_care };
>>> +  llvm::StructType *FatbinWrapperTy =
>>> +      llvm::StructType::get(IntTy, IntTy, VoidPtrTy, VoidPtrTy,
>>> nullptr);
>>> +
>>> +  llvm::Function *ModuleCtorFunc = llvm::Function::Create(
>>> +      llvm::FunctionType::get(VoidTy, VoidPtrTy, false),
>>> +      llvm::GlobalValue::InternalLinkage, "__cuda_module_ctor",
>>> &TheModule);
>>> +  llvm::BasicBlock *CtorEntryBB =
>>> +      llvm::BasicBlock::Create(Context, "entry", ModuleCtorFunc);
>>> +  CGBuilderTy CtorBuilder(Context);
>>> +
>>> +  CtorBuilder.SetInsertPoint(CtorEntryBB);
>>> +
>>> +  // For each GPU binary, register it with the CUDA runtime and store
>>> returned
>>> +  // handle in a global variable and save the handle in
>>> GpuBinaryHandles vector
>>> +  // to be cleaned up in destructor on exit. Then associate all known
>>> kernels
>>> +  // with the GPU binary handle so CUDA runtime can figure out what to
>>> call on
>>> +  // the GPU side.
>>> +  for (const std::string &GpuBinaryFileName :
>>> +       CGM.getCodeGenOpts().CudaGpuBinaryFileNames) {
>>> +    llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> GpuBinaryOrErr =
>>> +        llvm::MemoryBuffer::getFileOrSTDIN(GpuBinaryFileName);
>>> +    if (std::error_code EC = GpuBinaryOrErr.getError()) {
>>> +      CGM.getDiags().Report(diag::err_cannot_open_file) <<
>>> GpuBinaryFileName
>>> +                                                        << EC.message();
>>> +      continue;
>>> +    }
>>> +
>>> +    // Create initialized wrapper structure that points to the loaded
>>> GPU binary
>>> +    llvm::Constant *Values[] = {
>>> +        llvm::ConstantInt::get(IntTy, 0x466243b1), // Fatbin wrapper
>>> magic.
>>> +        llvm::ConstantInt::get(IntTy, 1),          // Fatbin version.
>>> +        makeConstantString(GpuBinaryOrErr.get()->getBuffer(), "", 16),
>>> // Data.
>>> +        llvm::ConstantPointerNull::get(VoidPtrTy)}; // Unused in fatbin
>>> v1.
>>> +    llvm::GlobalVariable *FatbinWrapper = new llvm::GlobalVariable(
>>> +        TheModule, FatbinWrapperTy, true,
>>> llvm::GlobalValue::InternalLinkage,
>>> +        llvm::ConstantStruct::get(FatbinWrapperTy, Values),
>>> +        "__cuda_fatbin_wrapper");
>>> +
>>> +    // GpuBinaryHandle = __cudaRegisterFatBinary(&FatbinWrapper);
>>> +    llvm::CallInst *RegisterFatbinCall = CtorBuilder.CreateCall(
>>> +        RegisterFatbinFunc,
>>> +        CtorBuilder.CreateBitCast(FatbinWrapper, VoidPtrTy));
>>> +    llvm::GlobalVariable *GpuBinaryHandle = new llvm::GlobalVariable(
>>> +        TheModule, VoidPtrPtrTy, false,
>>> llvm::GlobalValue::InternalLinkage,
>>> +        llvm::ConstantPointerNull::get(VoidPtrPtrTy),
>>> "__cuda_gpubin_handle");
>>> +    CtorBuilder.CreateStore(RegisterFatbinCall, GpuBinaryHandle, false);
>>> +
>>> +    // Call __cuda_register_kernels(GpuBinaryHandle);
>>> +    CtorBuilder.CreateCall(RegisterKernelsFunc, RegisterFatbinCall);
>>> +
>>> +    // Save GpuBinaryHandle so we can unregister it in destructor.
>>> +    GpuBinaryHandles.push_back(GpuBinaryHandle);
>>> +  }
>>> +
>>> +  CtorBuilder.CreateRetVoid();
>>> +  return ModuleCtorFunc;
>>> +}
>>> +
>>> +/// Creates a global destructor function that unregisters all GPU code
>>> blobs
>>> +/// registered by constructor.
>>> +/// \code
>>> +/// void __cuda_module_dtor(void*) {
>>> +///     __cudaUnregisterFatBinary(Handle0);
>>> +///     ...
>>> +///     __cudaUnregisterFatBinary(HandleN);
>>> +/// }
>>> +/// \endcode
>>> +llvm::Function *CGNVCUDARuntime::makeModuleDtorFunction() {
>>> +  // void __cudaUnregisterFatBinary(void ** handle);
>>> +  llvm::Constant *UnregisterFatbinFunc = CGM.CreateRuntimeFunction(
>>> +      llvm::FunctionType::get(VoidTy, VoidPtrPtrTy, false),
>>> +      "__cudaUnregisterFatBinary");
>>> +
>>> +  llvm::Function *ModuleDtorFunc = llvm::Function::Create(
>>> +      llvm::FunctionType::get(VoidTy, VoidPtrTy, false),
>>> +      llvm::GlobalValue::InternalLinkage, "__cuda_module_dtor",
>>> &TheModule);
>>> +  llvm::BasicBlock *DtorEntryBB =
>>> +      llvm::BasicBlock::Create(Context, "entry", ModuleDtorFunc);
>>> +  CGBuilderTy DtorBuilder(Context);
>>> +  DtorBuilder.SetInsertPoint(DtorEntryBB);
>>> +
>>> +  for (llvm::GlobalVariable *GpuBinaryHandle : GpuBinaryHandles) {
>>> +    DtorBuilder.CreateCall(UnregisterFatbinFunc,
>>> +                           DtorBuilder.CreateLoad(GpuBinaryHandle,
>>> false));
>>> +  }
>>> +
>>> +  DtorBuilder.CreateRetVoid();
>>> +  return ModuleDtorFunc;
>>> +}
>>> +
>>>  CGCUDARuntime *CodeGen::CreateNVCUDARuntime(CodeGenModule &CGM) {
>>>    return new CGNVCUDARuntime(CGM);
>>>  }
>>>
>>> Modified: cfe/trunk/lib/CodeGen/CGCUDARuntime.h
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CGCUDARuntime.h?rev=236765&r1=236764&r2=236765&view=diff
>>>
>>> ==============================================================================
>>> --- cfe/trunk/lib/CodeGen/CGCUDARuntime.h (original)
>>> +++ cfe/trunk/lib/CodeGen/CGCUDARuntime.h Thu May  7 14:34:16 2015
>>> @@ -16,6 +16,10 @@
>>>  #ifndef LLVM_CLANG_LIB_CODEGEN_CGCUDARUNTIME_H
>>>  #define LLVM_CLANG_LIB_CODEGEN_CGCUDARUNTIME_H
>>>
>>> +namespace llvm {
>>> +class Function;
>>> +}
>>> +
>>>  namespace clang {
>>>
>>>  class CUDAKernelCallExpr;
>>> @@ -39,10 +43,17 @@ public:
>>>    virtual RValue EmitCUDAKernelCallExpr(CodeGenFunction &CGF,
>>>                                          const CUDAKernelCallExpr *E,
>>>                                          ReturnValueSlot ReturnValue);
>>> -
>>> -  virtual void EmitDeviceStubBody(CodeGenFunction &CGF,
>>> -                                  FunctionArgList &Args) = 0;
>>>
>>> +  /// Emits a kernel launch stub.
>>> +  virtual void emitDeviceStub(CodeGenFunction &CGF, FunctionArgList
>>> &Args) = 0;
>>> +
>>> +  /// Constructs and returns a module initialization function or
>>> nullptr if it's
>>> +  /// not needed. Must be called after all kernels have been emitted.
>>> +  virtual llvm::Function *makeModuleCtorFunction() = 0;
>>> +
>>> +  /// Returns a module cleanup function or nullptr if it's not needed.
>>> +  /// Must be called after ModuleCtorFunction
>>> +  virtual llvm::Function *makeModuleDtorFunction() = 0;
>>>  };
>>>
>>>  /// Creates an instance of a CUDA runtime class.
>>>
>>> Modified: cfe/trunk/lib/CodeGen/CodeGenFunction.cpp
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CodeGenFunction.cpp?rev=236765&r1=236764&r2=236765&view=diff
>>>
>>> ==============================================================================
>>> --- cfe/trunk/lib/CodeGen/CodeGenFunction.cpp (original)
>>> +++ cfe/trunk/lib/CodeGen/CodeGenFunction.cpp Thu May  7 14:34:16 2015
>>> @@ -878,7 +878,7 @@ void CodeGenFunction::GenerateCode(Globa
>>>    else if (getLangOpts().CUDA &&
>>>             !getLangOpts().CUDAIsDevice &&
>>>             FD->hasAttr<CUDAGlobalAttr>())
>>> -    CGM.getCUDARuntime().EmitDeviceStubBody(*this, Args);
>>> +    CGM.getCUDARuntime().emitDeviceStub(*this, Args);
>>>    else if (isa<CXXConversionDecl>(FD) &&
>>>
>>> cast<CXXConversionDecl>(FD)->isLambdaToBlockPointerConversion()) {
>>>      // The lambda conversion to block pointer is special; the semantics
>>> can't be
>>>
>>> Modified: cfe/trunk/lib/CodeGen/CodeGenModule.cpp
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CodeGenModule.cpp?rev=236765&r1=236764&r2=236765&view=diff
>>>
>>> ==============================================================================
>>> --- cfe/trunk/lib/CodeGen/CodeGenModule.cpp (original)
>>> +++ cfe/trunk/lib/CodeGen/CodeGenModule.cpp Thu May  7 14:34:16 2015
>>> @@ -350,6 +350,13 @@ void CodeGenModule::Release() {
>>>    if (ObjCRuntime)
>>>      if (llvm::Function *ObjCInitFunction =
>>> ObjCRuntime->ModuleInitFunction())
>>>        AddGlobalCtor(ObjCInitFunction);
>>> +  if (Context.getLangOpts().CUDA && !Context.getLangOpts().CUDAIsDevice
>>> &&
>>> +      CUDARuntime) {
>>> +    if (llvm::Function *CudaCtorFunction =
>>> CUDARuntime->makeModuleCtorFunction())
>>> +      AddGlobalCtor(CudaCtorFunction);
>>> +    if (llvm::Function *CudaDtorFunction =
>>> CUDARuntime->makeModuleDtorFunction())
>>> +      AddGlobalDtor(CudaDtorFunction);
>>> +  }
>>>    if (PGOReader && PGOStats.hasDiagnostics())
>>>      PGOStats.reportDiagnostics(getDiags(),
>>> getCodeGenOpts().MainFileName);
>>>    EmitCtorList(GlobalCtors, "llvm.global_ctors");
>>> @@ -3678,4 +3685,3 @@ void CodeGenModule::EmitOMPThreadPrivate
>>>        CXXGlobalInits.push_back(InitFunction);
>>>    }
>>>  }
>>> -
>>>
>>> Modified: cfe/trunk/lib/Frontend/CompilerInvocation.cpp
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Frontend/CompilerInvocation.cpp?rev=236765&r1=236764&r2=236765&view=diff
>>>
>>> ==============================================================================
>>> --- cfe/trunk/lib/Frontend/CompilerInvocation.cpp (original)
>>> +++ cfe/trunk/lib/Frontend/CompilerInvocation.cpp Thu May  7 14:34:16
>>> 2015
>>> @@ -651,6 +651,9 @@ static bool ParseCodeGenArgs(CodeGenOpti
>>>                        Args.getAllArgValues(OPT_fsanitize_recover_EQ),
>>> Diags,
>>>                        Opts.SanitizeRecover);
>>>
>>> +  Opts.CudaGpuBinaryFileNames =
>>> +      Args.getAllArgValues(OPT_fcuda_include_gpubinary);
>>> +
>>>    return Success;
>>>  }
>>>
>>>
>>> Modified: cfe/trunk/test/CodeGenCUDA/device-stub.cu
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/test/CodeGenCUDA/device-stub.cu?rev=236765&r1=236764&r2=236765&view=diff
>>>
>>> ==============================================================================
>>> --- cfe/trunk/test/CodeGenCUDA/device-stub.cu (original)
>>> +++ cfe/trunk/test/CodeGenCUDA/device-stub.cu Thu May  7 14:34:16 2015
>>> @@ -1,7 +1,21 @@
>>> -// RUN: %clang_cc1 -emit-llvm %s -o - | FileCheck %s
>>> +// RUN: %clang_cc1 -emit-llvm %s -fcuda-include-gpubinary %s -o - |
>>> FileCheck %s
>>>
>>>  #include "Inputs/cuda.h"
>>>
>>> +// Make sure that all parts of GPU code init/cleanup are there:
>>> +// * constant unnamed string with the kernel name
>>> +// CHECK: private unnamed_addr constant{{.*}}kernelfunc{{.*}}\00",
>>> align 1
>>> +// * constant unnamed string with GPU binary
>>> +// CHECK: private unnamed_addr constant{{.*}}\00"
>>> +// * constant struct that wraps GPU binary
>>> +// CHECK: @__cuda_fatbin_wrapper = internal constant { i32, i32, i8*,
>>> i8* }
>>> +// CHECK:       { i32 1180844977, i32 1, {{.*}}, i64 0, i64 0), i8*
>>> null }
>>> +// * variable to save GPU binary handle after initialization
>>> +// CHECK: @__cuda_gpubin_handle = internal global i8** null
>>> +// * Make sure our constructor/destructor was added to global ctor/dtor
>>> list.
>>> +// CHECK: @llvm.global_ctors = appending global
>>> {{.*}}@__cuda_module_ctor
>>> +// CHECK: @llvm.global_dtors = appending global
>>> {{.*}}@__cuda_module_dtor
>>> +
>>>  // Test that we build the correct number of calls to cudaSetupArgument
>>> followed
>>>  // by a call to cudaLaunch.
>>>
>>> @@ -11,3 +25,28 @@
>>>  // CHECK: call{{.*}}cudaSetupArgument
>>>  // CHECK: call{{.*}}cudaLaunch
>>>  __global__ void kernelfunc(int i, int j, int k) {}
>>> +
>>> +// Test that we've built correct kernel launch sequence.
>>> +// CHECK: define{{.*}}hostfunc
>>> +// CHECK: call{{.*}}cudaConfigureCall
>>> +// CHEKC: call{{.*}}kernelfunc
>>> +void hostfunc(void) { kernelfunc<<<1, 1>>>(1, 1, 1); }
>>> +
>>> +// Test that we've built a function to register kernels
>>> +// CHECK: define internal void @__cuda_register_kernels
>>> +// CHECK: call{{.*}}cudaRegisterFunction(i8** %0, {{.*}}kernelfunc
>>> +
>>> +// Test that we've built contructor..
>>> +// CHECK: define internal void @__cuda_module_ctor
>>> +//   .. that calls __cudaRegisterFatBinary(&__cuda_fatbin_wrapper)
>>> +// CHECK: call{{.*}}cudaRegisterFatBinary{{.*}}__cuda_fatbin_wrapper
>>> +//   .. stores return value in __cuda_gpubin_handle
>>> +// CHECK-NEXT: store{{.*}}__cuda_gpubin_handle
>>> +//   .. and then calls __cuda_register_kernels
>>> +// CHECK-NEXT: call void @__cuda_register_kernels
>>> +
>>> +// Test that we've created destructor.
>>> +// CHECK: define internal void @__cuda_module_dtor
>>> +// CHECK: load{{.*}}__cuda_gpubin_handle
>>> +// CHECK-NEXT: call void @__cudaUnregisterFatBinary
>>> +
>>>
>>>
>>> _______________________________________________
>>> cfe-commits mailing list
>>> [email protected]
>>> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
>>>
>>
>>
>
>
> --
> --Artem Belevich
>



-- 
--Artem Belevich