Please check whether r237007 fixes the failure for you.
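
For reference, here is the CHECK line that currently fails next to a minimal
sketch of the relaxed constraint (this assumes the fix simply drops the
explicit alignment from the CHECK line, as described in the quoted thread
below; the exact r237007 change may differ):

  // Before: requires "align 1", which fails on s390x where the string is emitted with "align 2".
  // CHECK: private unnamed_addr constant{{.*}}kernelfunc{{.*}}\00", align 1
  // After: leave the alignment unconstrained so any target value matches.
  // CHECK: private unnamed_addr constant{{.*}}kernelfunc{{.*}}\00"

--Artem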
On Mon, May 11, 2015 at 11:34 AM, Artem Belevich <[email protected]> wrote:

> It looks like on s390x strings come with 'align 2'. I'll remove align from
> the CHECK constraint and that should fix the test failure on s390x. I'll
> commit the fix shortly.
>
> --Artem
>
> On Mon, May 11, 2015 at 10:43 AM, Artem Belevich <[email protected]> wrote:
>
>> Could you send me output of the CC1 executed by the test before it's
>> piped into FileCheck?
>>
>> /scratch/hstong/workdir/Release+Asserts/bin/clang -cc1 -internal-isystem
>> /scratch/hstong/workdir/Release+Asserts/bin/../lib/clang/3.7.0/include
>> -nostdsysteminc -emit-llvm
>> /gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/device-stub.cu
>> -fcuda-include-gpubinary
>> /gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/device-stub.cu
>> -o -
>>
>> Oh, and I see a typo in the script -- "CHEKC: call{{.*}}kernelfunc",
>> though it's probably not what breaks the test in your case.
>>
>> --Artem
>>
>> On Sat, May 9, 2015 at 11:10 AM, Hubert Tong <[email protected]> wrote:
>>
>>> Hi Artem,
>>>
>>> I am encountering a failure with device-stub.cu on s390x-suse-linux.
>>> Can you take a look?
>>>
>>> *Output:*
>>> FAIL: Clang :: CodeGenCUDA/device-stub.cu (1986 of 21893)
>>> ******************** TEST 'Clang :: CodeGenCUDA/device-stub.cu' FAILED ********************
>>> Script:
>>> --
>>> /scratch/hstong/workdir/Release+Asserts/bin/clang -cc1 -internal-isystem /scratch/hstong/workdir/Release+Asserts/bin/../lib/clang/3.7.0/include -nostdsysteminc -emit-llvm /gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/device-stub.cu -fcuda-include-gpubinary /gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/device-stub.cu -o - | /scratch/hstong/workdir/Release+Asserts/bin/FileCheck /gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/device-stub.cu
>>> --
>>> Exit Code: 1
>>>
>>> Command Output (stderr):
>>> --
>>> /gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/device-stub.cu:7:11: error: expected string not found in input
>>> // CHECK: private unnamed_addr constant{{.*}}kernelfunc{{.*}}\00", align 1
>>>           ^
>>> <stdin>:1:1: note: scanning from here
>>> ; ModuleID = '/gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/device-stub.cu'
>>> ^
>>> <stdin>:13:298: note: possible intended match here
>>> @1 = private unnamed_addr constant [2259 x i8] c"// RUN: %clang_cc1 -emit-llvm %s -fcuda-include-gpubinary %s -o - | FileCheck %s\0A\0A#include \22Inputs/cuda.h\22\0A\0A// Make sure that all parts of GPU code init/cleanup are there:\0A// * constant unnamed string with the kernel name\0A// CHECK: private unnamed_addr constant{{.*}}kernelfunc{{.*}}\5C00\22, align 1\0A// * constant unnamed string with GPU binary\0A// CHECK: private unnamed_addr constant{{.*}}\5C00\22\0A// * constant struct that wraps GPU binary\0A// CHECK: @__cuda_fatbin_wrapper = internal constant { i32, i32, i8*, i8* } \0A// CHECK: { i32 1180844977, i32 1, {{.*}}, i8* null }\0A// * variable to save GPU binary handle after initialization\0A// CHECK: @__cuda_gpubin_handle = internal global i8** null\0A// * Make sure our constructor/destructor was added to global ctor/dtor list.\0A// CHECK: @llvm.global_ctors = appending global {{.*}}@__cuda_module_ctor\0A// CHECK: @llvm.global_dtors = appending global {{.*}}@__cuda_module_dtor\0A\0A// Test that we build the correct number of calls to cudaSetupArgument followed\0A// by a call to cudaLaunch.\0A\0A// CHECK: define{{.*}}kernelfunc\0A// CHECK: call{{.*}}cudaSetupArgument\0A// CHECK: call{{.*}}cudaSetupArgument\0A// CHECK: call{{.*}}cudaSetupArgument\0A// CHECK: call{{.*}}cudaLaunch\0A__global__ void kernelfunc(int i, int j, int k) {}\0A\0A// Test that we've built correct kernel launch sequence.\0A// CHECK: define{{.*}}hostfunc\0A// CHECK: call{{.*}}cudaConfigureCall\0A// CHEKC: call{{.*}}kernelfunc\0Avoid hostfunc(void) { kernelfunc<<<1, 1>>>(1, 1, 1); }\0A\0A// Test that we've built a function to register kernels\0A// CHECK: define internal void @__cuda_register_kernels\0A// CHECK: call{{.*}}cudaRegisterFunction(i8** %0, {{.*}}kernelfunc\0A\0A// Test that we've built contructor..\0A// CHECK: define internal void @__cuda_module_ctor\0A// .. that calls __cudaRegisterFatBinary(&__cuda_fatbin_wrapper)\0A// CHECK: call{{.*}}cudaRegisterFatBinary{{.*}}__cuda_fatbin_wrapper\0A// .. stores return value in __cuda_gpubin_handle\0A// CHECK-NEXT: store{{.*}}__cuda_gpubin_handle\0A// .. and then calls __cuda_register_kernels\0A// CHECK-NEXT: call void @__cuda_register_kernels\0A\0A// Test that we've created destructor.\0A// CHECK: define internal void @__cuda_module_dtor\0A// CHECK: load{{.*}}__cuda_gpubin_handle\0A// CHECK-NEXT: call void @__cudaUnregisterFatBinary\0A\0A\00", align 2
>>> ^
>>>
>>> --
>>>
>>> ********************
>>>
>>> *Build environment info:*
>>> > g++ -v
>>> Using built-in specs.
>>> COLLECT_GCC=g++
>>> COLLECT_LTO_WRAPPER=/usr/lib64/gcc/s390x-suse-linux/4.8/lto-wrapper
>>> Target: s390x-suse-linux
>>> Configured with: ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,java --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.8 --enable-ssp --disable-libssp --disable-plugin --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux' --disable-libgcj --disable-libmudflap --with-slibdir=/lib64 --with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-version-specific-runtime-libs --enable-linker-build-id --enable-linux-futex --program-suffix=-4.8 --without-system-libunwind --with-tune=zEC12 --with-arch=z196 --with-long-double-128 --enable-decimal-float --build=s390x-suse-linux --host=s390x-suse-linux
>>> Thread model: posix
>>> gcc version 4.8.3 20140627 [gcc-4_8-branch revision 212064] (SUSE Linux)
>>>
>>> Thanks,
>>>
>>> Hubert Tong
>>>
>>> On Thu, May 7, 2015 at 2:34 PM, Artem Belevich <[email protected]> wrote:
>>>
>>>> Author: tra
>>>> Date: Thu May 7 14:34:16 2015
>>>> New Revision: 236765
>>>>
>>>> URL: http://llvm.org/viewvc/llvm-project?rev=236765&view=rev
>>>> Log:
>>>> [cuda] Include GPU binary into host object file and generate
>>>> init/deinit code.
>>>>
>>>> - added -fcuda-include-gpubinary option to incorporate results of
>>>>   device-side compilation into host-side one.
>>>> - generate code to register GPU binaries and associated kernels
>>>>   with CUDA runtime and clean-up on exit.
>>>> - added test case for init/deinit code generation.
>>>>
>>>> Differential Revision: http://reviews.llvm.org/D9507
>>>>
>>>> Modified:
>>>>     cfe/trunk/include/clang/Driver/CC1Options.td
>>>>     cfe/trunk/include/clang/Frontend/CodeGenOptions.h
>>>>     cfe/trunk/lib/CodeGen/CGCUDANV.cpp
>>>>     cfe/trunk/lib/CodeGen/CGCUDARuntime.h
>>>>     cfe/trunk/lib/CodeGen/CodeGenFunction.cpp
>>>>     cfe/trunk/lib/CodeGen/CodeGenModule.cpp
>>>>     cfe/trunk/lib/Frontend/CompilerInvocation.cpp
>>>>     cfe/trunk/test/CodeGenCUDA/device-stub.cu
>>>>
>>>> Modified: cfe/trunk/include/clang/Driver/CC1Options.td
>>>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Driver/CC1Options.td?rev=236765&r1=236764&r2=236765&view=diff
>>>> ==============================================================================
>>>> --- cfe/trunk/include/clang/Driver/CC1Options.td (original)
>>>> +++ cfe/trunk/include/clang/Driver/CC1Options.td Thu May 7 14:34:16 2015
>>>> @@ -631,6 +631,8 @@ def fcuda_allow_host_calls_from_host_dev
>>>>  def fcuda_disable_target_call_checks : Flag<["-"], "fcuda-disable-target-call-checks">,
>>>>    HelpText<"Disable all cross-target (host, device, etc.) call checks in CUDA">;
>>>> +def fcuda_include_gpubinary : Separate<["-"], "fcuda-include-gpubinary">,
>>>> +  HelpText<"Incorporate CUDA device-side binary into host object file.">;
>>>>
>>>>  } // let Flags = [CC1Option]
>>>>
>>>>
>>>> Modified: cfe/trunk/include/clang/Frontend/CodeGenOptions.h
>>>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Frontend/CodeGenOptions.h?rev=236765&r1=236764&r2=236765&view=diff
>>>> ==============================================================================
>>>> --- cfe/trunk/include/clang/Frontend/CodeGenOptions.h (original)
>>>> +++ cfe/trunk/include/clang/Frontend/CodeGenOptions.h Thu May 7 14:34:16 2015
>>>> @@ -163,6 +163,11 @@ public:
>>>>    /// Name of the profile file to use as input for -fprofile-instr-use
>>>>    std::string InstrProfileInput;
>>>>
>>>> +  /// A list of file names passed with -fcuda-include-gpubinary options to
>>>> +  /// forward to CUDA runtime back-end for incorporating them into host-side
>>>> +  /// object file.
>>>> +  std::vector<std::string> CudaGpuBinaryFileNames;
>>>> +
>>>>    /// Regular expression to select optimizations for which we should enable
>>>>    /// optimization remarks. Transformation passes whose name matches this
>>>>    /// expression (and support this feature), will emit a diagnostic
>>>>
>>>> Modified: cfe/trunk/lib/CodeGen/CGCUDANV.cpp
>>>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CGCUDANV.cpp?rev=236765&r1=236764&r2=236765&view=diff
>>>> ==============================================================================
>>>> --- cfe/trunk/lib/CodeGen/CGCUDANV.cpp (original)
>>>> +++ cfe/trunk/lib/CodeGen/CGCUDANV.cpp Thu May 7 14:34:16 2015
>>>> @@ -20,7 +20,6 @@
>>>>  #include "llvm/IR/CallSite.h"
>>>>  #include "llvm/IR/Constants.h"
>>>>  #include "llvm/IR/DerivedTypes.h"
>>>> -#include <vector>
>>>>
>>>>  using namespace clang;
>>>>  using namespace CodeGen;
>>>> @@ -30,29 +29,66 @@ namespace {
>>>>  class CGNVCUDARuntime : public CGCUDARuntime {
>>>>
>>>>  private:
>>>> -  llvm::Type *IntTy, *SizeTy;
>>>> -  llvm::PointerType *CharPtrTy, *VoidPtrTy;
>>>> +  llvm::Type *IntTy, *SizeTy, *VoidTy;
>>>> +  llvm::PointerType *CharPtrTy, *VoidPtrTy, *VoidPtrPtrTy;
>>>> +
>>>> +  /// Convenience reference to LLVM Context
>>>> +  llvm::LLVMContext &Context;
>>>> +  /// Convenience reference to the current module
>>>> +  llvm::Module &TheModule;
>>>> +  /// Keeps track of kernel launch stubs emitted in this module
>>>> +  llvm::SmallVector<llvm::Function *, 16> EmittedKernels;
>>>> +  /// Keeps track of variables containing handles of GPU binaries. Populated by
>>>> +  /// ModuleCtorFunction() and used to create corresponding cleanup calls in
>>>> +  /// ModuleDtorFunction()
>>>> +  llvm::SmallVector<llvm::GlobalVariable *, 16> GpuBinaryHandles;
>>>>
>>>>    llvm::Constant *getSetupArgumentFn() const;
>>>>    llvm::Constant *getLaunchFn() const;
>>>>
>>>> +  /// Creates a function to register all kernel stubs generated in this module.
>>>> +  llvm::Function *makeRegisterKernelsFn();
>>>> +
>>>> +  /// Helper function that generates a constant string and returns a pointer to
>>>> +  /// the start of the string. The result of this function can be used anywhere
>>>> +  /// where the C code specifies const char*.
>>>> +  llvm::Constant *makeConstantString(const std::string &Str,
>>>> +                                     const std::string &Name = "",
>>>> +                                     unsigned Alignment = 0) {
>>>> +    llvm::Constant *Zeros[] = {llvm::ConstantInt::get(SizeTy, 0),
>>>> +                               llvm::ConstantInt::get(SizeTy, 0)};
>>>> +    auto *ConstStr = CGM.GetAddrOfConstantCString(Str, Name.c_str());
>>>> +    return llvm::ConstantExpr::getGetElementPtr(ConstStr->getValueType(),
>>>> +                                                ConstStr, Zeros);
>>>> +  }
>>>> +
>>>> +  void emitDeviceStubBody(CodeGenFunction &CGF, FunctionArgList &Args);
>>>> +
>>>>  public:
>>>>    CGNVCUDARuntime(CodeGenModule &CGM);
>>>>
>>>> -  void EmitDeviceStubBody(CodeGenFunction &CGF, FunctionArgList &Args) override;
>>>> +  void emitDeviceStub(CodeGenFunction &CGF, FunctionArgList &Args) override;
>>>> +  /// Creates module constructor function
>>>> +  llvm::Function *makeModuleCtorFunction() override;
>>>> +  /// Creates module destructor function
>>>> +  llvm::Function *makeModuleDtorFunction() override;
>>>>  };
>>>>
>>>>  }
>>>>
>>>> -CGNVCUDARuntime::CGNVCUDARuntime(CodeGenModule &CGM) : CGCUDARuntime(CGM) {
>>>> +CGNVCUDARuntime::CGNVCUDARuntime(CodeGenModule &CGM)
>>>> +    : CGCUDARuntime(CGM), Context(CGM.getLLVMContext()),
>>>> +      TheModule(CGM.getModule()) {
>>>>    CodeGen::CodeGenTypes &Types = CGM.getTypes();
>>>>    ASTContext &Ctx = CGM.getContext();
>>>>
>>>>    IntTy = Types.ConvertType(Ctx.IntTy);
>>>>    SizeTy = Types.ConvertType(Ctx.getSizeType());
>>>> +  VoidTy = llvm::Type::getVoidTy(Context);
>>>>
>>>>    CharPtrTy = llvm::PointerType::getUnqual(Types.ConvertType(Ctx.CharTy));
>>>>    VoidPtrTy = cast<llvm::PointerType>(Types.ConvertType(Ctx.VoidPtrTy));
>>>> +  VoidPtrPtrTy = VoidPtrTy->getPointerTo();
>>>>  }
>>>>
>>>>  llvm::Constant *CGNVCUDARuntime::getSetupArgumentFn() const {
>>>> @@ -68,14 +104,17 @@ llvm::Constant *CGNVCUDARuntime::getSetu
>>>>
>>>>  llvm::Constant *CGNVCUDARuntime::getLaunchFn() const {
>>>>    // cudaError_t cudaLaunch(char *)
>>>> -  std::vector<llvm::Type*> Params;
>>>> -  Params.push_back(CharPtrTy);
>>>> -  return CGM.CreateRuntimeFunction(llvm::FunctionType::get(IntTy,
>>>> -                                                           Params, false),
>>>> -                                   "cudaLaunch");
>>>> +  return CGM.CreateRuntimeFunction(
>>>> +      llvm::FunctionType::get(IntTy, CharPtrTy, false), "cudaLaunch");
>>>> +}
>>>> +
>>>> +void CGNVCUDARuntime::emitDeviceStub(CodeGenFunction &CGF,
>>>> +                                     FunctionArgList &Args) {
>>>> +  EmittedKernels.push_back(CGF.CurFn);
>>>> +  emitDeviceStubBody(CGF, Args);
>>>>  }
>>>>
>>>> -void CGNVCUDARuntime::EmitDeviceStubBody(CodeGenFunction &CGF,
>>>> +void CGNVCUDARuntime::emitDeviceStubBody(CodeGenFunction &CGF,
>>>>                                           FunctionArgList &Args) {
>>>>    // Build the argument value list and the argument stack struct type.
>>>>    SmallVector<llvm::Value *, 16> ArgValues;
>>>> @@ -87,8 +126,7 @@ void CGNVCUDARuntime::EmitDeviceStubBody
>>>>      assert(isa<llvm::PointerType>(V->getType()) && "Arg type not PointerType");
>>>>      ArgTypes.push_back(cast<llvm::PointerType>(V->getType())->getElementType());
>>>>    }
>>>> -  llvm::StructType *ArgStackTy = llvm::StructType::get(
>>>> -      CGF.getLLVMContext(), ArgTypes);
>>>> +  llvm::StructType *ArgStackTy = llvm::StructType::get(Context, ArgTypes);
>>>>
>>>>    llvm::BasicBlock *EndBlock = CGF.createBasicBlock("setup.end");
>>>>
>>>> @@ -120,6 +158,160 @@ void CGNVCUDARuntime::EmitDeviceStubBody
>>>>    CGF.EmitBlock(EndBlock);
>>>>  }
>>>>
>>>> +/// Creates internal function to register all kernel stubs generated in this
>>>> +/// module with the CUDA runtime.
>>>> +/// \code >>>> +/// void __cuda_register_kernels(void** GpuBinaryHandle) { >>>> +/// __cudaRegisterFunction(GpuBinaryHandle,Kernel0,...); >>>> +/// ... >>>> +/// __cudaRegisterFunction(GpuBinaryHandle,KernelM,...); >>>> +/// } >>>> +/// \endcode >>>> +llvm::Function *CGNVCUDARuntime::makeRegisterKernelsFn() { >>>> + llvm::Function *RegisterKernelsFunc = llvm::Function::Create( >>>> + llvm::FunctionType::get(VoidTy, VoidPtrPtrTy, false), >>>> + llvm::GlobalValue::InternalLinkage, "__cuda_register_kernels", >>>> &TheModule); >>>> + llvm::BasicBlock *EntryBB = >>>> + llvm::BasicBlock::Create(Context, "entry", RegisterKernelsFunc); >>>> + CGBuilderTy Builder(Context); >>>> + Builder.SetInsertPoint(EntryBB); >>>> + >>>> + // void __cudaRegisterFunction(void **, const char *, char *, const >>>> char *, >>>> + // int, uint3*, uint3*, dim3*, dim3*, >>>> int*) >>>> + std::vector<llvm::Type *> RegisterFuncParams = { >>>> + VoidPtrPtrTy, CharPtrTy, CharPtrTy, CharPtrTy, IntTy, >>>> + VoidPtrTy, VoidPtrTy, VoidPtrTy, VoidPtrTy, >>>> IntTy->getPointerTo()}; >>>> + llvm::Constant *RegisterFunc = CGM.CreateRuntimeFunction( >>>> + llvm::FunctionType::get(IntTy, RegisterFuncParams, false), >>>> + "__cudaRegisterFunction"); >>>> + >>>> + // Extract GpuBinaryHandle passed as the first argument passed to >>>> + // __cuda_register_kernels() and generate __cudaRegisterFunction() >>>> call for >>>> + // each emitted kernel. >>>> + llvm::Argument &GpuBinaryHandlePtr = >>>> *RegisterKernelsFunc->arg_begin(); >>>> + for (llvm::Function *Kernel : EmittedKernels) { >>>> + llvm::Constant *KernelName = makeConstantString(Kernel->getName()); >>>> + llvm::Constant *NullPtr = >>>> llvm::ConstantPointerNull::get(VoidPtrTy); >>>> + llvm::Value *args[] = { >>>> + &GpuBinaryHandlePtr, Builder.CreateBitCast(Kernel, VoidPtrTy), >>>> + KernelName, KernelName, llvm::ConstantInt::get(IntTy, -1), >>>> NullPtr, >>>> + NullPtr, NullPtr, NullPtr, >>>> + llvm::ConstantPointerNull::get(IntTy->getPointerTo())}; >>>> + Builder.CreateCall(RegisterFunc, args); >>>> + } >>>> + >>>> + Builder.CreateRetVoid(); >>>> + return RegisterKernelsFunc; >>>> +} >>>> + >>>> +/// Creates a global constructor function for the module: >>>> +/// \code >>>> +/// void __cuda_module_ctor(void*) { >>>> +/// Handle0 = __cudaRegisterFatBinary(GpuBinaryBlob0); >>>> +/// __cuda_register_kernels(Handle0); >>>> +/// ... 
>>>> +/// HandleN = __cudaRegisterFatBinary(GpuBinaryBlobN); >>>> +/// __cuda_register_kernels(HandleN); >>>> +/// } >>>> +/// \endcode >>>> +llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() { >>>> + // void __cuda_register_kernels(void* handle); >>>> + llvm::Function *RegisterKernelsFunc = makeRegisterKernelsFn(); >>>> + // void ** __cudaRegisterFatBinary(void *); >>>> + llvm::Constant *RegisterFatbinFunc = CGM.CreateRuntimeFunction( >>>> + llvm::FunctionType::get(VoidPtrPtrTy, VoidPtrTy, false), >>>> + "__cudaRegisterFatBinary"); >>>> + // struct { int magic, int version, void * gpu_binary, void * >>>> dont_care }; >>>> + llvm::StructType *FatbinWrapperTy = >>>> + llvm::StructType::get(IntTy, IntTy, VoidPtrTy, VoidPtrTy, >>>> nullptr); >>>> + >>>> + llvm::Function *ModuleCtorFunc = llvm::Function::Create( >>>> + llvm::FunctionType::get(VoidTy, VoidPtrTy, false), >>>> + llvm::GlobalValue::InternalLinkage, "__cuda_module_ctor", >>>> &TheModule); >>>> + llvm::BasicBlock *CtorEntryBB = >>>> + llvm::BasicBlock::Create(Context, "entry", ModuleCtorFunc); >>>> + CGBuilderTy CtorBuilder(Context); >>>> + >>>> + CtorBuilder.SetInsertPoint(CtorEntryBB); >>>> + >>>> + // For each GPU binary, register it with the CUDA runtime and store >>>> returned >>>> + // handle in a global variable and save the handle in >>>> GpuBinaryHandles vector >>>> + // to be cleaned up in destructor on exit. Then associate all known >>>> kernels >>>> + // with the GPU binary handle so CUDA runtime can figure out what to >>>> call on >>>> + // the GPU side. >>>> + for (const std::string &GpuBinaryFileName : >>>> + CGM.getCodeGenOpts().CudaGpuBinaryFileNames) { >>>> + llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> GpuBinaryOrErr = >>>> + llvm::MemoryBuffer::getFileOrSTDIN(GpuBinaryFileName); >>>> + if (std::error_code EC = GpuBinaryOrErr.getError()) { >>>> + CGM.getDiags().Report(diag::err_cannot_open_file) << >>>> GpuBinaryFileName >>>> + << >>>> EC.message(); >>>> + continue; >>>> + } >>>> + >>>> + // Create initialized wrapper structure that points to the loaded >>>> GPU binary >>>> + llvm::Constant *Values[] = { >>>> + llvm::ConstantInt::get(IntTy, 0x466243b1), // Fatbin wrapper >>>> magic. >>>> + llvm::ConstantInt::get(IntTy, 1), // Fatbin version. >>>> + makeConstantString(GpuBinaryOrErr.get()->getBuffer(), "", 16), >>>> // Data. >>>> + llvm::ConstantPointerNull::get(VoidPtrTy)}; // Unused in >>>> fatbin v1. >>>> + llvm::GlobalVariable *FatbinWrapper = new llvm::GlobalVariable( >>>> + TheModule, FatbinWrapperTy, true, >>>> llvm::GlobalValue::InternalLinkage, >>>> + llvm::ConstantStruct::get(FatbinWrapperTy, Values), >>>> + "__cuda_fatbin_wrapper"); >>>> + >>>> + // GpuBinaryHandle = __cudaRegisterFatBinary(&FatbinWrapper); >>>> + llvm::CallInst *RegisterFatbinCall = CtorBuilder.CreateCall( >>>> + RegisterFatbinFunc, >>>> + CtorBuilder.CreateBitCast(FatbinWrapper, VoidPtrTy)); >>>> + llvm::GlobalVariable *GpuBinaryHandle = new llvm::GlobalVariable( >>>> + TheModule, VoidPtrPtrTy, false, >>>> llvm::GlobalValue::InternalLinkage, >>>> + llvm::ConstantPointerNull::get(VoidPtrPtrTy), >>>> "__cuda_gpubin_handle"); >>>> + CtorBuilder.CreateStore(RegisterFatbinCall, GpuBinaryHandle, >>>> false); >>>> + >>>> + // Call __cuda_register_kernels(GpuBinaryHandle); >>>> + CtorBuilder.CreateCall(RegisterKernelsFunc, RegisterFatbinCall); >>>> + >>>> + // Save GpuBinaryHandle so we can unregister it in destructor. 
>>>> + GpuBinaryHandles.push_back(GpuBinaryHandle); >>>> + } >>>> + >>>> + CtorBuilder.CreateRetVoid(); >>>> + return ModuleCtorFunc; >>>> +} >>>> + >>>> +/// Creates a global destructor function that unregisters all GPU code >>>> blobs >>>> +/// registered by constructor. >>>> +/// \code >>>> +/// void __cuda_module_dtor(void*) { >>>> +/// __cudaUnregisterFatBinary(Handle0); >>>> +/// ... >>>> +/// __cudaUnregisterFatBinary(HandleN); >>>> +/// } >>>> +/// \endcode >>>> +llvm::Function *CGNVCUDARuntime::makeModuleDtorFunction() { >>>> + // void __cudaUnregisterFatBinary(void ** handle); >>>> + llvm::Constant *UnregisterFatbinFunc = CGM.CreateRuntimeFunction( >>>> + llvm::FunctionType::get(VoidTy, VoidPtrPtrTy, false), >>>> + "__cudaUnregisterFatBinary"); >>>> + >>>> + llvm::Function *ModuleDtorFunc = llvm::Function::Create( >>>> + llvm::FunctionType::get(VoidTy, VoidPtrTy, false), >>>> + llvm::GlobalValue::InternalLinkage, "__cuda_module_dtor", >>>> &TheModule); >>>> + llvm::BasicBlock *DtorEntryBB = >>>> + llvm::BasicBlock::Create(Context, "entry", ModuleDtorFunc); >>>> + CGBuilderTy DtorBuilder(Context); >>>> + DtorBuilder.SetInsertPoint(DtorEntryBB); >>>> + >>>> + for (llvm::GlobalVariable *GpuBinaryHandle : GpuBinaryHandles) { >>>> + DtorBuilder.CreateCall(UnregisterFatbinFunc, >>>> + DtorBuilder.CreateLoad(GpuBinaryHandle, >>>> false)); >>>> + } >>>> + >>>> + DtorBuilder.CreateRetVoid(); >>>> + return ModuleDtorFunc; >>>> +} >>>> + >>>> CGCUDARuntime *CodeGen::CreateNVCUDARuntime(CodeGenModule &CGM) { >>>> return new CGNVCUDARuntime(CGM); >>>> } >>>> >>>> Modified: cfe/trunk/lib/CodeGen/CGCUDARuntime.h >>>> URL: >>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CGCUDARuntime.h?rev=236765&r1=236764&r2=236765&view=diff >>>> >>>> ============================================================================== >>>> --- cfe/trunk/lib/CodeGen/CGCUDARuntime.h (original) >>>> +++ cfe/trunk/lib/CodeGen/CGCUDARuntime.h Thu May 7 14:34:16 2015 >>>> @@ -16,6 +16,10 @@ >>>> #ifndef LLVM_CLANG_LIB_CODEGEN_CGCUDARUNTIME_H >>>> #define LLVM_CLANG_LIB_CODEGEN_CGCUDARUNTIME_H >>>> >>>> +namespace llvm { >>>> +class Function; >>>> +} >>>> + >>>> namespace clang { >>>> >>>> class CUDAKernelCallExpr; >>>> @@ -39,10 +43,17 @@ public: >>>> virtual RValue EmitCUDAKernelCallExpr(CodeGenFunction &CGF, >>>> const CUDAKernelCallExpr *E, >>>> ReturnValueSlot ReturnValue); >>>> - >>>> - virtual void EmitDeviceStubBody(CodeGenFunction &CGF, >>>> - FunctionArgList &Args) = 0; >>>> >>>> + /// Emits a kernel launch stub. >>>> + virtual void emitDeviceStub(CodeGenFunction &CGF, FunctionArgList >>>> &Args) = 0; >>>> + >>>> + /// Constructs and returns a module initialization function or >>>> nullptr if it's >>>> + /// not needed. Must be called after all kernels have been emitted. >>>> + virtual llvm::Function *makeModuleCtorFunction() = 0; >>>> + >>>> + /// Returns a module cleanup function or nullptr if it's not needed. >>>> + /// Must be called after ModuleCtorFunction >>>> + virtual llvm::Function *makeModuleDtorFunction() = 0; >>>> }; >>>> >>>> /// Creates an instance of a CUDA runtime class. 
>>>>
>>>> Modified: cfe/trunk/lib/CodeGen/CodeGenFunction.cpp
>>>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CodeGenFunction.cpp?rev=236765&r1=236764&r2=236765&view=diff
>>>> ==============================================================================
>>>> --- cfe/trunk/lib/CodeGen/CodeGenFunction.cpp (original)
>>>> +++ cfe/trunk/lib/CodeGen/CodeGenFunction.cpp Thu May 7 14:34:16 2015
>>>> @@ -878,7 +878,7 @@ void CodeGenFunction::GenerateCode(Globa
>>>>    else if (getLangOpts().CUDA &&
>>>>             !getLangOpts().CUDAIsDevice &&
>>>>             FD->hasAttr<CUDAGlobalAttr>())
>>>> -    CGM.getCUDARuntime().EmitDeviceStubBody(*this, Args);
>>>> +    CGM.getCUDARuntime().emitDeviceStub(*this, Args);
>>>>    else if (isa<CXXConversionDecl>(FD) &&
>>>>             cast<CXXConversionDecl>(FD)->isLambdaToBlockPointerConversion()) {
>>>>      // The lambda conversion to block pointer is special; the semantics can't be
>>>>
>>>> Modified: cfe/trunk/lib/CodeGen/CodeGenModule.cpp
>>>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CodeGenModule.cpp?rev=236765&r1=236764&r2=236765&view=diff
>>>> ==============================================================================
>>>> --- cfe/trunk/lib/CodeGen/CodeGenModule.cpp (original)
>>>> +++ cfe/trunk/lib/CodeGen/CodeGenModule.cpp Thu May 7 14:34:16 2015
>>>> @@ -350,6 +350,13 @@ void CodeGenModule::Release() {
>>>>    if (ObjCRuntime)
>>>>      if (llvm::Function *ObjCInitFunction = ObjCRuntime->ModuleInitFunction())
>>>>        AddGlobalCtor(ObjCInitFunction);
>>>> +  if (Context.getLangOpts().CUDA && !Context.getLangOpts().CUDAIsDevice &&
>>>> +      CUDARuntime) {
>>>> +    if (llvm::Function *CudaCtorFunction = CUDARuntime->makeModuleCtorFunction())
>>>> +      AddGlobalCtor(CudaCtorFunction);
>>>> +    if (llvm::Function *CudaDtorFunction = CUDARuntime->makeModuleDtorFunction())
>>>> +      AddGlobalDtor(CudaDtorFunction);
>>>> +  }
>>>>    if (PGOReader && PGOStats.hasDiagnostics())
>>>>      PGOStats.reportDiagnostics(getDiags(), getCodeGenOpts().MainFileName);
>>>>    EmitCtorList(GlobalCtors, "llvm.global_ctors");
>>>> @@ -3678,4 +3685,3 @@ void CodeGenModule::EmitOMPThreadPrivate
>>>>      CXXGlobalInits.push_back(InitFunction);
>>>>    }
>>>>  }
>>>> -
>>>>
>>>> Modified: cfe/trunk/lib/Frontend/CompilerInvocation.cpp
>>>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Frontend/CompilerInvocation.cpp?rev=236765&r1=236764&r2=236765&view=diff
>>>> ==============================================================================
>>>> --- cfe/trunk/lib/Frontend/CompilerInvocation.cpp (original)
>>>> +++ cfe/trunk/lib/Frontend/CompilerInvocation.cpp Thu May 7 14:34:16 2015
>>>> @@ -651,6 +651,9 @@ static bool ParseCodeGenArgs(CodeGenOpti
>>>>                        Args.getAllArgValues(OPT_fsanitize_recover_EQ), Diags,
>>>>                        Opts.SanitizeRecover);
>>>>
>>>> +  Opts.CudaGpuBinaryFileNames =
>>>> +      Args.getAllArgValues(OPT_fcuda_include_gpubinary);
>>>> +
>>>>    return Success;
>>>>  }
>>>>
>>>>
>>>> Modified: cfe/trunk/test/CodeGenCUDA/device-stub.cu
>>>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/test/CodeGenCUDA/device-stub.cu?rev=236765&r1=236764&r2=236765&view=diff
>>>> ==============================================================================
>>>> --- cfe/trunk/test/CodeGenCUDA/device-stub.cu (original)
>>>> +++ cfe/trunk/test/CodeGenCUDA/device-stub.cu Thu May 7 14:34:16 2015
>>>> @@ -1,7 +1,21 @@
>>>> -// RUN: %clang_cc1 -emit-llvm %s -o - | FileCheck %s
>>>> +// RUN: %clang_cc1 -emit-llvm %s -fcuda-include-gpubinary %s -o - | FileCheck %s
>>>>
>>>>  #include "Inputs/cuda.h"
>>>>
>>>> +// Make sure that all parts of GPU code init/cleanup are there:
>>>> +// * constant unnamed string with the kernel name
>>>> +// CHECK: private unnamed_addr constant{{.*}}kernelfunc{{.*}}\00", align 1
>>>> +// * constant unnamed string with GPU binary
>>>> +// CHECK: private unnamed_addr constant{{.*}}\00"
>>>> +// * constant struct that wraps GPU binary
>>>> +// CHECK: @__cuda_fatbin_wrapper = internal constant { i32, i32, i8*, i8* }
>>>> +// CHECK: { i32 1180844977, i32 1, {{.*}}, i64 0, i64 0), i8* null }
>>>> +// * variable to save GPU binary handle after initialization
>>>> +// CHECK: @__cuda_gpubin_handle = internal global i8** null
>>>> +// * Make sure our constructor/destructor was added to global ctor/dtor list.
>>>> +// CHECK: @llvm.global_ctors = appending global {{.*}}@__cuda_module_ctor
>>>> +// CHECK: @llvm.global_dtors = appending global {{.*}}@__cuda_module_dtor
>>>> +
>>>>  // Test that we build the correct number of calls to cudaSetupArgument followed
>>>>  // by a call to cudaLaunch.
>>>>
>>>> @@ -11,3 +25,28 @@
>>>>  // CHECK: call{{.*}}cudaSetupArgument
>>>>  // CHECK: call{{.*}}cudaLaunch
>>>>  __global__ void kernelfunc(int i, int j, int k) {}
>>>> +
>>>> +// Test that we've built correct kernel launch sequence.
>>>> +// CHECK: define{{.*}}hostfunc
>>>> +// CHECK: call{{.*}}cudaConfigureCall
>>>> +// CHEKC: call{{.*}}kernelfunc
>>>> +void hostfunc(void) { kernelfunc<<<1, 1>>>(1, 1, 1); }
>>>> +
>>>> +// Test that we've built a function to register kernels
>>>> +// CHECK: define internal void @__cuda_register_kernels
>>>> +// CHECK: call{{.*}}cudaRegisterFunction(i8** %0, {{.*}}kernelfunc
>>>> +
>>>> +// Test that we've built contructor..
>>>> +// CHECK: define internal void @__cuda_module_ctor
>>>> +// .. that calls __cudaRegisterFatBinary(&__cuda_fatbin_wrapper)
>>>> +// CHECK: call{{.*}}cudaRegisterFatBinary{{.*}}__cuda_fatbin_wrapper
>>>> +// .. stores return value in __cuda_gpubin_handle
>>>> +// CHECK-NEXT: store{{.*}}__cuda_gpubin_handle
>>>> +// .. and then calls __cuda_register_kernels
>>>> +// CHECK-NEXT: call void @__cuda_register_kernels
>>>> +
>>>> +// Test that we've created destructor.
>>>> +// CHECK: define internal void @__cuda_module_dtor
>>>> +// CHECK: load{{.*}}__cuda_gpubin_handle
>>>> +// CHECK-NEXT: call void @__cudaUnregisterFatBinary
>>>> +
>>>>
>>>>
>>>> _______________________________________________
>>>> cfe-commits mailing list
>>>> [email protected]
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
>>>>
>>>
>>>
>>
>>
>> --
>> --Artem Belevich
>>
>
>
>
> --
> --Artem Belevich
>

--
--Artem Belevich
_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
