================ @@ -1614,6 +1650,50 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::createParallel( IfCondition, NumThreads, PrivTID, PrivTIDAddr, ThreadID, ToBeDeletedVec); }; + + std::optional<omp::OMPTgtExecModeFlags> ExecMode = + getTargetKernelExecMode(*OuterFn); + + // If OuterFn is not a Generic kernel, skip custom allocation. This causes + // the CodeExtractor to follow its default behavior. Otherwise, we need to + // use device shared memory to allocate argument structures. + if (ExecMode && *ExecMode & OMP_TGT_EXEC_MODE_GENERIC) { + OI.CustomArgAllocatorCB = [this, + EntryBB](BasicBlock *, BasicBlock::iterator, + Type *ArgTy, const Twine &Name) { + // Instead of using the insertion point provided by the CodeExtractor, + // here we need to use the block that eventually calls the outlined + // function for the `parallel` construct. + // + // The reason is that the explicit deallocation call will be inserted + // within the outlined function, whereas the alloca insertion point + // might actually be located somewhere else in the caller. This becomes + // a problem when e.g. `parallel` is inside of a `distribute` construct, + // because the deallocation would be executed multiple times and the + // allocation just once (outside of the loop). + // + // TODO: Ideally, we'd want to do the allocation and deallocation + // outside of the `parallel` outlined function, hence using here the + // insertion point provided by the CodeExtractor. We can't do this at + // the moment because there is currently no way of passing an eligible + // insertion point for the explicit deallocation to the CodeExtractor, + // as that block is created (at least when nested inside of + // `distribute`) sometime after createParallel() completed, so it can't + // be stored in the OutlineInfo structure here. ---------------- skatrak wrote:
Did you have any place in mind as to where a pointer to that block could be stored? The problem I see is that we'd have to link it to the exit block that's paired to the alloca block, which may have been created by the translation of any operation somewhere N levels up the call stack. I spent about a week trying to get a proper deallocation block to be paired to the `OI.OuterAllocaBB`, and here's a summary of what I found: - the insertion point used to figure out what the allocation block should be is passed to the `createParallel` call; - this is obtained in `convertOmpParallel`, during MLIR to LLVM IR translation, by looking at the nearest `OpenMPAllocaStackFrame` or by using the function's entry block if no eligible stack frames are found; - `OpenMPAllocaStackFrame`s are created right before translating the body of some operations, which is e.g. the case of `omp.distribute`, and they will point to a block that has been expressly created for child ops to perform allocations; - other entry/exit blocks might be introduced in between, since not all operations push new `OpenMPAllocaStackFrame`s, so by the time we get to `OpenMPIRBuilder::createParallel` we have no way of knowing what block is supposed to be the exit for the given outer alloca block; and - a pointer to the eligible deallocation block associated to the alloca block is only currently available at the point where both are created (see e.g. `OpenMPIRBuilder::createDistribute`). So, after this previous investigation and looking at it again with fresh eyes, I think that one way we could make this work would be to: 1. Extend `OpenMPAllocaStackFrame` to hold alloc and dealloc insertion points. 2. Add to `OpenMPIRBuilder::BodyGenCallbackTy` (and potentially to other similar callback types) an additional insert point argument for deallocations. 3. Update body callbacks creating an `OpenMPAllocaStackFrame` in MLIR to LLVM IR to store the new deallocation insertion point argument into the stack frame. 4. Modify OMPIRBuilder translation functions taking the aforementioned callback(s) to pass their exit blocks as deallocation points when generating their body. 5. Add a new `OuterDeallocBB` to `OpenMPIRBuilder::OutlineInfo` and pass it as the deallocation block to the `CodeExtractor` constructor in `OpenMPIRBuilder::finalize`. 6. Update `OpenMPIRBuilder::createParallel` to take the new `deallocIP` returned by `findAllocaInsertPoint` and store it as the `OI.OuterDeallocBB`. I think that should work for Flang and let us use the `CodeExtractor`-provided insert point for both allocations and deallocations, while also doing this in the right spot (i.e. not inside of a loop whenever possible). However, there's also Clang uses of the OMPIRBuilder to contend with, and a quick look tells me we won't have easy access to an exit block as we do for Flang, though I'm not too familiar with it. As it stands, Clang won't run this one code path where we actually use that deallocation insertion point, so passing `null` around and having a sane default behavior will work, but eventually we'd have to deal with that issue. Do you have an opinion on this? I can work on those changes as another patch in the stack to hopefully make things a bit better or I could make the changes here, though I think such fundamental changes are better done as an independent PR. Or maybe there's a better way I haven't thought about, let me know. https://github.com/llvm/llvm-project/pull/150925 _______________________________________________ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits