estewart08 added a comment.

In D99432#2728788 <https://reviews.llvm.org/D99432#2728788>, @ABataev wrote:

> In D99432#2726997 <https://reviews.llvm.org/D99432#2726997>, @estewart08 
> wrote:
>
>> In D99432#2726845 <https://reviews.llvm.org/D99432#2726845>, @ABataev wrote:
>>
>>> In D99432#2726588 <https://reviews.llvm.org/D99432#2726588>, @estewart08 
>>> wrote:
>>>
>>>> In D99432#2726391 <https://reviews.llvm.org/D99432#2726391>, @ABataev 
>>>> wrote:
>>>>
>>>>> In D99432#2726337 <https://reviews.llvm.org/D99432#2726337>, @estewart08 
>>>>> wrote:
>>>>>
>>>>>> In D99432#2726060 <https://reviews.llvm.org/D99432#2726060>, @ABataev 
>>>>>> wrote:
>>>>>>
>>>>>>> In D99432#2726050 <https://reviews.llvm.org/D99432#2726050>, 
>>>>>>> @estewart08 wrote:
>>>>>>>
>>>>>>>> In D99432#2726025 <https://reviews.llvm.org/D99432#2726025>, @ABataev 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> In D99432#2726019 <https://reviews.llvm.org/D99432#2726019>, 
>>>>>>>>> @estewart08 wrote:
>>>>>>>>>
>>>>>>>>>> In reference to https://bugs.llvm.org/show_bug.cgi?id=48851, I do 
>>>>>>>>>> not see how this helps SPMD mode with team privatization of 
>>>>>>>>>> declarations in-between target teams and parallel regions.
>>>>>>>>>
>>>>>>>>> Did you try the reproducer with the applied patch?
>>>>>>>>
>>>>>>>> Yes, I still saw the test fail, although it was not with latest 
>>>>>>>> llvm-project. Are you saying the reproducer passes for you?
>>>>>>>
>>>>>>> I don't have CUDA installed, but from what I see in the LLVM IR it 
>>>>>>> should pass. Do you have a debug log? Does it crash or produce 
>>>>>>> incorrect results?
>>>>>>
>>>>>> This is on an AMDGPU, but I assume the behavior would be similar for 
>>>>>> NVPTX.
>>>>>>
>>>>>> It produces incorrect/incomplete results in the dist[0] index after a 
>>>>>> manual reduction, and in turn the final global gpu_results array is 
>>>>>> incorrect.
>>>>>> When thread 0 does the reduction into dist[0], it has no knowledge of 
>>>>>> dist[1] having been updated by thread 1, which tells me the array is 
>>>>>> still thread-private.
>>>>>> Adding some printfs, looking at one team's output:
>>>>>>
>>>>>> SPMD
>>>>>>
>>>>>>   Thread 0: dist[0]: 1
>>>>>>   Thread 0: dist[1]: 0  // This should be 1
>>>>>>   After reduction into dist[0]: 1  // This should be 2
>>>>>>   gpu_results = [1,1]  // [2,2] expected
>>>>>>
>>>>>> Generic Mode:
>>>>>>
>>>>>>   Thread 0: dist[0]: 1
>>>>>>   Thread 0: dist[1]: 1   
>>>>>>   After reduction into dist[0]: 2
>>>>>>   gpu_results = [2,2]
>>>>>
>>>>> Hmm, I would expect a crash if the array was allocated in the local 
>>>>> memory. Could you try to add some more printfs (with data and addresses 
>>>>> of the array) to check the results? Maybe there is a data race somewhere 
>>>>> in the code?
>>>>
>>>> As a reminder, each thread updates a unique index in the dist array and 
>>>> each team updates a unique index in gpu_results.
>>>>
>>>> SPMD - shows each thread has a unique address for the dist array
>>>>
>>>>   Team 0 Thread 1: dist[0]: 0, 0x7f92e24a8bf8
>>>>   Team 0 Thread 1: dist[1]: 1, 0x7f92e24a8bfc
>>>>   
>>>>   Team 0 Thread 0: dist[0]: 1, 0x7f92e24a8bf0
>>>>   Team 0 Thread 0: dist[1]: 0, 0x7f92e24a8bf4
>>>>   
>>>>   Team 0 Thread 0: After reduction into dist[0]: 1
>>>>   Team 0 Thread 0: gpu_results address: 0x7f92a5000000
>>>>   --------------------------------------------------
>>>>   Team 1 Thread 1: dist[0]: 0, 0x7f92f9ec5188
>>>>   Team 1 Thread 1: dist[1]: 1, 0x7f92f9ec518c
>>>>   
>>>>   Team 1 Thread 0: dist[0]: 1, 0x7f92f9ec5180
>>>>   Team 1 Thread 0: dist[1]: 0, 0x7f92f9ec5184
>>>>   
>>>>   Team 1 Thread 0: After reduction into dist[0]: 1
>>>>   Team 1 Thread 0: gpu_results address: 0x7f92a5000000
>>>>   
>>>>   gpu_results[0]: 1
>>>>   gpu_results[1]: 1
>>>>
>>>> Generic - shows the threads in each team share the same dist array address
>>>>
>>>>   Team 0 Thread 1: dist[0]: 1, 0x7fac01938880
>>>>   Team 0 Thread 1: dist[1]: 1, 0x7fac01938884
>>>>   
>>>>   Team 0 Thread 0: dist[0]: 1, 0x7fac01938880
>>>>   Team 0 Thread 0: dist[1]: 1, 0x7fac01938884
>>>>   
>>>>   Team 0 Thread 0: After reduction into dist[0]: 2
>>>>   Team 0 Thread 0: gpu_results address: 0x7fabc5000000
>>>>   --------------------------------------------------
>>>>   Team 1 Thread 1: dist[0]: 1, 0x7fac19354e10
>>>>   Team 1 Thread 1: dist[1]: 1, 0x7fac19354e14
>>>>   
>>>>   Team 1 Thread 0: dist[0]: 1, 0x7fac19354e10
>>>>   Team 1 Thread 0: dist[1]: 1, 0x7fac19354e14
>>>>   
>>>>   Team 1 Thread 0: After reduction into dist[0]: 2
>>>>   Team 1 Thread 0: gpu_results address: 0x7fabc5000000
>>>
>>> Could you check if it works with the 
>>> `-fno-openmp-cuda-parallel-target-regions` option?
>>
>> Unfortunately that crashes:
>> llvm-project/llvm/lib/IR/Instructions.cpp:495: void 
>> llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, 
>> llvm::ArrayRef<llvm::Value*>, 
>> llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const llvm::Twine&): 
>> Assertion `(i >= FTy->getNumParams() || FTy->getParamType(i) == 
>> Args[i]->getType()) && "Calling a function with a bad signature!"' failed.
>
> Hmm, could you provide a full stack trace?

At this point I am not sure I want to dig into that crash, as our LLVM branch is 
not caught up to trunk.

I did build trunk and ran some tests on an sm_70:
- Without this patch: code fails with incomplete results
- Without this patch and with -fno-openmp-cuda-parallel-target-regions: code 
  fails with incomplete results

- With this patch: code fails with incomplete results (thread-private array)
Team 0 Thread 1: dist[0]: 0, 0x7c1e800000a8
Team 0 Thread 1: dist[1]: 1, 0x7c1e800000ac

Team 0 Thread 0: dist[0]: 1, 0x7c1e800000a0
Team 0 Thread 0: dist[1]: 0, 0x7c1e800000a4

Team 0 Thread 0: After reduction into dist[0]: 1
Team 0 Thread 0: gpu_results address: 0x7c1ebc800000

Team 1 Thread 1: dist[0]: 0, 0x7c1e816f27c8
Team 1 Thread 1: dist[1]: 1, 0x7c1e816f27cc

Team 1 Thread 0: dist[0]: 1, 0x7c1e816f27c0
Team 1 Thread 0: dist[1]: 0, 0x7c1e816f27c4

Team 1 Thread 0: After reduction into dist[0]: 1
Team 1 Thread 0: gpu_results address: 0x7c1ebc800000

gpu_results[0]: 1
gpu_results[1]: 1
FAIL

- With this patch and with -fno-openmp-cuda-parallel-target-regions: Pass
Team 0 Thread 1: dist[0]: 1, 0x7a5b56000018
Team 0 Thread 1: dist[1]: 1, 0x7a5b5600001c

Team 0 Thread 0: dist[0]: 1, 0x7a5b56000018
Team 0 Thread 0: dist[1]: 1, 0x7a5b5600001c

Team 0 Thread 0: After reduction into dist[0]: 2
Team 0 Thread 0: gpu_results address: 0x7a5afc800000

Team 1 Thread 1: dist[0]: 1, 0x7a5b56000018
Team 1 Thread 1: dist[1]: 1, 0x7a5b5600001c

Team 1 Thread 0: dist[0]: 1, 0x7a5b56000018
Team 1 Thread 0: dist[1]: 1, 0x7a5b5600001c

Team 1 Thread 0: After reduction into dist[0]: 2
Team 1 Thread 0: gpu_results address: 0x7a5afc800000

gpu_results[0]: 2
gpu_results[1]: 2
PASS

I am concerned about team 0 and team 1 having the same address for the dist 
array here.
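
For reference, the pattern being tested looks roughly like the sketch below. This 
is a minimal approximation based on this discussion, not the exact PR48851 
reproducer; the num_teams/thread_limit values and the exact pragmas are my 
assumptions. dist is declared in between the target teams and parallel regions, 
each thread writes its own dist index, and thread 0 then does the manual 
reduction into dist[0] before storing the team's result in gpu_results.

  // Minimal sketch of the pattern (an approximation, not the exact PR48851
  // reproducer). dist is declared between the "target teams" and "parallel"
  // regions, so it must be shared within each team rather than thread-private.
  #include <omp.h>
  #include <stdio.h>

  #define NUM_TEAMS 2
  #define NUM_THREADS 2

  int main(void) {
    int gpu_results[NUM_TEAMS] = {0};

    #pragma omp target teams num_teams(NUM_TEAMS) thread_limit(NUM_THREADS) \
        map(tofrom: gpu_results)
    {
      int dist[NUM_THREADS] = {0};   // team-local, between the two regions

      #pragma omp parallel num_threads(NUM_THREADS)
      {
        int tid = omp_get_thread_num();
        dist[tid] += 1;              // each thread updates a unique index

        #pragma omp barrier
        if (tid == 0) {
          // Manual reduction by thread 0; it must see thread 1's update.
          for (int i = 1; i < NUM_THREADS; ++i)
            dist[0] += dist[i];
          gpu_results[omp_get_team_num()] = dist[0];   // expect 2 per team
        }
      }
    }

    for (int i = 0; i < NUM_TEAMS; ++i)
      printf("gpu_results[%d]: %d\n", i, gpu_results[i]);
    return 0;
  }

For the reduction into dist[0] to be correct, dist has to behave as a team-shared 
array; if it stays thread-private in SPMD mode, thread 0 never sees thread 1's 
update, which matches the failing output above.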


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99432/new/

https://reviews.llvm.org/D99432
