On 08/01/2018 09:11 PM, Cesar Philippidis wrote:
> On 08/01/2018 07:12 AM, Tom de Vries wrote:
>
>>>>> + gangs = grids * (blocks / warp_size);
>>>>
>>>> So, we launch with gangs == grids * workers ? Is that intentional?
>>>
>>> Yes. At least that's what I've been using in og8. Setting num_gangs =
>>> grids alone caused significant slow downs.
>>>
>>
>> Well, what you're saying here is: increasing num_gangs increases
>> performance.
>>
>> You don't explain why you multiply with workers specifically.
>
> I set it that way because I think the occupancy calculator is
> determining the occupancy of a single multiprocessor unit, rather than
> the entire GPU. Looking at the og8 code again, I had
>
> num_gangs = 2 * threads_per_sm / warp_size * dev_size
>
> which corresponds to
>
> 2 * grids * blocks / warp_size
>
I've done an experiment using the CUDA sample simpleOccupancy. The kernel
is small, so the blocks value returned is the maximum:
max_threads_per_block (1024). The grids value returned is 10, which I
tentatively interpret as num_dev * (max_threads_per_multi_processor /
blocks). [ Where num_dev == 5, and max_threads_per_multi_processor == 2048. ]
Substituting that into the og8 code, and equating
max_threads_per_multi_processor with threads_per_sm, I indeed get
num_gangs = 2 * grids * blocks / warp_size.
So with this extra information I see how you got there.
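[ To make the substitution concrete, here is a minimal sketch in C that
plugs in the numbers from the experiment above. The function names are
made up for illustration and are not taken from the plugin, and
warp_size == 32 is an assumption that the experiment did not report.
The grids formula is only my tentative interpretation, not a documented
property of the occupancy calculator. ]

```c
/* Sketch only: function names are hypothetical, not from the plugin.  */

/* Tentative interpretation of the grids value returned by the
   occupancy calculator.  */
int
guessed_grids (int num_dev, int max_threads_per_sm, int blocks)
{
  return num_dev * (max_threads_per_sm / blocks);
}

/* The og8 formula as quoted above.  */
int
og8_num_gangs (int threads_per_sm, int warp_size, int dev_size)
{
  return 2 * threads_per_sm / warp_size * dev_size;
}

/* The same quantity expressed in terms of grids and blocks.  */
int
grids_blocks_num_gangs (int grids, int blocks, int warp_size)
{
  return 2 * grids * blocks / warp_size;
}
```

With num_dev == 5, max_threads_per_multi_processor == 2048, blocks ==
1024 and the assumed warp_size == 32, guessed_grids gives 10, and both
og8_num_gangs and grids_blocks_num_gangs give 640.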
But I still see no rationale for why blocks is used here, and I wonder
whether something like num_gangs = grids * 64 would give similar results.
Anyway, given that this is what is used on og8, I'm ok with using that,
so let's go with:
...
gangs = 2 * grids * (blocks / warp_size);
...
[ so, including the factor of two you explicitly left out from the
original patch. Unless you see a pressing reason not to include it. ]
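[ For comparison, a small sketch in C of the three candidates discussed
in this thread. The function names are hypothetical, and warp_size == 32
is an assumption. For the simpleOccupancy numbers (grids == 10, blocks
== 1024), the variant with the factor of two and the grids * 64 variant
happen to coincide, since 2 * 1024 / 32 == 64; they would differ for
kernels where a smaller blocks value is returned. ]

```c
/* Sketch only: function names are hypothetical, not from the plugin.  */

/* Original patch: no factor of two.  */
int
gangs_patch (int grids, int blocks, int warp_size)
{
  return grids * (blocks / warp_size);
}

/* Proposed version: includes the factor of two used on og8.  */
int
gangs_with_factor (int grids, int blocks, int warp_size)
{
  return 2 * grids * (blocks / warp_size);
}

/* Alternative without blocks: a fixed multiplier of 64.  */
int
gangs_fixed_multiplier (int grids)
{
  return grids * 64;
}
```

For grids == 10 and blocks == 1024 these give 320, 640 and 640
respectively. Whether the last two perform similarly in practice is
something only testing can tell.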
Can you repost after retesting? [ note: the updated patch I posted
earlier doesn't apply on trunk anymore due to the cuda-lib.def change. ]
Thanks,
- Tom