On 08/01/2018 09:11 PM, Cesar Philippidis wrote:
> On 08/01/2018 07:12 AM, Tom de Vries wrote:
> 
>>>>> +       gangs = grids * (blocks / warp_size);
>>>>
>>>> So, we launch with gangs == grids * workers ? Is that intentional?
>>>
>>> Yes. At least that's what I've been using in og8. Setting num_gangs =
>>> grids alone caused significant slow downs.
>>>
>>
>> Well, what you're saying here is: increasing num_gangs increases
>> performance.
>>
>> You don't explain why you multiply with workers specifically.
> 
> I set it that way because I think the occupancy calculator is
> determining the occupancy of a single multiprocessor unit, rather than
> the entire GPU. Looking at the og8 code again, I had
> 
>    num_gangs = 2 * threads_per_sm / warp_size * dev_size
> 
> which corresponds to
> 
>    2 * grids * blocks / warp_size
> 

I ran an experiment with the CUDA sample simpleOccupancy. The kernel is
small, so the blocks value returned is the maximum: max_threads_per_block
(1024).

The grids value returned is 10, which I tentatively interpret as num_dev *
(max_threads_per_multi_processor / blocks). [ Where num_dev == 5, and
max_threads_per_multi_processor == 2048. ]

Substituting that into the og8 code, and equating
max_threads_per_multi_processor with threads_per_sm, I indeed get

num_gangs = 2 * grids * blocks / warp_size.

So with this extra information I see how you got there.
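
[ To spell out the arithmetic, here is a minimal sketch in C. It assumes
warp_size == 32, which is not stated above, and it relies on my tentative
interpretation of grids. The function names are made up for illustration
and are not taken from the patch or from og8. ]

```c
#include <assert.h>

/* grids as tentatively interpreted above:
   num_dev * (max_threads_per_multi_processor / blocks).  */
int
interpreted_grids (int num_dev, int max_threads_per_sm, int blocks)
{
  assert (blocks > 0);
  return num_dev * (max_threads_per_sm / blocks);
}

/* The og8 formula as quoted:
   2 * threads_per_sm / warp_size * dev_size.  */
int
og8_num_gangs (int threads_per_sm, int warp_size, int dev_size)
{
  assert (warp_size > 0);
  return 2 * threads_per_sm / warp_size * dev_size;
}

/* The rewritten form: 2 * grids * blocks / warp_size.  */
int
rewritten_num_gangs (int grids, int blocks, int warp_size)
{
  assert (warp_size > 0);
  return 2 * grids * blocks / warp_size;
}
```

With the simpleOccupancy numbers (num_dev == 5, threads_per_sm == 2048,
blocks == 1024) and the assumed warp_size of 32, interpreted_grids gives 10,
and both og8_num_gangs and rewritten_num_gangs give 640.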

But I still see no rationale for using blocks here, and I wonder
whether something like num_gangs = grids * 64 would give similar results.
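
[ For comparison, a small sketch in C of the two candidates. It again
assumes warp_size == 32 and the tentative interpretation of grids as
num_dev * (max_threads_per_multi_processor / blocks), neither of which is
confirmed. The function names are hypothetical. ]

```c
#include <assert.h>

/* Tentative interpretation of the grids value returned by the occupancy
   calculator; an assumption, not confirmed behavior.  */
int
grids_for (int num_dev, int max_threads_per_sm, int blocks)
{
  assert (blocks > 0);
  return num_dev * (max_threads_per_sm / blocks);
}

/* Candidate 1: 2 * grids * (blocks / warp_size).  */
int
scaled_gangs (int grids, int blocks, int warp_size)
{
  assert (warp_size > 0);
  return 2 * grids * (blocks / warp_size);
}

/* Candidate 2: grids * 64, a fixed factor independent of blocks.  */
int
fixed_factor_gangs (int grids)
{
  return grids * 64;
}
```

For blocks == 1024 the two agree, since 2 * (1024 / 32) == 64: both give
640 for grids == 10. They only diverge for kernels where a smaller blocks
is returned. Under the assumptions above, blocks == 256 would give
grids == 40, so candidate 1 stays at 640 while candidate 2 gives 2560.
Whether that difference matters for performance would need measuring.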

Anyway, given that this is what is used on og8, I'm ok with using that,
so let's go with:
...
              gangs = 2 * grids * (blocks / warp_size);
...
[ so, including the factor two you explicitly left out from the original
patch. Unless you see a pressing reason not to include it. ]

Can you repost after retesting? [ note: the updated patch I posted
earlier doesn't apply on trunk anymore due to the cuda-lib.def change. ]

Thanks,
- Tom
