On 08/01/2018 09:11 PM, Cesar Philippidis wrote:
> On 08/01/2018 07:12 AM, Tom de Vries wrote:
>
>>>>> +	  gangs = grids * (blocks / warp_size);
>>>>
>>>> So, we launch with gangs == grids * workers ? Is that intentional?
>>>
>>> Yes. At least that's what I've been using in og8. Setting num_gangs =
>>> grids alone caused significant slow downs.
>>>
>>
>> Well, what you're saying here is: increasing num_gangs increases
>> performance.
>>
>> You don't explain why you multiply with workers specifically.
>
> I set it that way because I think the occupancy calculator is
> determining the occupancy of a single multiprocessor unit, rather than
> the entire GPU. Looking at the og8 code again, I had
>
>    num_gangs = 2 * threads_per_sm / warp_size * dev_size
>
> which corresponds to
>
>    2 * grids * blocks / warp_size
>

I've done an experiment using the sample simpleOccupancy. The kernel is
small, so the blocks returned is the maximum: max_threads_per_block (1024).

The grids returned is 10, which I tentatively interpret as num_dev *
(max_threads_per_multi_processor / blocks). [ Where num_dev == 5, and
max_threads_per_multi_processor == 2048. ]

Substituting that into the og8 code, and equating
max_threads_per_multi_processor with threads_per_sm, I indeed get
num_gangs = 2 * grids * blocks / warp_size.

So with this extra information I see how you got there. But I still see
no rationale why blocks is used here, and I wonder whether something like
num_gangs = grids * 64 would give similar results.

Anyway, given that this is what is used on og8, I'm ok with using that,
so let's go with:
...
	  gangs = 2 * grids * (blocks / warp_size);
...
[ so, including the factor two you explicitly left out from the original
patch. Unless you see a pressing reason not to include it. ]

Can you repost after retesting? [ note: the updated patch I posted
earlier doesn't apply on trunk anymore due to the cuda-lib.def change. ]

Thanks,
- Tom
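
[ For reference, a minimal C sketch of the substitution described above.
It is not code from the patch or from og8. The function names are
hypothetical, and the model grids == num_dev *
(max_threads_per_multi_processor / blocks) is only the tentative
interpretation of the simpleOccupancy result, not documented behaviour. ]

```c
/* Hypothetical helpers, for illustration only; none of these names
   exist in the patch or in og8.  */

/* Tentative model of the grid size the occupancy calculator returns:
   one multiprocessor's worth of blocks, times the number of
   multiprocessors (num_dev, called dev_size in the og8 formula).  */
static int
model_grids (int num_dev, int threads_per_sm, int blocks)
{
  return num_dev * (threads_per_sm / blocks);
}

/* The og8 formula as quoted:
   num_gangs = 2 * threads_per_sm / warp_size * dev_size.  */
static int
og8_num_gangs (int threads_per_sm, int warp_size, int dev_size)
{
  return 2 * threads_per_sm / warp_size * dev_size;
}

/* The formula proposed for trunk:
   gangs = 2 * grids * (blocks / warp_size).  */
static int
proposed_gangs (int grids, int blocks, int warp_size)
{
  return 2 * grids * (blocks / warp_size);
}
```

With the values from the experiment (num_dev == 5,
max_threads_per_multi_processor == 2048, blocks == 1024) and a warp size
of 32, model_grids gives 10 and both formulas give 640. Because of the
integer divisions, the two only agree when blocks divides
threads_per_sm evenly; with blocks == 768, for instance, the proposed
formula gives 480 while the og8 one still gives 640.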