Hi All,

I submitted my proposal via the GSoC platform as a tiny-many project. Based on
Kirill's reply, I decided to work on a thread hierarchy manager. The PDF version
of the proposal can be found here: [1]. In short, my proposal combines dynamic
parallelism, creating extra threads in advance, and kernel splitting when
generating code for GPUs. Comments and suggestions are welcome.
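To make the first two strategies a bit more concrete, here is a rough,
hypothetical CUDA sketch. The kernel and variable names are invented for
illustration only; this is not code from my prototype or from GCC. The first
two kernels launch a child grid for the inner loop via dynamic parallelism;
the third kernel instead creates all the threads in advance on the host and
flattens both loops onto them.

// Strategy 1: dynamic parallelism -- every outer thread launches a child
// kernel for its inner loop (needs sm_35+ and relocatable device code).
__global__ void inner_child(float *a, int row, int m)
{
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j < m)
    a[row * m + j] *= 2.0f;        // inner loop body, coalesced in the child
}

__global__ void outer_dp(float *a, int n, int m)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    inner_child<<<(m + 255) / 256, 256>>>(a, i, m);
}

// Strategy 2: create the extra threads in advance -- the host launches
// n * m threads directly and each thread handles one (i, j) pair.
__global__ void flat_in_advance(float *a, int n, int m)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n * m)
    a[tid] *= 2.0f;                // same body, no device-side launch
}

The host would launch outer_dp with roughly n threads and flat_in_advance with
roughly n * m threads. Which one is faster depends on the child-launch overhead
versus the extra bookkeeping, which is exactly the decision the inspector
mentioned in the quoted mail below would have to make.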
Regards.

[1]: https://raw.githubusercontent.com/grypp/gcc-proposal-omp4/master/gsoc-gurayozen.pdf

Güray Özen
~grypp

2015-03-23 13:58 GMT+01:00 guray ozen <guray.o...@gmail.com>:
> Hi Kirill,
>
> Thread hierarchy management and creation policy is a very interesting
> topic for me as well. I came across that paper a couple of weeks ago.
> Creating more threads at the beginning and applying something like a
> busy-waiting or if-master algorithm generally works better than dynamic
> parallelism due to the overhead of dp. Moreover, the compiler might
> disable some optimizations when dp is enabled. The CUDA-NP paper [1] is
> also interesting with respect to managing threads, and its idea is very
> close: create more threads in advance instead of using dynamic
> parallelism. On the other hand, dp sometimes gives better performance,
> since it allows creating a new thread hierarchy.
>
> To clarify, I prepared two examples, one using dynamic parallelism and
> one creating more threads in advance.
> * (1st example) Dynamic parallelism gives the better result.
> * (2nd example) Creating more threads in advance gives the better result.
>
> 1st example: https://github.com/grypp/gcc-proposal-omp4/tree/master/prop0
> * (prop0.c) Has 4 nested loops.
> * (prop0.c:10) Puts a small array into shared memory.
> * The iteration counts of the first two loops are expressed explicitly;
>   even if they only become known at runtime, the PTX/SPIR can be changed.
> * The iteration counts of the last two loops are dynamic and depend on
>   the induction variables of the first two loops.
> * (prop0.c:24-28) The array accesses are very inefficient (non-coalesced).
>   - If we put "#pragma omp parallel for" at (prop0.c:21),
>     * it will create another kernel (prop0_dynamic.cu:34), and
>     * the array access pattern changes (prop0_dynamic.cu:48-52).
>
> Basically, the advantages of using dynamic parallelism here are:
> 1 - The array accesses become coalesced.
> 2 - We can get rid of the 3rd and 4th for loops, since we can create as
>     many threads as there are iterations (a small advantage in terms of
>     thread divergence).
>
> 2nd example: https://github.com/grypp/gcc-proposal-omp4/tree/master/prop1
> * Has 2 nested loops.
> * The innermost loop has a reduction.
> * I put up 3 possible generated CUDA code examples:
>   1 - prop1_baseline.cu: only cudaize prop1.c:8 and do not take
>       prop1.c:12 into account.
>   2 - prop1_createMoreThread.cu: create more threads for the innermost
>       loop, do the reduction with the extra threads, and communicate
>       through shared memory.
>   3 - prop1_dynamic.cu: create a child kernel and communicate through
>       global memory, but allocate the global memory in advance at
>       prop1_dynamic.cu:5.
>
> The full version of prop1 calculates n-body. I benchmarked it with my
> research compiler [2] and put the results here:
> https://github.com/grypp/gcc-proposal-omp4/blob/master/prop1/prop1-bench.pdf
> As can be seen from that figure, the 2nd kernel has the best performance.
>
> Comparing these two examples, my rough idea on this issue is that it
> might be a good idea to implement an inspector, using compiler analysis,
> to decide whether dynamic parallelism should be used or not. That way it
> is also possible to avoid the extra slowdown caused by the compiler
> disabling optimizations when dp is enabled. Besides, there are other
> cases where we can take advantage of dp, such as recursive algorithms.
> Moreover, using streams is possible, even though concurrency is not
> guaranteed (and it also causes overhead). In addition to this, I can
> work on the if-master or busy-waiting logic.
>
> I am really willing to work on thread hierarchy management and creation
> policy.
> If it is interesting for GCC, how can I progress on this topic?
>
> By the way, I haven't worked on #omp simd. It could be matched with
> warps (if there are no dependencies among the loop iterations). On the
> NVIDIA side, since threads in the same warp can read each other's data
> with __shfl, the data clauses could be used to enhance performance.
> (Not sure.)
>
> [1] - http://people.engr.ncsu.edu/hzhou/ppopp_14_1.pdf
> [2] - http://link.springer.com/chapter/10.1007%2F978-3-319-11454-5_16
>
> Güray Özen
> ~grypp
>
>
> 2015-03-20 15:47 GMT+01:00 Kirill Yukhin <kirill.yuk...@gmail.com>:
>> Hello Güray,
>>
>> On 20 Mar 12:14, guray ozen wrote:
>>> I've started to prepare my GSoC proposal for GCC's OpenMP for GPUs.
>> I think there is a wide range for exploration here. As you know, OpenMP 4
>> contains vectorization pragmas (`pragma omp simd') which do not perfectly
>> suit GPGPU.
>> Another problem is how to create threads dynamically on a GPGPU. As far
>> as we understand it, there are two possible solutions:
>> 1. Use the dynamic parallelism available in recent APIs (launch a new
>>    kernel from the target).
>> 2. Estimate the maximum thread number on the host and start them all
>>    from the host, making unused threads busy-wait.
>> There is a paper which investigates both approaches [1], [2].
>>
>>> However, I'm a little bit confused about which of the ideas I mentioned
>>> in my last mail I should propose, or which one of them is interesting
>>> for GCC. I'm willing to work on data clauses to enhance the performance
>>> of shared memory. Or maybe it might be interesting to work on the
>>> OpenMP 4.1 draft version. How do you think I should propose the idea?
>> We're going to work on OpenMP 4.1 offloading features.
>>
>> [1] - http://openmp.org/sc14/Booth-Sam-IBM.pdf
>> [2] - http://dl.acm.org/citation.cfm?id=2688364
>>
>> --
>> Thanks, K
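P.S. To make the __shfl idea and the "create more threads in advance" case
from the quoted mail concrete, here is a rough, hypothetical sketch of how the
innermost-loop reduction of the 2nd example could look: one block per row,
partial sums combined with __shfl_down inside a warp and with shared memory
across warps. The names and sizes are invented for illustration; this is not
the code from prop1_createMoreThread.cu.

// Hypothetical sketch: the extra threads are created in advance on the
// host (one block per row); each block reduces one row of 'a' into out[row].
__global__ void row_reduce(const float *a, float *out, int m)
{
  __shared__ float warp_sums[32];          // one slot per warp in the block
  int row  = blockIdx.x;
  int lane = threadIdx.x % warpSize;
  int warp = threadIdx.x / warpSize;

  // Each thread accumulates a strided part of the row.
  float sum = 0.0f;
  for (int j = threadIdx.x; j < m; j += blockDim.x)
    sum += a[row * m + j];

  // Reduce within the warp using register shuffles (no shared memory).
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    sum += __shfl_down(sum, offset);

  // Lane 0 of each warp writes its partial result to shared memory.
  if (lane == 0)
    warp_sums[warp] = sum;
  __syncthreads();

  // The first warp reduces the per-warp partial sums.
  if (warp == 0)
    {
      int nwarps = (blockDim.x + warpSize - 1) / warpSize;
      sum = (lane < nwarps) ? warp_sums[lane] : 0.0f;
      for (int offset = warpSize / 2; offset > 0; offset /= 2)
        sum += __shfl_down(sum, offset);
      if (lane == 0)
        out[row] = sum;
    }
}

A launch like row_reduce<<<n, 256>>>(a, out, m) would then correspond roughly
to the 2nd generated kernel of prop1, while a child-kernel launch would
correspond to the 3rd one; choosing between them is again the job of the
inspector described above.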