Hi All,

I submitted my proposal via the GSoC platform as a tiny-many project. Based on
Kirill's reply, I decided to work on a thread hierarchy manager. The PDF version
of the proposal can be found here: [1]. In short, my proposal combines dynamic
parallelism, creating extra threads in advance, and kernel splitting when
generating code for GPUs. Comments and suggestions are welcome.
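To make the first two strategies a bit more concrete, here is a rough,
hypothetical CUDA sketch. The kernel and variable names are invented for
illustration only; this is not code from my prototype or from GCC. The first
two kernels launch a child grid for the inner loop via dynamic parallelism;
the third kernel instead creates all the threads in advance on the host and
flattens both loops onto them.

// Strategy 1: dynamic parallelism -- every outer thread launches a child
// kernel for its inner loop (needs sm_35+ and relocatable device code).
__global__ void inner_child(float *a, int row, int m)
{
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j < m)
    a[row * m + j] *= 2.0f;        // inner loop body, coalesced in the child
}

__global__ void outer_dp(float *a, int n, int m)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    inner_child<<<(m + 255) / 256, 256>>>(a, i, m);
}

// Strategy 2: create the extra threads in advance -- the host launches
// n * m threads directly and each thread handles one (i, j) pair.
__global__ void flat_in_advance(float *a, int n, int m)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n * m)
    a[tid] *= 2.0f;                // same body, no device-side launch
}

The host would launch outer_dp with roughly n threads and flat_in_advance with
roughly n * m threads. Which one is faster depends on the child-launch overhead
versus the extra bookkeeping, which is exactly the decision the inspector
mentioned in the quoted mail below would have to make.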
Regards.

[1]: https://raw.githubusercontent.com/grypp/gcc-proposal-omp4/master/gsoc-gurayozen.pdf

Güray Özen
~grypp

2015-03-23 13:58 GMT+01:00 guray ozen <guray.o...@gmail.com>:
> Hi Kirill,
>
> Thread hierarchy management and creation policy is a very interesting
> topic for me as well. I came across that paper a couple of weeks ago.
> Creating more threads at the beginning and applying something like a
> busy-waiting or if-master algorithm generally works better than dynamic
> parallelism due to the overhead of dp. Moreover, the compiler might
> disable some optimizations when dp is enabled. The CUDA-NP paper [1] is
> also interesting with respect to managing threads, and its idea is very
> close: create more threads in advance instead of using dynamic
> parallelism. On the other hand, dp sometimes gives better performance,
> since it allows creating a new thread hierarchy.
>
> To clarify, I prepared two examples, one using dynamic parallelism and
> one creating more threads in advance.
> * (1st example) Dynamic parallelism gives the better result.
> * (2nd example) Creating more threads in advance gives the better result.
>
> 1st example: https://github.com/grypp/gcc-proposal-omp4/tree/master/prop0
> * (prop0.c) Has 4 nested loops.
> * (prop0.c:10) Puts a small array into shared memory.
> * The iteration counts of the first two loops are expressed explicitly;
>   even if they only become known at runtime, the PTX/SPIR can be changed.
> * The iteration counts of the last two loops are dynamic and depend on
>   the induction variables of the first two loops.
> * (prop0.c:24-28) The array accesses are very inefficient (non-coalesced).
>   - If we put "#pragma omp parallel for" at (prop0.c:21),
>     * it will create another kernel (prop0_dynamic.cu:34), and
>     * the array access pattern changes (prop0_dynamic.cu:48-52).
>
> Basically, the advantages of using dynamic parallelism here are:
> 1 - The array accesses become coalesced.
> 2 - We can get rid of the 3rd and 4th for loops, since we can create as
>     many threads as there are iterations (a small advantage in terms of
>     thread divergence).
>
> 2nd example: https://github.com/grypp/gcc-proposal-omp4/tree/master/prop1
> * Has 2 nested loops.
> * The innermost loop has a reduction.
> * I put up 3 possible generated CUDA code examples:
>   1 - prop1_baseline.cu: only cudaize prop1.c:8 and do not take
>       prop1.c:12 into account.
>   2 - prop1_createMoreThread.cu: create more threads for the innermost
>       loop, do the reduction with the extra threads, and communicate
>       through shared memory.
>   3 - prop1_dynamic.cu: create a child kernel and communicate through
>       global memory, but allocate the global memory in advance at
>       prop1_dynamic.cu:5.
>
> The full version of prop1 calculates n-body. I benchmarked it with my
> research compiler [2] and put the results here:
> https://github.com/grypp/gcc-proposal-omp4/blob/master/prop1/prop1-bench.pdf
> As can be seen from that figure, the 2nd kernel has the best performance.
>
> Comparing these two examples, my rough idea on this issue is that it
> might be a good idea to implement an inspector, using compiler analysis,
> to decide whether dynamic parallelism should be used or not. That way it
> is also possible to avoid the extra slowdown caused by the compiler
> disabling optimizations when dp is enabled. Besides, there are other
> cases where we can take advantage of dp, such as recursive algorithms.
> Moreover, using streams is possible, even though concurrency is not
> guaranteed (and it also causes overhead). In addition to this, I can
> work on the if-master or busy-waiting logic.
>
> I am really willing to work on thread hierarchy management and creation
> policy.
> If it is interesting for GCC, how can I progress on this topic?
>
> By the way, I haven't worked on #omp simd. It could be matched with
> warps (if there are no dependencies among the loop iterations). On the
> NVIDIA side, since threads in the same warp can read each other's data
> with __shfl, the data clauses could be used to enhance performance.
> (Not sure.)
>
> [1] - http://people.engr.ncsu.edu/hzhou/ppopp_14_1.pdf
> [2] - http://link.springer.com/chapter/10.1007%2F978-3-319-11454-5_16
>
> Güray Özen
> ~grypp
>
>
> 2015-03-20 15:47 GMT+01:00 Kirill Yukhin <kirill.yuk...@gmail.com>:
>> Hello Güray,
>>
>> On 20 Mar 12:14, guray ozen wrote:
>>> I've started to prepare my GSoC proposal for GCC's OpenMP for GPUs.
>> I think there is a wide range for exploration here. As you know, OpenMP 4
>> contains vectorization pragmas (`pragma omp simd') which do not perfectly
>> suit GPGPU.
>> Another problem is how to create threads dynamically on a GPGPU. As far
>> as we understand it, there are two possible solutions:
>> 1. Use the dynamic parallelism available in recent APIs (launch a new
>>    kernel from the target).
>> 2. Estimate the maximum thread number on the host and start them all
>>    from the host, making unused threads busy-wait.
>> There is a paper which investigates both approaches [1], [2].
>>
>>> However, I'm a little bit confused about which of the ideas I mentioned
>>> in my last mail I should propose, or which one of them is interesting
>>> for GCC. I'm willing to work on data clauses to enhance the performance
>>> of shared memory. Or maybe it might be interesting to work on the
>>> OpenMP 4.1 draft version. How do you think I should propose the idea?
>> We're going to work on OpenMP 4.1 offloading features.
>>
>> [1] - http://openmp.org/sc14/Booth-Sam-IBM.pdf
>> [2] - http://dl.acm.org/citation.cfm?id=2688364
>>
>> --
>> Thanks, K
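P.S. To make the __shfl idea and the "create more threads in advance" case
from the quoted mail concrete, here is a rough, hypothetical sketch of how the
innermost-loop reduction of the 2nd example could look: one block per row,
partial sums combined with __shfl_down inside a warp and with shared memory
across warps. The names and sizes are invented for illustration; this is not
the code from prop1_createMoreThread.cu.

// Hypothetical sketch: the extra threads are created in advance on the
// host (one block per row); each block reduces one row of 'a' into out[row].
__global__ void row_reduce(const float *a, float *out, int m)
{
  __shared__ float warp_sums[32];          // one slot per warp in the block
  int row  = blockIdx.x;
  int lane = threadIdx.x % warpSize;
  int warp = threadIdx.x / warpSize;

  // Each thread accumulates a strided part of the row.
  float sum = 0.0f;
  for (int j = threadIdx.x; j < m; j += blockDim.x)
    sum += a[row * m + j];

  // Reduce within the warp using register shuffles (no shared memory).
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    sum += __shfl_down(sum, offset);

  // Lane 0 of each warp writes its partial result to shared memory.
  if (lane == 0)
    warp_sums[warp] = sum;
  __syncthreads();

  // The first warp reduces the per-warp partial sums.
  if (warp == 0)
    {
      int nwarps = (blockDim.x + warpSize - 1) / warpSize;
      sum = (lane < nwarps) ? warp_sums[lane] : 0.0f;
      for (int offset = warpSize / 2; offset > 0; offset /= 2)
        sum += __shfl_down(sum, offset);
      if (lane == 0)
        out[row] = sum;
    }
}

A launch like row_reduce<<<n, 256>>>(a, out, m) would then correspond roughly
to the 2nd generated kernel of prop1, while a child-kernel launch would
correspond to the 3rd one; choosing between them is again the job of the
inspector described above.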