On 02/01/2013 06:52 PM, Ralf Karrenberg wrote:
> In general, it does similar things, yes. However, it does not have any
> unnecessary overhead. setjmp/longjmp for example have to store the entire
> context, whereas we only store live variables. Our approach is much more like
> yours and MCUDA's, just that it does not artificially introduce additional
> barriers. Of course, if you can unroll the entire local loop, things are
> definitely more efficient - but I doubt that this is a frequent use case, or
> am I wrong with that?

Yes, full unrolling is not feasible with big loops. For those the "loop
interchange approach" would be nice, where the WI replication/looping is
done inside the loop (as with loops that include barriers).

>> My plan in the 'loopvec' approach was to leave this type of decisions
>> to the LLVM vectorizer's cost metrics. Just produce it nicely vectorizable
>> input to digest.
>
> Generally a very reasonable idea, the loop vectorizer is just not far enough
> for this yet.

Yep, as far as I can see it needs at least a) loop interchange (to handle
the above case efficiently), b) vec2scalar splitting (so it does not choke
on the kernels' vector datatypes), and c) intrinsics vectorization, in
addition to the parallel loop recognition that I'm currently trying to
upstream.

> As promised on IRC, I attached two graphs with some measurements I did a few
> weeks ago with different drivers.
> An asterisk marks benchmarks that Intel'12 chooses not to vectorize. Our
> driver vectorizes everything; AMD to my knowledge has no vectorization
> implemented. All benchmarks were modified not to use any OpenCL vector types
> in order to leave all vectorization to the driver.
> pocl looks pretty good given that the "unroll as much as possible" approach
> seems not to be really well suited for a "normal" x86 architecture and there
> is no vectorization so far.

Yes, surprisingly good. You might want to try with
POCL_VECTORIZE_WORK_GROUPS=1, which uses a modified bb-vectorizer from LLVM.
Its parameters are tuned to the TCE target, though, so it might vectorize
too much and produce lots of overhead instructions. It would be interesting
to see where pocl gets with that if the parameters are tuned more sensibly
(Vlado might have ideas?).
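For reference, it is a plain environment variable, so it can be flipped per run; the benchmark binary name below is just a placeholder for your own test program:

```shell
# Enable pocl's bb-vectorizer-based work-group vectorization for this shell.
export POCL_VECTORIZE_WORK_GROUPS=1
echo "POCL_VECTORIZE_WORK_GROUPS=$POCL_VECTORIZE_WORK_GROUPS"
# Then run the benchmark as usual (placeholder binary name):
# POCL_VECTORIZE_WORK_GROUPS=1 ./my_ocl_benchmark
```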

-- 
--Pekka


_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel
