Hi,

I just managed to integrate libWFV into pocl and got the first few results.

Be aware that these are just first measurements under non-reproducible 
circumstances; no final conclusions should be drawn from them!

The benchmarks I chose were simply those that worked immediately; I 
didn't look at the generated code at all. Be aware that these are 
mostly benchmarks that are not really well suited for vectorization. 
I don't recall which of the ones below the Intel driver refuses to 
vectorize, but it's pretty clear that for many of them, vectorization 
would better be disabled.

            | pocl-orig | pocl-wfv | Intel | AMD  | WFVOpenCL |
------------|-----------|----------|-------|------|-----------|
BitonicSort | 0.22      | 0.38     | 0.67  | 0.84 | 0.12      |
DCT         | 0.72      | 0.39     | 0.31  | 0.55 | 0.42      |
FastWalshTr.| 1.0       | 1.1      | 1.1   | 1.3  | 1.1       |
FloydWarsh. | 0.4       | 0.59     | 0.49  | 2.1  | 0.55      |
Histogram   | 0.31      | 0.26     | 0.29  | 0.33 | 0.36      |
------------|-----------|----------|-------|------|-----------|
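
To make the two pocl columns easier to compare, the ratio 
pocl-orig / pocl-wfv can be computed from the numbers above. This is 
only a small sketch: it assumes the values are kernel times where lower 
is better, and the names `times`, `wfv_speedup` and `speedups` are made 
up for illustration.

```python
# Kernel times copied from the table above: (pocl-orig, pocl-wfv).
# Assumption: lower is better, so orig / wfv > 1 means WFV helped.
times = {
    "BitonicSort": (0.22, 0.38),
    "DCT": (0.72, 0.39),
    "FastWalshTr.": (1.0, 1.1),
    "FloydWarsh.": (0.4, 0.59),
    "Histogram": (0.31, 0.26),
}


def wfv_speedup(orig, wfv):
    """Return how many times faster pocl-wfv is than pocl-orig."""
    return orig / wfv


speedups = {name: wfv_speedup(o, w) for name, (o, w) in times.items()}

for name, s in speedups.items():
    verdict = "faster" if s > 1.0 else "slower"
    print(f"{name:<13} {s:.2f}x ({verdict} with WFV)")
```

With these numbers, only DCT and Histogram come out faster with WFV; 
the other three benchmarks are slower than plain pocl.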

There's one other bad thing that I just noticed: these numbers are 
kernel times with pocl reusing the compilation results. If I measure 
only a single run after deleting the temporary files, pocl is *really* 
slow (roughly 1.5-3 times slower). This suggests that the implementation 
suffers a lot from its use of scripts and command-line tools like opt, 
and thus from disk I/O.
Still, the raw kernel performance looks really good.

Now I'm going to try to clean up the code and make the implementation 
recognize llvm.fmuladd and the builtin intrinsics (e.g. for sqrt), which 
currently cause a crash for benchmarks like Mandelbrot, 
BlackScholes, NBody, etc. :p.

On a side note: it was pretty easy to integrate my stuff. I just 
run a wrapper pass that invokes WFV on the kernel before all your custom 
transformations start, and adjust the loop induction variable increment 
of WILoops. It's currently only a hack, but it shouldn't be hard to make 
that code depend on an environment variable or build flag.

Cheers,
Ralf

_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel
