Hi Yue Hu,

On 06/13/2013 01:07 AM, Yue Hu wrote:
> However, no matter how much computation is allocated to the CPU, it has a time
> overhead of about 3 seconds. For better illustration, I listed the code that
> uses the CPU to do the whole computation below. The red lines are the execution
> times I measured for a 1024x1024 matrix multiplication case. I also tested
> other matrix sizes; the time overhead is still about 3 seconds.
My first guess is the compilation overhead. You can verify this by running the kernel again in the same program: the second time the compilation overhead should not be there, thanks to the caching of the compilation results (see the rough sketch appended at the end of this mail). This relates to the recent discussion of a proper compiler cache that stores the compilation results across program launches and has proper checking of cache entry validity.

Have you compiled your LLVM/Clang with optimizations on? An unoptimized LLVM can easily account for an order-of-magnitude kernel compiler slowdown.

There can be other surprising performance issues. E.g., I fixed one when I noticed that LLVM 3.3 produced fmuladd intrinsics automatically, and on my machine those intrinsics were converted to uninlineable math library calls, which caused significant performance regressions in some cases.

I use the 'valgrind' tool to produce execution profiles. This is something you might want to do as well, unless it's clearly the compilation overhead you are seeing:

valgrind --tool=cachegrind --trace-children=yes ./your_opencl_program

It then dumps several traces which you can visualize, e.g., with kcachegrind, to find the hot spots.

BR,
-- Pekka
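
P.S. To make the "run it twice" check above concrete, here is a minimal, untested sketch of the host-side timing. The helper names (check_compile_overhead, timed_launch, now_sec), the 2D 1024x1024 global size and the missing argument setup are just placeholders for whatever your test program already does; only the pattern of timing two consecutive launches of the same kernel in one process matters. If the first launch is roughly 3 seconds slower than the second, that difference is most likely the kernel compilation, which gets cached after the first run.

#include <stdio.h>
#include <time.h>
#include <CL/cl.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Enqueue the (already set up) kernel and wait for it to finish,
 * returning the wall-clock time of the whole launch. */
static double timed_launch(cl_command_queue queue, cl_kernel kernel,
                           const size_t *global)
{
    double start = now_sec();
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global,
                                        NULL, 0, NULL, NULL);
    if (err != CL_SUCCESS)
        fprintf(stderr, "enqueue failed: %d\n", err);
    clFinish(queue);   /* block until the kernel has really executed */
    return now_sec() - start;
}

/* Call this after the usual context/program/kernel/argument setup. */
void check_compile_overhead(cl_command_queue queue, cl_kernel kernel)
{
    const size_t global[2] = { 1024, 1024 };
    printf("first launch:  %.3f s\n", timed_launch(queue, kernel, global));
    printf("second launch: %.3f s\n", timed_launch(queue, kernel, global));
}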
