Re: [Lcms-user] Threading performance in LCMS2

Sebastien Leon Thu, 30 May 2013 13:32:11 -0700

Hi Richard,

Sorry for replying to this thread so late, but I was away for a few weeks.

> I'm trying to make my transform go fast. I've got a 1920x1080
> RGB image being transformed from sRGB to the display profile.

OK, that sounds very similar to what I have to do in my application.
(I also had to handle premultiplied alpha, but you can ignore that part
of my source.)

I'm not an lcms expert, so I simply optimized the code path used by the
transform that my application always runs, by unrolling some critical
loops.
In other words, I made the code less generic, but it is several times
faster for my particular need.
Warning: this is not suitable for everyone, and I consider my
modification a dirty hack, but it may be of some help to you...

Note that only the TYPE_BGRA_8 format has been optimized (normal
transform + soft-proofing transform). If you are using another
format, you'll need to modify the source slightly to get the same
performance; otherwise you'll see no difference from unmodified lcms 2.4.
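
To give an idea of what this kind of specialization looks like, here is a
minimal sketch in C. It is not the code from the modified lcms sources:
ToyTransform, toy_apply_generic and toy_apply_bgra8 are hypothetical names,
and three per-channel 8-bit lookup tables stand in for the real colour
transform. It only shows how fixing the layout to 4 bytes per pixel (three
colour channels plus alpha) lets the inner loops be unrolled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a colour transform: one 8-bit curve per channel. */
typedef struct {
    uint8_t lut[3][256];
} ToyTransform;

/* Generic path: channel count and number of extra (alpha) bytes are only
   known at run time, so every pixel goes through two inner loops. */
void toy_apply_generic(const ToyTransform *t, const uint8_t *in, uint8_t *out,
                       size_t pixels, size_t channels, size_t extra)
{
    assert(channels <= 3);
    for (size_t i = 0; i < pixels; i++) {
        for (size_t c = 0; c < channels; c++)
            out[c] = t->lut[c][in[c]];
        for (size_t e = 0; e < extra; e++)
            out[channels + e] = in[channels + e];   /* pass alpha through */
        in  += channels + extra;
        out += channels + extra;
    }
}

/* Specialized path: the layout is fixed to B, G, R, A bytes, so the inner
   loops are unrolled by hand and the per-pixel stride is a constant. */
void toy_apply_bgra8(const ToyTransform *t, const uint8_t *in, uint8_t *out,
                     size_t pixels)
{
    for (size_t i = 0; i < pixels; i++, in += 4, out += 4) {
        out[0] = t->lut[0][in[0]];
        out[1] = t->lut[1][in[1]];
        out[2] = t->lut[2][in[2]];
        out[3] = in[3];                             /* alpha is copied as-is */
    }
}
```

Both functions produce the same output for 3 channels + 1 extra byte; the
second one simply cannot handle anything else, which is the trade-off
described above.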

I wrote a small benchmark to test the improvements (run on my Core2Duo).
Here are the results:
- littleCMS legacy code: 92.12 CPU cycles per pixel transformed
- littleCMS hacked: 27.75 CPU cycles per pixel transformed
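
The ratio between these two figures, and what they mean for one 1920x1080
frame, can be checked with a few lines of C (the function names are only
illustrative): 92.12 / 27.75 is about 3.32, and a full frame of 2,073,600
pixels costs roughly 191 million cycles before and 57.5 million after.

```c
/* How many times cheaper the second per-pixel cost is than the first. */
double speedup(double cycles_before, double cycles_after)
{
    return cycles_before / cycles_after;
}

/* Total CPU cycles for one frame at a given per-pixel cost. */
double frame_cycles(int width, int height, double cycles_per_pixel)
{
    return (double)width * (double)height * cycles_per_pixel;
}
```

For example, speedup(92.12, 27.75) gives the single-thread gain, and
frame_cycles(1920, 1080, 27.75) gives the per-frame cost of the modified
code.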

So I get a x3.3 boost.
I also sliced the image and ran several threads, using the Qt threading
model. The improvement from threading was ~x3.4 with 4 physical CPUs (I
guess you could get a x1.8 boost with 2 physical CPUs).
As the overall gain was ~x10, I stopped digging further and now use
this code daily.
Note that I also rewrote the critical loop in SSE4 assembly (just
for fun). I found no real improvement because most of the time is spent
on memory transfers... so I kept the plain C code.
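
The slicing itself can be sketched as follows. This is an illustration
under assumptions, not the Qt code mentioned above: it uses POSIX threads
instead, and band_fn, Band, transform_in_bands and invert_band are
hypothetical names. The image is cut into horizontal bands of whole rows,
one band per thread, and each thread calls a per-row callback.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: transform `pixels` 4-byte pixels from in to out. */
typedef void (*band_fn)(void *ctx, const uint8_t *in, uint8_t *out, size_t pixels);

typedef struct {
    band_fn fn;
    void *ctx;
    const uint8_t *in;
    uint8_t *out;
    size_t width, stride, rows;   /* stride is in bytes */
} Band;

/* Thread entry point: process every row of one band. */
static void *band_worker(void *arg)
{
    Band *b = arg;
    for (size_t y = 0; y < b->rows; y++)
        b->fn(b->ctx, b->in + y * b->stride, b->out + y * b->stride, b->width);
    return NULL;
}

#define MAX_BANDS 16

/* Split `height` rows into `nthreads` bands and run them concurrently.
   Returns 0 on success, -1 on a bad thread count or thread creation failure. */
int transform_in_bands(band_fn fn, void *ctx, const uint8_t *in, uint8_t *out,
                       size_t width, size_t height, size_t stride, size_t nthreads)
{
    pthread_t tid[MAX_BANDS];
    Band band[MAX_BANDS];
    size_t started = 0, row = 0;
    int rc = 0;

    if (nthreads == 0 || nthreads > MAX_BANDS)
        return -1;
    for (size_t i = 0; i < nthreads; i++) {
        /* Rows that do not divide evenly go to the first bands. */
        size_t rows = height / nthreads + (i < height % nthreads ? 1 : 0);
        band[i] = (Band){ fn, ctx, in + row * stride, out + row * stride,
                          width, stride, rows };
        row += rows;
        if (pthread_create(&tid[i], NULL, band_worker, &band[i]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (size_t i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return rc;
}

/* Stand-in callback for trying the splitter without lcms: inverts each byte. */
void invert_band(void *ctx, const uint8_t *in, uint8_t *out, size_t pixels)
{
    (void)ctx;
    for (size_t i = 0; i < pixels * 4; i++)
        out[i] = (uint8_t)(255 - in[i]);
}
```

With lcms2, the callback would wrap cmsDoTransform(hTransform, in, out,
pixels), whose size argument is a pixel count. Whether one transform handle
can be shared between threads, and whether cmsFLAGS_NOCACHE is needed for
that, should be checked against the documentation of the lcms2 version in
use.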

The modified code can be downloaded here:
http://sebastienleon.com/info/littleCMS/littleCMS_PreMulAlphaHack.zip

(As you may not need premultiplied alpha support, I suggest you
undefine the flag I added: CMS_PREMUL_ALPHA_SUPPORT.)

Hope this helps... (you can use the code I added without any
restrictions).

Best regards

Sebastien

Richard Hughes wrote:
> Hi all,
>
> I'm trying to make my transform go fast. I've got a 1920x1080 RGB
> image being transformed from sRGB to the display profile. I've got a
> quad core processor on my development box, no shaders or GPU, and I'm
> trying to do the transform as quickly as possible.
>
> I figured the fastest way to do this would be to set up a threadpool
> with max_threads = 4. Then I have a few choices:
>
> * pop a thread from the pool for every line of the image, creating
> local state with p_in, p_out, width and stride
> * pop a thread from the pool for every n lines of the image, creating
> local state with p_in, p_out, width, stride and rows_to_process (where
> n = height / max_threads)
>
> I figured 4 threads should be ~4x faster than using 1 thread (in the
> second case we should only have 4 threads, so not much overhead), but
> no matter the value of max_threads or 'n' I can only achieve a ~1.9x
> speed-up. I've tried with and without cmsFLAGS_NOCACHE. Any pointers
> very welcome.
>
> Thanks,
>
> Richard
