On Mon, 13 May 2013, Richard Hughes wrote:
>
> I figured 4 threads should be ~4x faster than using 1 thread (in the
> second case we should only have 4 threads, so not much overhead), but
> no matter the value of max_threads or 'n' I can only achieve a ~1.9x
> speed-up. I've tried with and without cmsFLAGS_NOCACHE. Any pointers
> very welcome.

What specific CPU are you using? It would also be good to share the
ICC profile you are testing with, since it can make a difference. If
lcms is only doing indexed lookups for the profile, then memory
accesses may be the bottleneck rather than CPU.

Are you sharing the same transform (created by one thread), or are you
creating an independent transform for each thread (ideally created by
the thread which uses it)? Creating the transform can consume
considerable time, so it can be useful to parallelize that step (even
though it "wastes" CPU), and it may also work better with whatever
NUMA characteristics pertain to your hardware.

Cache-line effects can be significant if there is accidental
cache-line sharing (two cores writing data that sits in the same cache
line). Padding structures to prevent false sharing, or using an
aligned memory allocator, can help surmount such problems. Cache-line
issues can be very hardware/OS specific and mysterious.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

_______________________________________________
Lcms-user mailing list
Lcms-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lcms-user
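
To illustrate the padding suggestion in the reply above, here is a
minimal C11 sketch of per-thread state that is aligned so each worker
owns a whole cache line. Everything in it is an assumption made for
illustration: the names worker_state_t, worker_args_t and run_workers
are hypothetical, the 64-byte line size is only typical and should be
checked for the actual CPU, and the counter increment merely stands in
for the real per-pixel work (no lcms calls are made).

```c
#include <assert.h>    /* static_assert (C11) */
#include <pthread.h>
#include <stdalign.h>  /* alignas (C11) */

#define CACHE_LINE  64  /* assumed line size; verify for your hardware */
#define MAX_WORKERS 16  /* hypothetical upper bound for this sketch */

/* One slot per worker. Aligning the member to CACHE_LINE also rounds
 * sizeof() up to CACHE_LINE, so adjacent array elements never share a
 * cache line. */
typedef struct {
    alignas(CACHE_LINE) long pixels_done;
} worker_state_t;

static_assert(sizeof(worker_state_t) == CACHE_LINE,
              "each worker slot should occupy exactly one cache line");

static worker_state_t g_state[MAX_WORKERS];

typedef struct {
    int  id;        /* index into g_state */
    long n_pixels;  /* amount of stand-in work for this worker */
} worker_args_t;

static void *worker(void *p)
{
    const worker_args_t *a = p;
    for (long i = 0; i < a->n_pixels; i++)
        g_state[a->id].pixels_done++;  /* writes only this worker's line */
    return NULL;
}

/* Start n_threads workers, wait for them, and return the summed
 * counters; returns -1 on a bad argument or if a thread fails to start. */
long run_workers(int n_threads, long n_pixels_each)
{
    pthread_t     tid[MAX_WORKERS];
    worker_args_t args[MAX_WORKERS];
    long total = 0;
    int  started = 0;

    if (n_threads < 1 || n_threads > MAX_WORKERS)
        return -1;

    for (int i = 0; i < n_threads; i++) {
        g_state[i].pixels_done = 0;
        args[i].id = i;
        args[i].n_pixels = n_pixels_each;
        if (pthread_create(&tid[i], NULL, worker, &args[i]) != 0)
            break;
        started++;
    }

    /* Join only the threads that actually started. */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    if (started != n_threads)
        return -1;

    for (int i = 0; i < n_threads; i++)
        total += g_state[i].pixels_done;
    return total;
}
```

Calling run_workers(4, 1000) from a program linked with -pthread sums
four independent counters. Removing the alignas would pack the
counters into one or two cache lines, which is the accidental sharing
described above; whether that produces a measurable slowdown depends
on the hardware and on how often each thread writes.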