On Fri, Jun 26, 2015 at 02:50:45AM +0300, Roman Lebedev wrote:
> There is also a second way (not counting leaving it as it is),
> and I think there are valid arguments why it should be chosen: OpenMP SIMD.

I agree that would be the ideal, but I have several issues below.

I had looked at that and for some reason decided it wasn't a real option
until GCC 5.0 comes out. I now see GCC 4.9 has support for that, which I
have on both my development systems. If clang doesn't have it yet, it
will soon as well (the work is completed, but I don't know offhand if
it's in the latest stable version).

That would, however, limit deployment to only the newest versions of
most Linux distributions (Ubuntu 15.04, Fedora 21, Gentoo only with
manual unmasking).

> 1. more versions of process() - more code to keep synced. Right now we
> already have process() with SSE[3] and process_cl() with OpenCL. And
> even now, there is no checking whether they produce the same results...

This has bothered me too. I have wished there was some way of just
running a single IOP for testing, benchmarking, and verification.
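
To make the verification part concrete, the core of such a check could
be as small as the sketch below. It is only an illustration:
max_abs_diff() is a made-up helper, not something in darktable, and it
assumes the two code paths have already written their output into plain
float buffers of the same length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not darktable code): largest absolute difference
 * between two output buffers, e.g. one produced by the plain process()
 * and one by the SSE or OpenCL path. */
float max_abs_diff(const float *a, const float *b, size_t n)
{
  assert(a && b);
  float worst = 0.0f;
  for(size_t i = 0; i < n; i++)
  {
    float d = a[i] - b[i];
    if(d < 0.0f) d = -d; /* absolute value without needing libm */
    if(d > worst) worst = d;
  }
  return worst;
}
```

A test would then compare the result against whatever tolerance is
acceptable for the IOP in question, since the paths are unlikely to
match bit for bit.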

> 3. AVX is not the last and fastest set (as in, there will be more),
> AVX-512 is already planned

That is true.

The flip side is that getting vectorization to work as efficiently as
hand-tuned code may require extra data copies to reorganize the data
(i.e. interleaved RGB/Lab into planar), which may eat some of the gains.
Maybe compilers are smart enough, but I don't have high hopes for that.

Now, the copies may be needed to take advantage of AVX anyway. Most of
the SSE code is built on the assumption of being able to put one pixel
in a register (R G B blank), and that assumption frequently requires
large changes to scale up to wider registers.
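
The kind of copy I mean looks roughly like this. It is a sketch only:
deinterleave_rgbx() is a hypothetical helper, and it assumes pixels are
stored as four floats each (R G B plus an unused slot), matching the
one-pixel-per-register layout described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not darktable code): split interleaved 4-float
 * pixels (R G B unused) into three planar arrays, so that a vector
 * register can later hold the same channel of several pixels. */
void deinterleave_rgbx(const float *in, float *r, float *g, float *b,
                       size_t npixels)
{
  assert(in && r && g && b);
  for(size_t i = 0; i < npixels; i++)
  {
    r[i] = in[4 * i + 0];
    g[i] = in[4 * i + 1];
    b[i] = in[4 * i + 2];
    /* in[4 * i + 3] is the blank slot and is dropped */
  }
}
```

The result usually has to be interleaved again afterwards, so that is
two extra passes over the image to weigh against whatever the wider
registers buy.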

I'll have to do some experimenting to see just how smart GCC is.

> I propose to not take this route of adding yet more diversity, but do
> the directly opposite thing: Add process_simd(), which will have
> absolutely zero intrinsics, but will exploit OpenMP 4.0 SIMD.

If this is of benefit, why not make the base process() use it? That is,
why a third function at all?
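
For reference, this is roughly what I picture such a loop looking like.
It is a minimal sketch, not a real IOP: scale_buffer() and its
parameters are invented for illustration, and it assumes a C99 compiler
with -fopenmp or -fopenmp-simd enabled. Without those flags the pragma
is simply ignored and the loop still compiles as plain C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an IOP's process(): multiply every value in
 * the buffer by a constant. No intrinsics; the pragma asks the compiler
 * to vectorize the loop with whatever instruction set it targets. */
void scale_buffer(const float *restrict in, float *restrict out, size_t n,
                  float scale)
{
  assert(in != out); /* restrict promises the buffers do not overlap */
#pragma omp simd
  for(size_t i = 0; i < n; i++)
    out[i] = in[i] * scale;
}
```

If a loop like this is all it takes, the same source serves as both the
fallback and the vectorized version, which is the argument for not
having a separate function.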

In fact, it could potentially (long term) eliminate the need for
process_cl() as well, with compilers gaining support for offloading work
to accelerators. As I understand it, this is a big push for AMD's APUs.

> This way, we will have an easy-to-read version of the code, that will
> be compilable and will work on a CPU with any extension set (even ARM
> probably, not that we care at all about it) and, given that the
> compiler supports OpenMP 4.0, will automatically be using the best
> intrinsics set available on the machine in question.

Will it? Or will it only be using the best intrinsics set on the system
it was built on, which will of necessity need to be the lowest common
denominator for binary distributions? That is, do any of the existing
compilers build multiple code paths and choose at runtime based on
processor features? I've not heard of that, but I'd love to be
corrected.
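
Short of the compiler doing it on its own, the selection can be done by
hand. The sketch below assumes GCC or clang on x86 and uses the
target("avx") function attribute together with __builtin_cpu_supports()
(in GCC since 4.8, as far as I know). All function names are made up,
and on other compilers or architectures it just falls back to the plain
version.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*process_fn)(const float *in, float *out, size_t n);

/* Baseline version, built with the default (lowest common) flags. */
void process_plain(const float *in, float *out, size_t n)
{
  for(size_t i = 0; i < n; i++) out[i] = in[i] * 2.0f;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Same source, but the compiler is allowed to emit AVX instructions
 * inside this one function only. */
__attribute__((target("avx")))
void process_avx(const float *in, float *out, size_t n)
{
  for(size_t i = 0; i < n; i++) out[i] = in[i] * 2.0f;
}
#endif

/* Hypothetical dispatcher: pick a version based on the CPU we are
 * actually running on, not the one the binary was built on. */
process_fn choose_process(void)
{
  assert(sizeof(float) == 4); /* sanity check on the assumed layout */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  if(__builtin_cpu_supports("avx")) return process_avx;
#endif
  return process_plain;
}
```

That still means one copy of the function per instruction set, which is
exactly the duplication the proposal is trying to avoid, so it only
helps if the copies can be generated from a single source.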

-- 
Bruce Guenter <br...@untroubled.org>                http://untroubled.org/


_______________________________________________
darktable-devel mailing list
darktable-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/darktable-devel