Re: [Gimp-developer] GIMP and multiple processors
On Sun, Feb 20, 2005 at 11:52:16PM +, Adam D. Moss wrote:

> > I can force it to use both CPUs now, but even with 200% utilization
> > it is 2s slower to run this stupid ubenchmark than on 1 CPU without
> > threads.
>
> Just a vague guess, but the multiprocessor GIMP pixel work scheduler
> might* farm alternating tiles to alternating CPUs. These are
> reasonably likely to have been allocated together and thus sit close
> together in memory, causing memory contention between CPUs.

So it might make sense not to interleave tiles, but to let each thread
start on a different region of the image, like this (for 4 threads):

1xxx
2xxx
3xxx
4xxx

Each thread works on its own memory at first, and only when it has
finished its region does it take tiles from the remaining work. A
binary-search-like algorithm could be used:

- Figure out the largest remaining part of the image (at the start,
  it's the whole image).
- Let x be the number of idle threads.
- If starting up, divide into x regions and assign each thread the
  start of its region; otherwise divide into x+1 regions and assign
  the idle threads the (x+1)st, x-th, (x-1)st etc. region (this leaves
  the first to the thread currently processing it).
- Start processing the regions.
- If any thread runs out of work, restart the algorithm.

This should improve memory locality while distributing work across all
threads, and it also makes sure that non-uniform workloads are
distributed. Consider the above example where thread 4 has finished
its tiles:

333x

It looks like thread 1 will need the longest time to finish its
region, so we jump in there and assign part of its work to 4 (marked
y):

xx4y
333x

Later, 3 finishes:

xx4y

Now 2 has the largest region left, so half of it gets assigned to 3
(marked z):

xx4y
3zzz

The state of affairs can be managed easily, since there are never more
regions than threads.

HTH! Tino.

PS: I don't know whether this algorithm works well in practice; I
actually got the idea while replying to Adam's response.
___ Gimp-developer mailing list Gimp-developer@lists.xcf.berkeley.edu http://lists.xcf.berkeley.edu/mailman/listinfo/gimp-developer
Re: [Gimp-developer] GIMP and multiple processors
On 21.02.2005, at 03:14, [EMAIL PROTECTED] ( Marc) (A.) (Lehmann ) wrote:

> Forcing the NPTL implementation to degrade to legacy
> BTW, is this 2.4 or 2.6?

Linux ulli 2.6.9-k8 #31 SMP Wed Nov 3 10:58:29 CET 2004 x86_64 GNU/Linux

32bit userland from Debian unstable.

Servus, Daniel
RE: [Gimp-developer] GIMP and multiple processors
Allocating batches of tiles to each thread and dynamically
re-allocating them should a thread finish early will complicate the
code and will only provide a very marginal speed improvement. (Each
thread has to obtain a lock on a work queue to get its next tile
either way, but the proposed scheme has a slightly lower chance of
lock contention, since there are effectively multiple queues.)

As far as task sequencing goes, that all depends on how tile memory is
allocated. To maximize locality of reference and minimize paging, the
goal should be to have all threads working on contiguous areas of
memory at the same time where possible.

-Dave

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tino Schwarze
Sent: Monday, 21 February 2005 6:14 PM
To: gimp-developer@lists.xcf.berkeley.edu
Subject: Re: [Gimp-developer] GIMP and multiple processors

...
Re: [Gimp-developer] GIMP and multiple processors
On Mon, Feb 21, 2005 at 09:14:13AM +0100, Tino Schwarze [EMAIL PROTECTED] wrote:

> > I can force it to use both CPUs now, but even with 200% utilization
> > it is 2s slower to run this stupid ubenchmark than on 1 CPU without
> > threads.
>
> Just a vague guess, but the multiprocessor GIMP pixel work scheduler
> might* farm alternating tiles to alternating CPUs. These are
> reasonably likely to have been allocated together and thus sit close
> together in memory

This is unlikely, as the gimp has no say in the physical layout of the
memory it gets from the kernel. Also, interleaving should not have
much of an effect, as the blocks are large, so there will be no
thrashing between CPUs (if hyperthreading is in use, then interleaving
would actually be slightly better due to limited cache associativity).

It's more likely that gimp is currently doing something wrong with
respect to locking or something similar.

--
Marc Lehmann [EMAIL PROTECTED] http://schmorp.de/ XX11-RIPE
Re: [Gimp-developer] GIMP and multiple processors
On Mon, 21 Feb 2005 02:00:39 +0100, Sven Neumann [EMAIL PROTECTED] wrote:

> > It sounds like the granularity of parallelism is too fine. That is,
> > each task is too short, and the overhead of task dispatching (your
> > task queue processing, the kernel's thread context switching, any
> > IPC required, etc.) is longer than the duration of a single task.
>
> The task is not a single pixel but a single tile (that is usually a
> region of 64x64 pixels). GIMP processes pixel regions by iterating
> over the tiles. The multi-threaded pixel processor uses a
> configurable number of threads. Each thread obtains a lock on the
> pixel-region, takes a pointer to the next tile from the queue,
> releases the lock, processes the tile and starts over.

I maintain a threaded image processing library called VIPS:

http://www.vips.ecs.soton.ac.uk/

We looked at granularity a while ago, and the 'sweet spot' at which
the thread start/stop time became insignificant seemed to be around
50x50 pixels. I've pasted some numbers at the end of this mail in case
anyone is interested. I realise GIMP is using a very different
evaluation strategy, but the point (maybe) is that thread manipulation
is rather quick, and you're probably not seeing it with 64x64 pixel
tiles.

FWIW, VIPS works by having a thread pool (rather than a tile queue)
and a simple for(;;) loop over tiles. At each tile, the for() loop
waits for a thread to become free, then assigns it a tile to work on.

The benchmark is a 45 degree rotate of a 4000 by 4000 pixel image.
Look for the point at which real time stops falling. The first two
arguments to try are the tile size. The more recent numbers are really
too small to be accurate :-( but the benchmark is 10 years old and
took minutes back then: oh well.

I'm supposed to be getting a quad Opteron soon, which will be
interesting. Kernel 2.6 would help too, no doubt.
cima: 1 cpu ultrasparc

./try huysum.hr.v fred.v 10 10 1 20   real 30.1   user 24.5
./try huysum.hr.v fred.v 20 20 1 20   real 19.2   user 16.9
./try huysum.hr.v fred.v 30 30 1 20   real 17.8   user 15.4
./try huysum.hr.v fred.v 40 40 1 20   real 17.1   user 15.1
./try huysum.hr.v fred.v 50 50 1 20   real 16.9   user 15.1
./try huysum.hr.v fred.v 60 60 1 20   real 16.6   user 15.0
./try huysum.hr.v fred.v 70 70 1 20   real 17.2   user 15.2
./try huysum.hr.v fred.v 80 80 1 20   real 17.3   user 15.1
./try huysum.hr.v fred.v 90 90 1 20   real 17.4   user 15.3

perugino: 2 cpu supersparc

./try huysum.hr.v fred.v 10 10 1 20   real 0m51.123s   user 1m7.623s
./try huysum.hr.v fred.v 20 20 1 20   real 0m24.601s   user 0m41.133s
./try huysum.hr.v fred.v 30 30 1 20   real 0m21.931s   user 0m38.393s
./try huysum.hr.v fred.v 40 40 1 20   real 0m20.208s   user 0m35.653s
./try huysum.hr.v fred.v 50 50 1 20   real 0m20.109s   user 0m35.283s
./try huysum.hr.v fred.v 60 60 1 20   real 0m19.501s   user 0m34.513s
./try huysum.hr.v fred.v 70 70 1 20   real 0m20.435s   user 0m34.813s
./try huysum.hr.v fred.v 80 80 1 20   real 0m20.558s   user 0m35.293s
./try huysum.hr.v fred.v 90 90 1 20   real 0m20.785s   user 0m35.313s

Run on furini, 2 CPU 450MHz PII Xeon, kernel 2.4.4, vips-7.7.19, gcc 2.95.3

./try huysum.hr.v fred.v 10 10 1 20   real 0m4.542s   user 0m4.350s   sys 0m3.800s
./try huysum.hr.v fred.v 20 20 1 20   real 0m2.206s   user 0m2.750s   sys 0m1.250s
./try huysum.hr.v fred.v 30 30 1 20   real 0m1.678s   user 0m2.610s   sys 0m0.580s
./try huysum.hr.v fred.v 40 40 1 20   real 0m1.483s   user 0m2.460s   sys 0m0.410s
./try huysum.hr.v fred.v 50 50 1 20   real 0m1.443s   user 0m2.330s   sys 0m0.350s
./try huysum.hr.v fred.v 60 60 1 20   real 0m1.385s   user 0m2.390s   sys 0m0.220s
./try huysum.hr.v fred.v 70 70 1 20   real 0m1.394s   user 0m2.460s   sys 0m0.150s
./try huysum.hr.v fred.v 80 80 1 20   real 0m1.365s   user 0m2.360s   sys 0m0.200s
./try huysum.hr.v fred.v 90 90 1 20   real 0m1.393s   user 0m2.450s   sys 0m0.180s

Run on manet, 2 CPU 2.5GHz P4 Xeon, kernel 2.4.18, vips-7.8.5, gcc 2.95.3

./try huysum.hr.v fred.v 10 10 1 20   real 0m1.582s   user 0m1.640s   sys 0m1.470s
./try huysum.hr.v fred.v 20 20 1 20
Re: [Gimp-developer] GIMP and multiple processors
On Sun, 2005-02-20 at 23:52 +, Adam D. Moss wrote:

> Daniel Egger wrote:
> > I can force it to use both CPUs now, but even with 200% utilization
> > it is 2s slower to run this stupid ubenchmark than on 1 CPU without
> > threads.
>
> Just a vague guess, but the multiprocessor GIMP pixel work scheduler
> might* farm alternating tiles to alternating CPUs. These are
> reasonably likely to have been allocated together and thus sit close
> together in memory, causing memory contention between CPUs.

I'm not sure what system the benchmark is being run on, but the cache
line size on a P4 is 128 bytes (most other systems have smaller cache
line sizes). A simple test to see if this is the problem would be to
change the tile allocation code to allocate an extra 128 bytes of
memory per tile. See app/base/tile.c line 221.

I think it is more likely that the problem is with sharing some other
data between processors (for example the gradient struct?).

I think it would be a good idea to get some timings from some other
operations also. Perhaps painting with a large brush, or flattening a
complicated image.

Jay Cox [EMAIL PROTECTED]
Re: [Gimp-developer] GIMP and multiple processors
On 21.02.2005, at 19:07, Jay Cox wrote:

> I'm not sure what system the benchmark is being run on, but the cache
> line size on a P4 is 128 bytes (most other systems have smaller cache
> line sizes). A simple test to see if this is the problem would be to
> change the tile allocation code to allocate an extra 128 bytes of
> memory per tile. See app/base/tile.c line 221

Dual Opteron.

> I think it would be a good idea to get some timings from some other
> operations also. Perhaps painting with a large brush, or flattening a
> complicated image.

Sven, what procedures are currently parallelized?

Servus, Daniel
RE: [Gimp-developer] GIMP and multiple processors
That all depends on how it allocates tile memory. If it mallocs tile
by tile, then the tiles could be anywhere, including on separate
pages. If it mallocs a large block (system page size) and then divides
it into tiles itself, then those tiles are *likely* to reside on the
same page, which is what matters most for performance if memory is
stretched.

-Dave

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED] ( Marc) (A.) (Lehmann )
Sent: Monday, 21 February 2005 10:45 PM
To: gimp-developer@lists.xcf.berkeley.edu
Subject: Re: [Gimp-developer] GIMP and multiple processors

...

> This is unlikely, as the gimp has no say in the physical layout of
> the memory it gets from the kernel.
Re: [Gimp-developer] GIMP and multiple processors
On Mon, 2005-02-21 at 21:34 +0100, Daniel Egger wrote:

> On 21.02.2005, at 19:07, Jay Cox wrote:
> > A simple test to see if this is the problem would be to change the
> > tile allocation code to allocate an extra 128 bytes of memory per
> > tile. See app/base/tile.c line 221
>
> Dual Opteron.

Should only need 64 bytes then.

> Sven, what procedures are currently parallelized?

My recollection from many years ago is this:

- All of the tools-color_tools
- Histogram calculation
- All compositing(?), including:
  - Layer stack compositing
  - Brush-drawable compositing

More recently Sven has parallelized:

- gimp_image_contiguous_region_by_color
- gimp_drawable_desaturate
- gradient_fill_region

Jay Cox [EMAIL PROTECTED]