Re: [Gimp-developer] GIMP and multiple processors

2005-02-21 Thread Tino Schwarze
On Sun, Feb 20, 2005 at 11:52:16PM +, Adam D. Moss wrote:

> > I can force it to use both CPUs now, but even with
> > 200% utilization it is 2s slower to run this stupid
> > µbenchmark than on 1 CPU without threads.
>
> Just a vague guess, but the multiprocessor GIMP pixel
> work scheduler might* farm alternating tiles to alternating
> CPUs.  These are reasonably likely to have been allocated
> together and thus sit close together in memory, causing
> memory contention between CPUs.

So it might make sense not to interleave tiles but to let each thread
start on a different region of the image, like this (for 4 threads):

1xxx
2xxx
3xxx
4xxx


So each thread works on its own memory at first, and only when it has
finished its region does it take tiles from the remaining work. Some
binary-search-like algorithm could be used, like this:

- figure out the largest remaining part of the image (at the start,
  it's the whole image)
- let x be the number of idle threads
- if starting up,
    divide into x regions
    assign each thread the start of its region
  else
    divide into x+1 regions
    assign the idle threads the (x+1)-st, x-th, (x-1)-st etc. regions
    (this leaves the first region to the thread currently processing it)
- start processing the regions
- if any thread runs out of work, restart the algorithm

This should improve memory locality while distributing work across all
threads and making sure that non-uniform workloads are also distributed.

Consider the above example, where thread 4 has finished its tiles:





(diagram largely lost in the archive: it showed threads 1-3 partway
through their regions, e.g. thread 3 at "333x", with thread 4's region
already complete)




It looks like 1 will need the longest time to finish its region, so we
jump in there and assign part of its work to 4 (marked y):


(diagram: thread 4 now works at the tail of thread 1's region, shown as
"xx4y"; thread 3 is still at "333x")




Later, 3 finishes:


(diagram: as above, with thread 3's region now also complete)







Now, 2 has the largest region left, so half of it gets assigned to 3
(marked z):


(diagram: thread 3 now works on the second half of thread 2's region,
shown as "3zzz")





The state of affairs can be managed easily since there are never more
regions than threads.

HTH! Tino.

PS: I don't know whether this algorithm works well in practice; I
actually got the idea while replying to Adam's response.

___
Gimp-developer mailing list
Gimp-developer@lists.xcf.berkeley.edu
http://lists.xcf.berkeley.edu/mailman/listinfo/gimp-developer


Re: [Gimp-developer] GIMP and multiple processors

2005-02-21 Thread Daniel Egger
On 21.02.2005, at 03:14, [EMAIL PROTECTED] (Marc A. Lehmann) wrote:

> Forcing the NPTL implementation to degrade to legacy
>
> BTW, is this 2.4 or 2.6?

Linux ulli 2.6.9-k8 #31 SMP Wed Nov 3 10:58:29 CET 2004 x86_64 GNU/Linux
32bit userland from Debian unstable.
Servus,
  Daniel




RE: [Gimp-developer] GIMP and multiple processors

2005-02-21 Thread David Bonnell
Allocating batches of tiles to each thread and dynamically re-allocating
them should a thread finish early will complicate the code and will only
provide a very marginal speed improvement.  (Each thread has to obtain a
lock on a work queue to get its next tile either way, but the proposed
scheme has a slightly lower chance of lock contention since there are
effectively multiple queues.)

As far as task sequencing goes, that all depends on how tile memory is
allocated.  To maximize locality of reference and minimize paging, the
goal should be to have all threads working on contiguous areas of memory
at the same time where possible.


-Dave

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Tino
Schwarze
Sent: Monday, 21 February 2005 6:14 PM
To: gimp-developer@lists.xcf.berkeley.edu
Subject: Re: [Gimp-developer] GIMP and multiple processors

...



Re: [Gimp-developer] GIMP and multiple processors

2005-02-21 Thread pcg
On Mon, Feb 21, 2005 at 09:14:13AM +0100, Tino Schwarze [EMAIL PROTECTED] wrote:

> > > I can force it to use both CPUs now, but even with
> > > 200% utilization it is 2s slower to run this stupid
> > > µbenchmark than on 1 CPU without threads.
> >
> > Just a vague guess, but the multiprocessor GIMP pixel
> > work scheduler might* farm alternating tiles to alternating
> > CPUs.  These are reasonably likely to have been allocated
> > together and thus sit close together in memory

This is unlikely, as the gimp has no say in the physical layout of the memory
it gets from the kernel.

Also, interleaving should not have much of an effect, as the blocks are
large, so there will be no thrashing between cpus (if hyperthreading is in
use, then interleaving would actually be slightly better due to limited
cache associativity).

It's more likely that the GIMP is currently doing something wrong with
respect to locking or something similar.

-- 
The choice of a
  -==- _GNU_
  ==-- _   generation Marc Lehmann
  ---==---(_)__  __   __  [EMAIL PROTECTED]
  --==---/ / _ \/ // /\ \/ /  http://schmorp.de/
  -=/_/_//_/\_,_/ /_/\_\  XX11-RIPE


Re: [Gimp-developer] GIMP and multiple processors

2005-02-21 Thread John Cupitt
On Mon, 21 Feb 2005 02:00:39 +0100, Sven Neumann [EMAIL PROTECTED] wrote:
> > It sounds like the granularity of parallelism is too fine.  That is,
> > each task is too short and the overhead of task dispatching (your
> > task queue processing, the kernel's thread context switching, any IPC
> > required, etc.) is longer than the duration of a single task.
>
> The task is not a single pixel but a single tile (that is usually a
> region of 64x64 pixels). GIMP processes pixel regions by iterating
> over the tiles. The multi-threaded pixel processor uses a configurable
> number of threads. Each thread obtains a lock on the pixel-region,
> takes a pointer to the next tile from the queue, releases the lock,
> processes the tile and starts over.

I maintain a threaded image processing library called VIPS. 

  http://www.vips.ecs.soton.ac.uk/

We looked at granularity a while ago, and the 'sweet spot' at which the
thread start/stop time became insignificant seemed to be around 50x50
pixels. I've pasted some numbers at the end of this mail in case anyone
is interested. I realise the GIMP is using a very different evaluation
strategy, but the point (maybe) is that thread manipulation is rather
quick and you're probably not seeing it with 64x64 pixel tiles.

FWIW, vips works by having a thread pool (rather than a tile queue)
and a simple for(;;) loop over tiles. At each tile, the for() loop
waits for a thread to become free, then assigns it a tile to work on.

The benchmark is a 45 degree rotate of a 4000 by 4000 pixel image.
Look for the point at which real time stops falling. The first two
arguments to try are the tile size. The more recent timings are
really too small to be accurate :-( but the benchmark is 10 years old
and took minutes back then: oh well. I'm supposed to be getting a quad
Opteron soon, which will be interesting. Kernel 2.6 would help too, no
doubt.

cima: 1 cpu ultrasparc
./try huysum.hr.v fred.v 10 10 1 20
real   30.1
user   24.5
./try huysum.hr.v fred.v 20 20 1 20
real   19.2
user   16.9
./try huysum.hr.v fred.v 30 30 1 20
real   17.8
user   15.4
./try huysum.hr.v fred.v 40 40 1 20
real   17.1
user   15.1
./try huysum.hr.v fred.v 50 50 1 20
real   16.9
user   15.1
./try huysum.hr.v fred.v 60 60 1 20
real   16.6
user   15.0
./try huysum.hr.v fred.v 70 70 1 20
real   17.2
user   15.2
./try huysum.hr.v fred.v 80 80 1 20
real   17.3
user   15.1
./try huysum.hr.v fred.v 90 90 1 20
real   17.4
user   15.3

perugino: 2 cpu supersparc
./try huysum.hr.v fred.v 10 10 1 20
real0m51.123s
user1m7.623s
./try huysum.hr.v fred.v 20 20 1 20
real0m24.601s
user0m41.133s
./try huysum.hr.v fred.v 30 30 1 20
real0m21.931s
user0m38.393s
./try huysum.hr.v fred.v 40 40 1 20
real0m20.208s 
user0m35.653s
./try huysum.hr.v fred.v 50 50 1 20
real0m20.109s
user0m35.283s 
./try huysum.hr.v fred.v 60 60 1 20
real0m19.501s
user0m34.513s
./try huysum.hr.v fred.v 70 70 1 20
real0m20.435s
user0m34.813s
./try huysum.hr.v fred.v 80 80 1 20
real0m20.558s
user0m35.293s
./try huysum.hr.v fred.v 90 90 1 20
real0m20.785s
user0m35.313s

Run on furini, 2 CPU 450MHz PII Xeon, kernel 2.4.4, vips-7.7.19, gcc 2.95.3
./try huysum.hr.v fred.v 10 10 1 20
real0m4.542s
user0m4.350s
sys 0m3.800s
./try huysum.hr.v fred.v 20 20 1 20
real0m2.206s
user0m2.750s
sys 0m1.250s
./try huysum.hr.v fred.v 30 30 1 20
real0m1.678s
user0m2.610s
sys 0m0.580s
./try huysum.hr.v fred.v 40 40 1 20
real0m1.483s
user0m2.460s
sys 0m0.410s
./try huysum.hr.v fred.v 50 50 1 20
real0m1.443s
user0m2.330s
sys 0m0.350s
./try huysum.hr.v fred.v 60 60 1 20
real0m1.385s
user0m2.390s
sys 0m0.220s
./try huysum.hr.v fred.v 70 70 1 20
real0m1.394s
user0m2.460s
sys 0m0.150s
./try huysum.hr.v fred.v 80 80 1 20
real0m1.365s
user0m2.360s
sys 0m0.200s
./try huysum.hr.v fred.v 90 90 1 20
real0m1.393s
user0m2.450s
sys 0m0.180s

Run on manet, 2 CPU 2.5GHz P4 Xeon, kernel 2.4.18, vips-7.8.5, gcc 2.95.3
./try huysum.hr.v fred.v 10 10 1 20
real0m1.582s
user0m1.640s
sys 0m1.470s
./try huysum.hr.v fred.v 20 20 1 20
 

Re: [Gimp-developer] GIMP and multiple processors

2005-02-21 Thread Jay Cox
On Sun, 2005-02-20 at 23:52 +, Adam D. Moss wrote:
> Daniel Egger wrote:
> > I can force it to use both CPUs now, but even with
> > 200% utilization it is 2s slower to run this stupid
> > µbenchmark than on 1 CPU without threads.
>
> Just a vague guess, but the multiprocessor GIMP pixel
> work scheduler might* farm alternating tiles to alternating
> CPUs.  These are reasonably likely to have been allocated
> together and thus sit close together in memory, causing
> memory contention between CPUs.
 

I'm not sure what system the benchmark is being run on, but the cache
line size on a P4 is 128 bytes (most other systems have smaller cache
line sizes).  A simple test to see if this is the problem would be to
change the tile allocation code to allocate an extra 128 bytes of
memory per tile.  See app/base/tile.c line 221.

I think it is more likely that the problem is with sharing some other
data between processors (for example the gradient struct?)

I think it would be a good idea to get some timings from some other
operations also.  Perhaps painting with a large brush, or flattening a
complicated image.

Jay Cox
[EMAIL PROTECTED]






Re: [Gimp-developer] GIMP and multiple processors

2005-02-21 Thread Daniel Egger
On 21.02.2005, at 19:07, Jay Cox wrote:
> I'm not sure what system the benchmark is being run on, but the cache
> line size on a P4 is 128 bytes (most other systems have smaller cache
> line sizes).  A simple test to see if this is the problem would be to
> change the tile allocation code to allocate an extra 128 bytes of
> memory per tile.  See app/base/tile.c line 221

Dual Opteron.

> I think it would be a good idea to get some timings from some other
> operations also.  Perhaps painting with a large brush, or flattening a
> complicated image.

Sven, what procedures are currently parallelized?
Servus,
  Daniel




RE: [Gimp-developer] GIMP and multiple processors

2005-02-21 Thread David Bonnell
That all depends on how it allocates tile memory.  If it mallocs tile by
tile, then the tiles could be anywhere, including on separate pages.  If
it mallocs a large block (>= system page size) and then divides it into
tiles itself, then those tiles are *likely* to reside on the same page,
which is what matters most for performance if memory is stretched.


-Dave

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
[EMAIL PROTECTED] ( Marc) (A.) (Lehmann )
Sent: Monday, 21 February 2005 10:45 PM
To: gimp-developer@lists.xcf.berkeley.edu
Subject: Re: [Gimp-developer] GIMP and multiple processors

...

> This is unlikely, as the gimp has no say in the physical layout of the
> memory it gets from the kernel.




Re: [Gimp-developer] GIMP and multiple processors

2005-02-21 Thread Jay Cox
On Mon, 2005-02-21 at 21:34 +0100, Daniel Egger wrote:

> On 21.02.2005, at 19:07, Jay Cox wrote:
>
> > I'm not sure what system the benchmark is being run on, but the cache
> > line size on a P4 is 128 bytes (most other systems have smaller cache
> > line sizes).  A simple test to see if this is the problem would be to
> > change the tile allocation code to allocate an extra 128 bytes of
> > memory per tile.  See app/base/tile.c line 221
>
> Dual Opteron.

Should only need 64 bytes then.

 
> > I think it would be a good idea to get some timings from some other
> > operations also.  Perhaps painting with a large brush, or flattening a
> > complicated image.
>
> Sven, what procedures are currently parallelized?
 


My recollection from many years ago is this:
All of the tools->color_tools
Histogram calculation
All compositing(?) including:
  Layer stack compositing
  Brush->drawable compositing

More recently Sven has parallelized:
gimp_image_contiguous_region_by_color
gimp_drawable_desaturate
gradient_fill_region

Jay Cox
[EMAIL PROTECTED]

