Re: [Gimp-developer] [patch] Major speedup for whirlpinch plugin

2001-04-09 Thread egger

On  5 Apr, Kelly Martin wrote:

 Tiles are 64x64 by default, and changing them is a bad idea because it
 makes your .xcf files nontransportable.

 Not to forget that this size is more or less hardcoded.

-- 

Servus,
   Daniel

___
Gimp-developer mailing list
[EMAIL PROTECTED]
http://lists.xcf.berkeley.edu/mailman/listinfo/gimp-developer



Re: [Gimp-developer] [patch] Major speedup for whirlpinch plugin

2001-04-09 Thread Kelly Martin

On Mon, 9 Apr 2001 23:02:44 +0200 (CEST), [EMAIL PROTECTED] said:

On 5 Apr, Kelly Martin wrote:
Tiles are 64x64 by default, and changing them is a bad idea because
it makes your .xcf files nontransportable.

 Not to forget that this size is more or less hardcoded.

It's a #define, yes.  And because the file format is tightly bound to
the tile size, changing it at compile time breaks XCF save and load in
ways I don't even want to think about.  I don't think that XCF v0 or
v1 specifies the tile size anywhere in the data stream, either.

I tried to modify the XCF loader  saver once to do reblocking, but it
gave me a headache and I moved on to other, more interesting things,
like feeding my cats. :)

Kelly
___
Gimp-developer mailing list
[EMAIL PROTECTED]
http://lists.xcf.berkeley.edu/mailman/listinfo/gimp-developer



Re: [Gimp-developer] [patch] Major speedup for whirlpinch plugin

2001-04-06 Thread Ernst Lippe

I think this is yet another tile-cache problem.

Georg Acher wrote:
 
 On Thu, Apr 05, 2001 at 12:45:52PM -0500, Kelly Martin wrote:
 
  Hm, it does not.  The issue with whirlpinch is that there's only a
  weak locality relationship between destionation pixels (which are
  iterated across the image) and source pixels (which are fetched with
  the pixel fetcher).  I haven't looked too closely at your blocking
 
 That is right, but destination and source for themselves have good locality
 (ie. the next pixel isn't 500 pixels away from the last).
This is an important observation. 

 
  patch, but I suspect that much the same improvement would be had by
  using a pixel region (which respects tiles) to iterate across the
  destination region.
I agree, but the current algorithm writes two rows (top and bottom) at
the same time, so it is not immediately possible to use the standard
pixel region iterator.

 
 That is possible... Is there a filter that definitely uses the pixel region
 stuff? Most filters I have seen only use one row, which may not be enough
 "locally", since it uses only one pixel but has to fill a whole cacheline
 (4/8 pixel). I will try whether I can speed up bumpmap also, since this
 takes also the magical 30s and is much more often used in scripts than the
 whirlpinch module.
 
 I don't know how large a tile is, but since IMHO the major impact of
 blocking seems to come from the CPU cache, I suspect that is too big for
 older CPUs. I have done the whirlpinch blocking thing about three years ago
 (and forgot to send the patch), and tried it on an Alpha21164 and a P5.

I think you're looking in the wrong direction here. Similar to the
bumpmap (see my other message) I strongly suspect that the tile-cache is
too small.
You can check this for yourself by profiling your code. I expect that
there is hardly any difference in processor time between your version
and the original but that there is an important difference in the number
of calls to gimp_tile_get and gimp_tile_put. (Sorry, did not try that
myself).
Requesting tiles from the main Gimp process is pretty expensive, it
involves several context switches and a lot of copying. This is orders
of magnitude slower than copying data within the plugin process from the
tile cache to some other buffer. The whirlpinch plugin uses a cache size
that is only big enough to handle the output tiles, there is no room for
input tiles. This leads to a series of cache misses and horrible
performance. Your modifications improve the cache-hit ratio because it
processes larger chunks of data.
However, it is not a fundamental solution. The easiest short time
solution is to increase the cache size with the number of input tiles
that are needed to compute one row of the output (I don't know if this
can be easily determined).
The more fundamental solution is to rewrite the algorithm so that it
works on a tile by tile basis. In that case the minimum cache size is
only 2 (for the output tiles) plus the number of input tiles that are
needed to compute one output tile. Of course it is better to use a
somewhat larger cache size because adjacent output tiles can use the
same input tiles.



Greetings,

Ernst
[EMAIL PROTECTED]
___
Gimp-developer mailing list
[EMAIL PROTECTED]
http://lists.xcf.berkeley.edu/mailman/listinfo/gimp-developer



[Gimp-developer] [patch] Major speedup for whirlpinch plugin

2001-04-05 Thread Georg Acher

Hi,
I don't know who's currently "responsible" for the whirlpinch plugin, so I
post my patch to this list.

I have modified whirlpinch slightly to use "blocking", ie. doing all
calculations in small squares (32*32). With that technique very common in
numerical computing, the CPU caches (and for GIMP) the tile cache have a much
higher hit rate. 
The boost is quite spectacular: The original whirlpinch on a larger image 
(1400*1400) needs on a Athlon-600 30s to complete, with my patch only 6.5s.
That's a speedup by a factor of 4.5 without any change in the algorithm
itself!

The changes are relatively small (effectively about 10 lines) and affect 
mostly clipping.

I have found no side efects of the patch...

The blocking can IMHO easily used for a lot of other filters, and should
give a large speedup for most of GIMP's filters.

Please try out the patch and apply it to the source tree if you like it ;-)
-- 
 Georg Acher, [EMAIL PROTECTED] 
 http://www.in.tum.de/~acher/
  "Oh no, not again !" The bowl of petunias  


--- whirlpinch.c.orgThu Apr  5 17:47:17 2001
+++ whirlpinch.cThu Apr  5 18:49:09 2001
@@ -22,6 +22,14 @@
  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
  */
 
+/* Version 2.10:
+ *
+ * Major Speedup by use of "blocking", ie. doing the calcualations
+ * in small squares, thus gaining a performance boost from CPU caches
+ * and the tile cache.
+ *
+ * Georg Acher, [EMAIL PROTECTED]
+ */
 
 /* Version 2.09:
  *
@@ -63,7 +71,7 @@
 
 
 #define PLUG_IN_NAME"plug_in_whirl_pinch"
-#define PLUG_IN_VERSION "May 1997, 2.09"
+#define PLUG_IN_VERSION "April 2001, 2.10"
 
 /* Magic numbers */
 
@@ -71,6 +79,10 @@
 #define SCALE_WIDTH  200
 #define ENTRY_WIDTH  60
 
+/* blocking size, 32*32pixels is a good compromise for all CPUs */
+
+#define BLOCKING 32
+
 /* Types */
 
 typedef struct
@@ -366,12 +378,13 @@
   guchar  *top_row, *bot_row;
   guchar  *top_p, *bot_p;
   gint row, col;
+  gint row1,col1;
   guchar   pixel[4][4];
   guchar   values[4];
   double   whirl;
   double   cx, cy;
   int  ix, iy;
-  int  i;
+  int  i,n;
   guchar   bg_color[4];
   pixel_fetcher_t *pft, *pfb;
 
@@ -406,112 +419,133 @@
   whirl   = wpvals.whirl * G_PI / 180;
   radius2 = radius * radius * wpvals.radius;
 
-  for (row = sel_y1; row = ((sel_y1 + sel_y2) / 2); row++)
+  /* WhirlPinch in small squares to benefit from cache effects
+  (tile cache, CPU cache) 
+  20010405 GA
+  */
+  for (row1 = sel_y1; row1 = ((sel_y1 + sel_y2) / 2); row1+=BLOCKING)
 {
-  top_p = top_row;
-  bot_p = bot_row + img_bpp * (sel_width - 1);
-
-  for (col = sel_x1; col  sel_x2; col++)
-   {
- if (calc_undistorted_coords (col, row, whirl, wpvals.pinch, cx, cy))
-   {
- /* We are inside the distortion area */
-
- /* Top */
-
- if (cx = 0.0)
-   ix = (int) cx;
- else
-   ix = -((int) -cx + 1);
-
- if (cy = 0.0)
-   iy = (int) cy;
- else
-   iy = -((int) -cy + 1);
-
- pixel_fetcher_get_pixel (pft, ix, iy, pixel[0]);
- pixel_fetcher_get_pixel (pft, ix + 1, iy, pixel[1]);
- pixel_fetcher_get_pixel (pft, ix, iy + 1, pixel[2]);
- pixel_fetcher_get_pixel (pft, ix + 1, iy + 1, pixel[3]);
-
- for (i = 0; i  img_bpp; i++)
-   {
- values[0] = pixel[0][i];
- values[1] = pixel[1][i];
- values[2] = pixel[2][i];
- values[3] = pixel[3][i];
-
- *top_p++ = bilinear (cx, cy, values);
-   }
-
- /* Bottom */
-
- cx = cen_x + (cen_x - cx);
- cy = cen_y + (cen_y - cy);
-
- if (cx = 0.0)
-   ix = (int) cx;
- else
-   ix = -((int) -cx + 1);
-
- if (cy = 0.0)
-   iy = (int) cy;
- else
-   iy = -((int) -cy + 1);
-
- pixel_fetcher_get_pixel (pfb, ix, iy, pixel[0]);
- pixel_fetcher_get_pixel (pfb, ix + 1, iy, pixel[1]);
- pixel_fetcher_get_pixel (pfb, ix, iy + 1, pixel[2]);
- pixel_fetcher_get_pixel (pfb, ix + 1, iy + 1, pixel[3]);
-
- for (i = 0; i  img_bpp; i++)
-   {
- values[0] = pixel[0][i];
- values[1] = pixel[1][i];
- values[2] = pixel[2][i];
- values[3] = pixel[3][i];
-
- *bot_p++ = bilinear (cx, cy, values);
-   }
-
- bot_p -= 2 * img_bpp; /* We move backwards! */
-   }
- else
-   {
- /*  We are outside the distortion area;
-  *  just 

[Gimp-developer] [patch] Major speedup for whirlpinch plugin

2001-04-05 Thread Kelly Martin

On Thu, 5 Apr 2001 12:36:05 -0500, Kelly Martin [EMAIL PROTECTED] said:

I have modified whirlpinch slightly to use "blocking", ie. doing
all calculations in small squares (32*32). With that technique very
common in numerical computing, the CPU caches (and for GIMP) the
tile cache have a much higher hit rate. The boost is quite
spectacular: The original whirlpinch on a larger image (1400*1400)
needs on a Athlon-600 30s to complete, with my patch only 6.5s.
That's a speedup by a factor of 4.5 without any change in the
algorithm itself!

I was under the impression that whirlpinch used tile regions, which
should do blocking automatically.  It's been a couple years since I
looked at the code, though.

Hm, it does not.  The issue with whirlpinch is that there's only a
weak locality relationship between destionation pixels (which are
iterated across the image) and source pixels (which are fetched with
the pixel fetcher).  I haven't looked too closely at your blocking
patch, but I suspect that much the same improvement would be had by
using a pixel region (which respects tiles) to iterate across the
destination region.  

Whirlpinch also has an optimization to do the top and bottom whirls
simultaneously to save calculations.  This might actually be more of a
loss because of the locality hit on the tile cache.

If I get some spare time, I might look at this more closely.

Kelly
___
Gimp-developer mailing list
[EMAIL PROTECTED]
http://lists.xcf.berkeley.edu/mailman/listinfo/gimp-developer



Re: [Gimp-developer] [patch] Major speedup for whirlpinch plugin

2001-04-05 Thread Georg Acher

On Thu, Apr 05, 2001 at 12:45:52PM -0500, Kelly Martin wrote:
 
 Hm, it does not.  The issue with whirlpinch is that there's only a
 weak locality relationship between destionation pixels (which are
 iterated across the image) and source pixels (which are fetched with
 the pixel fetcher).  I haven't looked too closely at your blocking

That is right, but destination and source for themselves have good locality
(ie. the next pixel isn't 500 pixels away from the last).
I have tried whirlpinch with huge distortion values, and the worst case was
8s instead of 6.5s, still better than 30s ;-)

 patch, but I suspect that much the same improvement would be had by
 using a pixel region (which respects tiles) to iterate across the
 destination region.  

That is possible... Is there a filter that definitely uses the pixel region
stuff? Most filters I have seen only use one row, which may not be enough
"locally", since it uses only one pixel but has to fill a whole cacheline
(4/8 pixel). I will try whether I can speed up bumpmap also, since this
takes also the magical 30s and is much more often used in scripts than the
whirlpinch module.

I don't know how large a tile is, but since IMHO the major impact of
blocking seems to come from the CPU cache, I suspect that is too big for
older CPUs. I have done the whirlpinch blocking thing about three years ago
(and forgot to send the patch), and tried it on an Alpha21164 and a P5. 

Both have only very small L1-caches (Alpha has 8K) and relatively slow L2
caches (+L3 for Alpha). The best blocking factor for those CPUs was between
64 and 128. My Athlon now also tolerates 512, 32 only is only slightly (2%)
slower, but should work for all CPUs fairly good.

-- 
 Georg Acher, [EMAIL PROTECTED] 
 http://www.in.tum.de/~acher/
  "Oh no, not again !" The bowl of petunias  
___
Gimp-developer mailing list
[EMAIL PROTECTED]
http://lists.xcf.berkeley.edu/mailman/listinfo/gimp-developer