The speedup you see is probably caused mainly by better caching.
There is a bug in the tile cache size for the plugin.
Under most circumstances the cache is too small, which means that
every requested tile is missing from the cache and must be transmitted
from the main GIMP process (SLOW).
Effectively, this means that every tile of the drawable is read and
written gimp_tile_height (normally 64) times. Your modification is
better: I expect that every tile is read and written
gimp_tile_height/BLOCKS = 2 times (but when a selection is used it will
normally be 3 times, because the blocks are then usually not aligned
with the tile rows).
This could be improved somewhat by setting BLOCKS = gimp_tile_height.
When the selection starts at a y-position that is a multiple of
gimp_tile_height (e.g. when there is no selection), every tile will
probably be read/written once; when the selection starts at a
different position, every tile will still be read/written twice.
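
The counts above can be checked with a small model. This is only a
sketch: it assumes the plugin walks the selection in bands of `block`
rows, and that the cache keeps nothing between bands, so each band that
overlaps a tile row costs one read/write of its tiles. The function name
and parameters are made up for illustration and are not part of the
plugin or of libgimp.

```c
#include <assert.h>

/*
 * Count how many row bands overlap one tile row.
 *
 * tile_height : height of a tile in pixels (normally 64)
 * block       : rows processed per band (1 = one row at a time)
 * y_start     : first row processed (top of the selection)
 * y_end       : one past the last row processed
 * tile_row    : index of the tile row, counted from the top of the drawable
 *
 * Assumption: nothing survives in the cache between bands, so the
 * result is also the number of times each tile in that row is
 * read and written.
 */
int passes_over_tile_row(int tile_height, int block,
                         int y_start, int y_end, int tile_row)
{
    assert(tile_height > 0 && block > 0);
    assert(y_start >= 0 && y_end > y_start && tile_row >= 0);

    /* Pixel rows covered by this tile row, clipped to the processed area. */
    int top = tile_row * tile_height;
    int bottom = top + tile_height;
    if (top < y_start)
        top = y_start;
    if (bottom > y_end)
        bottom = y_end;
    if (top >= bottom)
        return 0; /* tile row lies outside the processed area */

    /* Bands are numbered from y_start; count the ones touching [top, bottom). */
    int first_band = (top - y_start) / block;
    int last_band = (bottom - 1 - y_start) / block;
    return last_band - first_band + 1;
}
```

With 64-row tiles, a band of 1 row gives 64 passes, a band of 32 rows
gives 2 passes when y_start is 0 and 3 passes when y_start is 10, and a
band of 64 rows gives 1 pass when y_start is 0 and 2 passes when
y_start is 10.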
I used a different approach that gives speedups similar to yours but
that should read/write every tile only once. The solution is pretty
simple: just enlarge the cache size.
What is the problem with the cache size? The current code uses:
gimp_tile_cache_ntiles (2 * (drawable->width +
gimp_tile_width () - 1) /
The idea here is to cache one row of tiles for the source drawable and
one row for the destination drawable. But this is not enough, because
there is no room in the cache for the bitmap tiles! There is a smaller
problem here too: when a selection is used, it is overkill to have a
cache for the full width of the drawable.
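
To make both problems concrete, here is a rough sketch of how a cache
size could be derived from the selection bounds instead of the full
drawable width. It is not the attached patch. The helper names are
hypothetical, and the choice of three buffers (source, destination,
bitmap) is only my reading of the paragraph above.

```c
#include <assert.h>

/*
 * Number of tile columns overlapped by the half-open pixel
 * range [x1, x2).
 */
int tiles_spanned(int x1, int x2, int tile_width)
{
    assert(tile_width > 0 && x1 >= 0 && x2 > x1);
    return (x2 - 1) / tile_width - x1 / tile_width + 1;
}

/*
 * Hypothetical cache size: one row of tiles, as wide as the
 * selection, for each buffer that is accessed row by row.
 * buffers = 2 models source + destination only;
 * buffers = 3 additionally leaves room for the bitmap tiles.
 */
int cache_tiles_needed(int x1, int x2, int tile_width, int buffers)
{
    assert(buffers > 0);
    return buffers * tiles_spanned(x1, x2, tile_width);
}
```

For a drawable 1000 pixels wide with 64-pixel tiles and no selection,
two buffers need 32 tiles and three buffers need 48. For a selection
covering x = 100 up to (but not including) 300, three buffers need only
12 tiles. In a plugin, a value like this would be what gets passed to
gimp_tile_cache_ntiles ().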
I've attached a patch with my modifications. I hope that someone
examines them critically and incorporates them into the distribution.
I prefer my approach because it should give better performance and it
keeps the code cleaner.
Other improvements are still possible. I expect that it should be
possible to rewrite the algorithm such that the tile cache contains only
3 tiles. From what I see, the algorithm is the same in the horizontal
and vertical directions. The current implementation uses 3 extra buffer
rows, so if we add 3 extra buffer columns it should be possible to rewrite
the algorithm so that it processes one tile at a time instead of a full
Thanks for pointing out a pretty big performance problem with the