Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision

2012-07-06 Thread Siarhei Siamashka
On Thu, Jul 5, 2012 at 10:22 AM, Siarhei Siamashka
siarhei.siamas...@gmail.com wrote:
 Separable scaling is a good idea, but it is not a silver bullet.
 Downscaling is still a valid use case, and separable scaling would
 provide no reduction in the number of arithmetic operations for it.
 Also x86 SSSE3 and ARM NEON add some extra challenges:
 * Using 8-bit multiplications for horizontal interpolation is
 difficult as the weight factors need to be updated for each pixel.
 Single pass scaling can easily use 8-bit multiplications for vertical
 interpolation as the weight factors are pre-calculated before
 entering the loop.
 * Separable scaling needs extra load/store instructions to save
 temporary data between passes.
 * When we are approaching the memory speed barrier, the separation of
 operations into passes may result in uneven usage of the memory subsystem.

About this last part and the uneven memory bandwidth usage: the
firefox-fishtank trace can be used as a good example. This trace uses
the bilinear over_8888_8_8888 operation, which now has a special
single-pass fast path for ARM NEON. But we can also
decompose it into bilinear src_8888_8888 + unscaled over_8888_8_8888
and get the same result in two passes with a tweaked variant of
http://hg.mozilla.org/mozilla-central/rev/87dbb95cde7d
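
For illustration, here is a rough sketch of this kind of 2-pass
decomposition written against the public pixman API (the function and
its parameters are made up for this example; a real implementation
would process the image scanline by scanline so that the temporary
buffer stays in the L1 cache, instead of scaling the whole area at
once as done here):

#include <pixman.h>

/* pass 1: bilinear src_8888_8888 into a temporary buffer,
 * pass 2: unscaled over_8888_8_8888 from that buffer to the target.
 * scale_x/scale_y are the destination-to-source scale factors. */
static void
scaled_over_two_pass (pixman_image_t *src, pixman_image_t *mask,
                      pixman_image_t *dst, int dst_w, int dst_h,
                      double scale_x, double scale_y)
{
    pixman_transform_t t;
    pixman_image_t *tmp;

    /* temporary a8r8g8b8 image; pixman allocates the bits for us */
    tmp = pixman_image_create_bits (PIXMAN_a8r8g8b8, dst_w, dst_h, NULL, 0);

    pixman_transform_init_scale (&t,
                                 pixman_double_to_fixed (scale_x),
                                 pixman_double_to_fixed (scale_y));
    pixman_image_set_transform (src, &t);
    pixman_image_set_filter (src, PIXMAN_FILTER_BILINEAR, NULL, 0);

    /* pass 1: scaling only, no mask */
    pixman_image_composite32 (PIXMAN_OP_SRC, src, NULL, tmp,
                              0, 0, 0, 0, 0, 0, dst_w, dst_h);

    /* pass 2: unscaled compositing with the a8 mask */
    pixman_image_composite32 (PIXMAN_OP_OVER, tmp, mask, dst,
                              0, 0, 0, 0, 0, 0, dst_w, dst_h);

    pixman_image_unref (tmp);
}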

=== ARM Cortex-A8 1GHz (DM3730), no monitor connected ===

lowlevel-blt-bench (the numbers are MPix/s):
  unscaled src_8888_8888            =  L1: 648.29  L2: 371.80  M: 127.12
  bilinear src_8888_8888            =  L1: 102.88  L2:  91.11  M:  80.57
  unscaled over_8888_8_8888         =  L1: 167.87  L2: 157.61  M:  59.70

  1-pass bilinear over_8888_8_8888  =  L1:  55.27  L2:  50.00  M:  42.42
  2-pass bilinear over_8888_8_8888  =  L1:  63.50  L2:  58.10  M:  35.33

cairo-perf-trace (the numbers are seconds):
  1-pass:  image  firefox-fishtank  343.751  344.139  0.08%  6/6
  2-pass:  image  firefox-fishtank  362.394  364.273  0.31%  6/6

=== ARM Cortex-A8 1GHz (DM3730), 1280x1024-32@57Hz monitor ===

lowlevel-blt-bench (the numbers are MPix/s):
  unscaled src_8888_8888            =  L1: 650.22  L2: 240.76  M:  84.80
  bilinear src_8888_8888            =  L1: 102.11  L2:  90.44  M:  70.74
  unscaled over_8888_8_8888         =  L1: 168.50  L2: 154.84  M:  46.92

  1-pass bilinear over_8888_8_8888  =  L1:  55.17  L2:  49.95  M:  37.61
  2-pass bilinear over_8888_8_8888  =  L1:  63.50  L2:  58.51  M:  31.35

cairo-perf-trace (the numbers are seconds):
  1-pass:  image  firefox-fishtank  404.228  405.542  0.16%  6/6
  2-pass:  image  firefox-fishtank  420.228  423.530  0.62%  6/6

==

The difference between no monitor and 1280x1024-32@57Hz is that in
the latter case, framebuffer scanout sending the picture 57 times per
second over DVI is draining some of the precious memory bandwidth.

As can be seen from the lowlevel-blt-bench numbers, 2-pass processing
is faster in terms of CPU cycles (~15% speedup) when all the data is in
the L1/L2 cache. Even though we need to save/reload data from the temporary
buffer, there are other performance benefits of splitting the
processing into two parts: we have more flexibility in selecting the level
of unrolling for each part separately, and easier register allocation.
However, when the working set exceeds the cache sizes (the M: number in
lowlevel-blt-bench), we see a ~20% performance loss.

Let's look at the cairo-perf-trace results as a more realistic
benchmark next. The 2-pass processing is also a performance loss
there (but just ~4-5%).

The root cause of the problem in this particular case is that
bilinear src_8888_8888 (reading from RAM, writing to a
temporary buffer in the L1 cache) is heavy on computations, while
unscaled over_8888_8_8888 is memory heavy (reading the scaled
source pixels from the L1 cache, but the mask and destination are in RAM).

If we were using the pixman general pipeline for this operation (with a
SIMD-optimized bilinear fetcher), we would also have mask expansion
from 8-bit to 32-bit happening in a temporary buffer, which both
adds extra computations and makes the memory access pattern worse. But
I have no performance data to support this claim yet.

That said, caring about these implementation details can only give
us a few extra percent of performance, while hitting a slow path is
always a big performance disaster. Generalizing bilinear scaling
support so that there are no slow paths is more important.

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision

2012-07-05 Thread Siarhei Siamashka
On Sun, Jul 1, 2012 at 12:25 AM, Søren Sandmann sandm...@cs.au.dk
wrote:
 Siarhei Siamashka siarhei.siamas...@gmail.com writes:

 On Tue, Jun 26, 2012 at 6:13 AM, Jeff Muizelaar
 jmuizel...@mozilla.com wrote:
 We recently switched to using the same precision as Skia on
 Android. (4 bits for RGBA32 and 2 bits for RGB565)

 This is a good point. If using just 4 or even 2 bits of interpolation
 precision is acceptable for Skia, then maybe the current bilinear
 interpolation precision is really excessive in pixman. It would be
 too generous to give a handicap to Skia :)

 So should we also try 4-bit interpolation? Compared to 7-bit or 8-bit
 interpolation, it has an advantage that all the calculations need
 only 16-bit unsigned variables. This is faster for at least C, x86
 SSE2 and ARM NEON code.

 Four bits is also the precision that GdkPixbuf has used for many years
 with no complaints from users that I'm aware of. If you zoom more than
 16x with GdkPixbuf, banding artefacts definitely start showing up, but
 zooming that far is unlikely to look good with bilinear filtering
 anyway. So dropping to four bits doesn't sound too bad to me.

OK, we'll see how much performance can be gained by going to lower
precision. There is only one way to find out: implement different variants
and benchmark them against each other.

 That said, I'm a little concerned that nobody is trying separable
 scaling 

Well, you are trying. So this does not qualify as nobody ;)

 instead of worrying about these microoptimizations.

It's not instead, but in addition to. All these optimizations are
independent of each other and can be used together with some
minor adaptation.

Separable scaling is a good idea, but it is not a silver bullet.
Downscaling is still a valid use case, and separable scaling would
provide no reduction in the number of arithmetic operations for it.
Also x86 SSSE3 and ARM NEON add some extra challenges:
* Using 8-bit multiplications for horizontal interpolation is difficult
as the weight factors need to be updated for each pixel. Single pass
scaling can easily use 8-bit multiplications for vertical interpolation
as the weight factors are pre-calculated before entering the loop
(see the sketch after this list).
* Separable scaling needs extra load/store instructions to save
temporary data between passes.
* When we are approaching the memory speed barrier, the separation of
operations into passes may result in uneven usage of the memory subsystem.
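
To illustrate the point made in the first item above, here is a small
sketch with NEON intrinsics (not pixman's actual assembly): the vertical
weights are constant for a whole scanline, so they can be broadcast once
before the loop and the interpolation done with plain 8-bit multiplies,
while the horizontal weights would have to be recomputed per pixel.

#include <arm_neon.h>
#include <stdint.h>

/* one vertical bilinear step for 8 bytes (e.g. two a8r8g8b8 pixels);
 * with 7-bit precision the weights are in [0, 128], so unsigned 8-bit
 * multiplies never overflow the 16-bit accumulator */
static inline uint16x8_t
vertical_interp (uint8x8_t top, uint8x8_t bottom,
                 uint8x8_t wtop, uint8x8_t wbot)
{
    uint16x8_t acc = vmull_u8 (top, wtop);   /* top * wt       */
    return vmlal_u8 (acc, bottom, wbot);     /* + bottom * wb  */
}

/* hoisted out of the scanline loop, once per destination line:
 *   uint8x8_t wtop = vdup_n_u8 (128 - w), wbot = vdup_n_u8 (w);  */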

Still, for example on ARMv6 without real SIMD, it seems to be difficult
to implement both vertical and horizontal bilinear interpolation in one
pass. There are just not enough general-purpose registers to
sufficiently unroll the loop to get rid of pipeline stalls and do this
without spilling temporary data to memory or reloading constants. So
support for separable scaling is a very welcome feature for these
microoptimizations.

There is no need to put all eggs into one basket. Having multiple
scaling methods available should not be a problem, each tuned for
different use cases and different target architectures.

 As far as I know Skia does this as well, as does pretty much everybody
 who cares about fast scaling.

 That is, the two closest source lines are scaled horizontally, the
 result is cached, and then the intermediate destination lines can be
 generated by just interpolating vertically. This cuts down the amount
 of arithmetic required quite substantially, especially for scale
 factors larger than 2.

 Here is an example.

   http://cgit.freedesktop.org/~sandmann/pixman/log/?h=separable-bilinear

 Performance results for fishtank with SSE2 and MMX disabled:

   Before:
   [  0] image firefox-fishtank  603.804  603.804   0.00%

   After:
   [  0] image firefox-fishtank  535.321  535.321   0.00%

 And this is with fishtank, which consists exclusively of downscalings,
 so each line is reused at most once.

Looks good to me.

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision

2012-07-05 Thread Siarhei Siamashka
On Thu, 05 Jul 2012 12:25:55 -0700
Bill Spitzak spit...@gmail.com wrote:

 On 07/05/2012 12:22 AM, Siarhei Siamashka wrote:
 
  Separable scaling is good idea, but it is not a silver bullet.
  Downscaling is still a valid use case, and separable scaling would
  provide no reduction for the number of arithmetic operations for it.
 
 This is not true. In fact 2-pass scaling is a much *bigger* win for 
 downscaling if you use filters of size greater than 1. And filters 
 larger than 1 are absolutely required to fix the artifacts that are
 in current downscaling.

Looks like you got it backwards. Let's do some calculations. For the
sake of simplicity, let's assume that an MxM image is scaled to NxN size.

In the case of single pass bilinear scaling, we fetch a 2x2 pixel block
from the source image for each destination pixel (total N * N times) and
perform 1 vertical interpolation followed by 1 horizontal
interpolation for it. This is N * N + N * N interpolation operations
in total (let's ignore the fact that horizontal interpolation is a
little bit more expensive).

In the case of 2-pass scaling, typically all the pixels from the
source image are participating in calculations (unless M >= 2 * N,
which is 2x+ downscaling). So the number of horizontal interpolation
operations is M * N (each source scanline gets horizontally scaled
to size N and cached). The number of vertical interpolation
operations is N * N (we are doing it for every destination pixel
on the second pass). So the total number of interpolation operations
is M * N + N * N for 2-pass bilinear scaling.

Now we only need to compare N * N + N * N with M * N + N * N. In
the case of downscaling (N < M), 2-pass scaling needs more arithmetic
operations. Now let's remember that horizontal interpolation is more
expensive (horizontal weights need to be recalculated every time) and
storing/reloading data between passes adds more overhead, which makes
the outcome even worse for 2-pass scaling.

 Single-pass bilinear is *not* a big loss for upscaling, because as
 you noticed there is overhead of the intermediate values. It is
 pretty much equivalent to stretching the image in one direction, and
 then in the other. You need to store this intermediate stretched
 image, which is bigger than the original.

Increasing the scale factor for upscaling is going to eventually let the
2-pass algorithm outperform its single pass counterpart. There must
be a crossover point somewhere, and it can be used to make a decision
which algorithm to select for better performance in each particular
case.

 What I propose is that an integer down-scale be selected and used to 
 box-filter which can be computed extremely fast (there are no varying 
 weights). This resulting image, which will always be smaller, and
 could be cached, can then be bilinear interpolated using the current
 code.

That's a totally different topic (image quality for downscaling), which
has already been discussed way too much.

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision

2012-07-05 Thread Siarhei Siamashka
On Fri, 6 Jul 2012 00:08:46 +0300
Siarhei Siamashka siarhei.siamas...@gmail.com wrote:

 On Thu, 05 Jul 2012 12:25:55 -0700
 Bill Spitzak spit...@gmail.com wrote:
 
  On 07/05/2012 12:22 AM, Siarhei Siamashka wrote:
  
   Separable scaling is good idea, but it is not a silver bullet.
   Downscaling is still a valid use case, and separable scaling would
   provide no reduction for the number of arithmetic operations for
   it.
  
  This is not true. In fact 2-pass scaling is a much *bigger* win for 
  downscaling if you use filters of size greater than 1. And filters 
  larger than 1 are absolutely required to fix the artifacts that are
  in current downscaling.
 
 Looks like you got it backwards. Let's do some calculations. For the
 sake of simplicity, let's assume that MxM image is scaled to NxN size.
 
 In the case of single pass bilinear scaling, we fetch 2x2 pixel block
 from the source image for each destination pixel (total N * N times)
 and perform 1 vertical interpolation followed by 1 horizontal
 interpolation for it. This is total N * N + N * N interpolation
 operations (let's ignore the fact that horizontal interpolation is a
 little bit more expensive).

Actually I stand corrected. If we are fetching a 2x2 pixel block, then
we need to interpolate 2 pairs of pixels first in one direction,
and then do one more interpolation in the other direction. So it is
N * N + N * N + N * N interpolation operations in plain C.
My mistake was relying too much on SIMD. And at least
for 128-bit wide SIMD, it does not matter whether we are
interpolating one pair of pixels or two pairs at once.

 In the case of 2-pass scaling, typically all the pixels from the
 source image are participating in calculations (unless M >= 2 * N,
 which is 2x+ downscaling). So the number of horizontal interpolation
 operations is M * N (each source scanline gets horizontally scaled
 to size N and cached). The number of vertical interpolation
 operations is N * N (we are doing it for every destination pixel
 on the second pass). So the total number of interpolation operations
 is M * N + N * N for 2-pass bilinear scaling.
 
 Now we only need to compare N * N + N * N with M * N + N * N.

In fact the comparison turns into 3 * N * N vs. (M + N) * N
operations, which is clearly more favourable for 2-pass processing
in plain C for a wide range of scaling factors.
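
To put some concrete numbers on it (using the formulas above): for a 2x
upscale from 500x500 to 1000x1000 (M = 500, N = 1000) this is
3 * 1000 * 1000 = 3,000,000 interpolations for single pass versus
(500 + 1000) * 1000 = 1,500,000 for 2-pass, while for a 2x downscale
(M = 1000, N = 500) the two are equal at 750,000 each.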

 In the case of downscaling (N < M), 2-pass scaling needs more arithmetic
 operations. Now let's remember that horizontal interpolation is more
 expensive (horizontal weights need to be recalculated every time) and
 storing/reloading data between passes adds more overhead, which makes
 the outcome even worse for 2-pass scaling.
 
  Single-pass bilinear is *not* a big loss for upscaling, because as
  you noticed there is overhead of the intermediate values. It is
  pretty much equivalent to stretching the image in one direction, and
  then in the other. You need to store this intermediate stretched
  image, which is bigger than the original.
 
 Increasing the scale factor for upscaling is going to eventually let
 the 2-pass algorithm outperform its single pass counterpart. There
 must be a crossover point somewhere, and it can be used to make a
 decision which algorithm to select for better performance in each
 particular case.

Still the underlying instruction set does matter a lot. We can't simply
ignore the existence of wide SIMD instruction sets.

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision

2012-07-05 Thread Søren Sandmann
Siarhei Siamashka siarhei.siamas...@gmail.com writes:

 In the case of 2-pass scaling, typically all the pixels from the
 source image are participating in calculations (unless M >= 2 * N,
 which is 2x+ downscaling). So the number of horizontal interpolation
 operations is M * N (each source scanline gets horizontally scaled
 to size N and cached). The number of vertical interpolation
 operations is N * N (we are doing it for every destination pixel
 on the second pass). So the total number of interpolation operations
 is M * N + N * N for 2-pass bilinear scaling.
 
 Now we only need to compare N * N + N * N with M * N + N * N.

 In fact the comparison turns into 3 * N * N vs. (M + N) * N
 operations, which is clearly more favourable for 2-pass processing
 in plain C for a wide range of scaling factors.

There may be some confusion here regarding 1-pass and 2-pass. The
algorithm I'm advocating makes only one pass, but keeps a cache of two
horizontally expanded scanlines. For big downscalings, this does at most
2 * N horizontal interpolations per destination scanline, since in the
worst case processing one destination scanline will require expanding
two source scanlines. So the
real comparison is something like

   3 * N * N    vs.    MIN (2 * N, M) * N + N * N

which is never worse than the plain single-pass algorithm in terms of
arithmetic, though I agree it has more overhead in other places.

A true 2-pass algorithm that first expands _all_ scanlines horizontally
would indeed always generate M * N horizontal interpolations, but I
don't think such an approach is a good idea (at least not for a simple
bilinear filter).
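
For reference, here is a simplified plain C sketch of the cached-scanline
approach described above (one 8-bit channel, a simple corner-to-corner
coordinate mapping, 8-bit weights; the helper names are made up and this
is not the code from the separable-bilinear branch):

#include <stdint.h>
#include <stdlib.h>

/* horizontally expand one source line to dst_w bilinear samples,
 * keeping 8 extra fraction bits for the vertical pass */
static void
expand_line (const uint8_t *s, int src_w, uint16_t *out, int dst_w)
{
    for (int dx = 0; dx < dst_w; dx++)
    {
        int xpos = dst_w > 1 ? dx * (src_w - 1) * 256 / (dst_w - 1) : 0;
        int x0 = xpos >> 8, wx = xpos & 0xff;
        int x1 = x0 + 1 < src_w ? x0 + 1 : x0;

        out[dx] = s[x0] * (256 - wx) + s[x1] * wx;
    }
}

void
scale_bilinear_cached (const uint8_t *src, int src_w, int src_h, int src_stride,
                       uint8_t *dst, int dst_w, int dst_h, int dst_stride)
{
    uint16_t *line0 = malloc (dst_w * sizeof (uint16_t));
    uint16_t *line1 = malloc (dst_w * sizeof (uint16_t));
    int cached0 = -1, cached1 = -1;

    for (int dy = 0; dy < dst_h; dy++)
    {
        int ypos = dst_h > 1 ? dy * (src_h - 1) * 256 / (dst_h - 1) : 0;
        int y0 = ypos >> 8, wy = ypos & 0xff;
        int y1 = y0 + 1 < src_h ? y0 + 1 : y0;

        /* stepping down one source line: reuse the expanded line */
        if (y0 == cached1 && y0 != cached0)
        {
            uint16_t *t = line0; line0 = line1; line1 = t;
            cached0 = cached1; cached1 = -1;
        }

        /* at most two expansions per destination line, which is where
         * the MIN (2 * N, M) * N bound for the horizontal work comes from */
        if (cached0 != y0)
        {
            expand_line (src + y0 * src_stride, src_w, line0, dst_w);
            cached0 = y0;
        }
        if (cached1 != y1)
        {
            expand_line (src + y1 * src_stride, src_w, line1, dst_w);
            cached1 = y1;
        }

        /* vertical interpolation only: N operations per destination line */
        for (int dx = 0; dx < dst_w; dx++)
            dst[dy * dst_stride + dx] =
                (line0[dx] * (256 - wy) + line1[dx] * wy) >> 16;
    }

    free (line0);
    free (line1);
}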

  Single-pass bilinear is *not* a big loss for upscaling, because as
  you noticed there is overhead of the intermediate values. It is
  pretty much equivalent to stretching the image in one direction, and
  then in the other. You need to store this intermediate stretched
  image, which is bigger than the original.
 
 Increasing the scale factor for upscaling is going to eventually let
 the 2-pass algorithm outperform its single pass counterpart. There
 must be a crossover point somewhere, and it can be used to make a
 decision which algorithm to select for better performance in each
 particular case.

 Still the underlying instruction set does matter a lot. We can't simply
 ignore the existence of wide SIMD instruction sets.

There is no reason the horizontal interpolations couldn't be done with
wide SIMD as well. You would keep two fixed point coordinates around in
a register and then compute the left/right weights based on them in a
format suitable for pmaddubsw.

(The issue with 128/0 weights is actually not as bad as I thought. When
the weights are 128 and 0, pmaddubsw will treat 128 as -128 and compute

-128 * l + 0 * r

which produces a negative number. Since this is the *only* negative
number that can be produced, and what we really want is a multiplication
with 128, a simple pabsw (also available in SSSE3) will fix up the
result).
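
A minimal sketch of what that could look like with SSSE3 intrinsics (the
interleaved byte layout used here is just an assumption for the example,
not how pixman actually lays out its data):

#include <tmmintrin.h>   /* SSSE3: _mm_maddubs_epi16, _mm_abs_epi16 */

/* Horizontal interpolation of 8 channel pairs with 7-bit weights.
 * 'lr' holds (left, right) bytes interleaved per channel and 'w' holds
 * the matching (128 - weight, weight) byte pairs. */
static __m128i
hinterp_ssse3 (__m128i lr, __m128i w)
{
    /* unsigned pixel bytes times signed weight bytes, pairwise added:
     * t = l * (128 - w) + r * w per channel, as 16-bit values */
    __m128i t = _mm_maddubs_epi16 (lr, w);

    /* a weight of 128 is read as -128 and yields the only possible
     * negative result; pabsw turns it back into 128 * pixel */
    t = _mm_abs_epi16 (t);

    /* divide by 128 */
    return _mm_srli_epi16 (t, 7);
}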


Søren


Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision

2012-06-30 Thread Siarhei Siamashka
On Tue, Jun 26, 2012 at 6:13 AM, Jeff Muizelaar jmuizel...@mozilla.com wrote:

 On 2012-06-25, at 7:44 PM, Siarhei Siamashka wrote:

 These are the test patches for switching to 7-bit bilinear
 interpolation precision. The first patch makes bilinear precision
 configurable. The second patch tweaks SSE2 bilinear scaler for better
 performance using PMADDWD instruction. Both should be applied after:
    http://lists.freedesktop.org/archives/pixman/2012-June/002074.html

 We recently switched to using the same precision as Skia on Android. (4 bits 
 for RGBA32 and 2 bits for RGB565)

This is a good point. If using just 4 or even 2 bits of interpolation
precision is acceptable for Skia, then maybe the current bilinear
interpolation precision is really excessive in pixman. It would be too
generous to give a handicap to Skia :)

So should we also try 4-bit interpolation? Compared to 7-bit or 8-bit
interpolation, it has an advantage that all the calculations need only
16-bit unsigned variables. This is faster for at least C, x86 SSE2 and
ARM NEON code.

Reducing interpolation precision is simple, because the existing code
can still work correctly with lower precision, assuming that it
uses the BILINEAR_INTERPOLATION_BITS constant for shifts and bitmasks.
But going back is difficult after the code is optimized to take
advantage of lower precision for more speed.
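
As an illustration of what being parameterized by
BILINEAR_INTERPOLATION_BITS means (the RANGE macro and the helper below
are written for this example rather than copied from pixman):

#include <stdint.h>

#define BILINEAR_INTERPOLATION_BITS  7
#define BILINEAR_INTERPOLATION_RANGE (1 << BILINEAR_INTERPOLATION_BITS)

/* one channel of a bilinear fetch; wx, wy are in [0, RANGE] */
static inline uint32_t
bilinear_channel (uint32_t tl, uint32_t tr, uint32_t bl, uint32_t br,
                  uint32_t wx, uint32_t wy)
{
    uint32_t top    = tl * (BILINEAR_INTERPOLATION_RANGE - wx) + tr * wx;
    uint32_t bottom = bl * (BILINEAR_INTERPOLATION_RANGE - wx) + br * wx;

    /* the sum before the shift is at most 255 * RANGE * RANGE, so it
     * fits in an unsigned 16-bit value only for 4 bits or less, which
     * is the point above about 16-bit arithmetic */
    return (top * (BILINEAR_INTERPOLATION_RANGE - wy) + bottom * wy)
               >> (2 * BILINEAR_INTERPOLATION_BITS);
}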

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision

2012-06-30 Thread Søren Sandmann
Siarhei Siamashka siarhei.siamas...@gmail.com writes:

 On Tue, Jun 26, 2012 at 6:13 AM, Jeff Muizelaar jmuizel...@mozilla.com 
 wrote:

 On 2012-06-25, at 7:44 PM, Siarhei Siamashka wrote:

 These are the test patches for switching to 7-bit bilinear
 interpolation precision. The first patch makes bilinear precision
 configurable. The second patch tweaks SSE2 bilinear scaler for better
 performance using PMADDWD instruction. Both should be applied after:
    http://lists.freedesktop.org/archives/pixman/2012-June/002074.html

 We recently switched to using the same precision as Skia on Android. (4 bits 
 for RGBA32 and 2 bits for RGB565)

 This is a good point. If using just 4 or even 2 bits of interpolation
 precision is acceptable for Skia, then maybe the current bilinear
 interpolation precision is really excessive in pixman. It would be too
 generous to give a handicap to Skia :)

 So should we also try 4-bit interpolation? Compared to 7-bit or 8-bit
 interpolation, it has an advantage that all the calculations need only
 16-bit unsigned variables. This is faster for at least C, x86 SSE2 and
 ARM NEON code.

Four bits is also the precision that GdkPixbuf has used for many years
with no complaints from users that I'm aware of. If you zoom more than
16x with GdkPixbuf, banding artefacts definitely start showing up, but
zooming that far is unlikely to look good with bilinear filtering
anyway. So dropping to four bits doesn't sound too bad to me.

That said, I'm a little concerned that nobody is trying separable
scaling instead of worrying about these microoptimizations. As far as I
know Skia does this as well, as does pretty much everybody who cares
about fast scaling.

That is, the two closest source lines are scaled horizontally, the
result is cached, and then the intermediate destination lines can be
generated by just interpolating vertically. This cuts down the amount of
arithmetic required quite substantially, especially for scale factors
larger than 2.

Here is an example.

  http://cgit.freedesktop.org/~sandmann/pixman/log/?h=separable-bilinear

Performance results for fishtank with SSE2 and MMX disabled:

  Before:
  [  0] image firefox-fishtank  603.804  603.804   0.00%

  After:
  [  0] image firefox-fishtank  535.321  535.321   0.00%

And this is with fishtank, which consists exclusively of downscalings,
so each line is reused at most once.


Søren


Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision

2012-06-27 Thread Søren Sandmann
Siarhei Siamashka siarhei.siamas...@gmail.com writes:

 There is also one thing to investigate. It was likely also a problem
 earlier, but may become worse when going to lower interpolation
 precision. Right now the position between pixels is rounded down by
 throwing away lowest bits when converting it to bilinear weight. It
 may potentially cause image drift to the left/upper side if the image
 is scaled up/down repeatedly really a lot. Maybe pixman_fixed_1 / 2
 constant needs to be adjusted here (and in the other places too) to
 ensure more correct rounding:
 
 http://cgit.freedesktop.org/pixman/tree/pixman/pixman-inlines.h?id=pixman-0.26.0#n811
 But I guess it's better to write a test program to check whether such
 effect exists.

The pixman_fixed_1 / 2 constant actually has to do with coordinate
systems. If pixels are considered squares, then pixman's coordinate
system is such that (0.5, 0.5) is the center of the top-left pixel, i.e.,
the asterisks below have coordinates at the half-integers.

(0, 0)
   +---+---+--
   | * | * |
   +---+---+--
   | * | * |
   +---+---+--
   |   |

When compositing, pixman is given a rectangle (dst_x, dst_y, width,
height) with integer coordinates. To compute the result for the top left
pixel, in principle, pixman should do this:

   1. Add 0.5, 0.5 to dst_x, dst_y to get the coordinate of the first
      pixel
   2. Transform those coordinates to get a source location
   3. If a nearest filter is used, round to the closest half-integer
      (rounding down when the two candidates are equally close)
   4. Subtract 0.5, 0.5 to get x and y integers that can be used to
      compute an index into the bits[] array.

For the non-transformed blitters, all of this disappears and becomes
just an offset that can be computed upfront. But when there is a
transformation, the full computation must be carried out. The
pixman_fixed_1/2 constant used all over the place is just the 0.5 in
step 1.

For the nearest filter, step 3 and step 4 can be simplified to simply
subtracting pixman_fixed_e (to ensure we round down), and then
truncating.
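
Spelled out with the public fixed-point helpers, the whole nearest-filter
path looks roughly like this (a sketch, not the actual fast-path code; the
transform maps destination coordinates to source coordinates):

#include <pixman.h>

static void
nearest_source_location (const pixman_transform_t *transform,
                         int dst_x, int dst_y, int *src_x, int *src_y)
{
    pixman_vector_t v;

    /* step 1: the center of the destination pixel */
    v.vector[0] = pixman_int_to_fixed (dst_x) + pixman_fixed_1 / 2;
    v.vector[1] = pixman_int_to_fixed (dst_y) + pixman_fixed_1 / 2;
    v.vector[2] = pixman_fixed_1;

    /* step 2: transform into source space */
    pixman_transform_point_3d (transform, &v);

    /* steps 3 and 4: subtract pixman_fixed_e and truncate */
    *src_x = pixman_fixed_to_int (v.vector[0] - pixman_fixed_e);
    *src_y = pixman_fixed_to_int (v.vector[1] - pixman_fixed_e);
}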

For bilinear filtering, the question of rounding down is mostly
irrelevant because we need both the pixel below and the one above. In
this case steps 3 and 4 become simply a truncation.

The question of rounding the bilinear interpolation location is
separate. It's true that the truncation of the fraction can lead to
shifts of a 128th of a pixel, and maybe this should be fixed.

A simple solution would be to just add 0x80 or 0x40 before shifting, but
it might be possible to do that only once per scanline, or even once per
composite operation for scalings. I doubt that it can be done easily by
modifying the pixman_fixed_1/2 constant though.
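
The per-pixel version of that idea would look something like this (just a
sketch of the rounding itself; as noted above, a real fix would try to
fold the offset into the per-scanline or per-composite setup):

#include <stdint.h>

#define WEIGHT_BITS  7                   /* BILINEAR_INTERPOLATION_BITS      */
#define WEIGHT_SHIFT (16 - WEIGHT_BITS)  /* pixman_fixed_t: 16 fraction bits */

/* split a 16.16 coordinate into the left/top pixel index and a 7-bit
 * weight; with 'round' set, the coordinate is rounded to the nearest
 * representable interpolation position instead of truncated */
static inline void
split_coordinate (int32_t x, int round, int *pixel, int *weight)
{
    if (round)
        x += 1 << (WEIGHT_SHIFT - 1);    /* half of the discarded bits */

    *pixel  = x >> 16;
    *weight = (x >> WEIGHT_SHIFT) & ((1 << WEIGHT_BITS) - 1);
}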


Søren


Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision

2012-06-27 Thread Søren Sandmann
Søren Sandmann sandm...@cs.au.dk writes:

 For bilinear filtering, the question of rounding down is mostly
 irrelevant because we need both the pixel below and above. In this case
 step 3 and 4 become simply a truncation.

This is not true actually. To get the lower pixel, 0.5 is subtracted,
and then the position is truncated. But the question of rounding down or
up is irrelevant since when the location is exactly on top of a pixel,
the two cases both end up with that pixel having a weight of 1 and the
other (whether below or above) a weight of 0.


Soren


Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision

2012-06-26 Thread Siarhei Siamashka
On Tue, Jun 26, 2012 at 6:04 PM, Søren Sandmann sandm...@cs.au.dk wrote:
 Siarhei Siamashka siarhei.siamas...@gmail.com writes:

 Also 7-bit bilinear interpolation precision still allows a nice SSSE3
 optimization, which can be implemented later to get even more speed.

 Presumably you mean pmaddubsw,

Yes

 but isn't it a problem that the weights
 can go from 0 to 128 inclusive? That is, if one weight is 0, the other
 is 128, which means it doesn't actually fit in seven bits. So unless I
 am missing something, pmaddubsw can't be used without a test for 0.

For 8-bit precision we had the same problem with weights going from 0
to 256 inclusive and it is already handled here:

http://cgit.freedesktop.org/pixman/tree/pixman/pixman-inlines.h?id=pixman-0.26.0#n881

 But the patches need a lot more testing on different hardware.

 Running the test suite on ppc64, ARM, x86-64 with various
 implementations disabled revealed no problems for me.

Thanks. I'll still try to run some more tests and review the code
again to check for any possible leftover magic constants hardcoding 8
bits of bilinear interpolation precision.

There is also one thing to investigate. It was likely also a problem
earlier, but may become worse when going to lower interpolation
precision. Right now the position between pixels is rounded down by
throwing away the lowest bits when converting it to a bilinear weight.
This may potentially cause the image to drift towards the left/upper
side if it is scaled up/down repeatedly many times. Maybe the
pixman_fixed_1 / 2 constant needs to be adjusted here (and in the other
places too) to ensure more correct rounding:

http://cgit.freedesktop.org/pixman/tree/pixman/pixman-inlines.h?id=pixman-0.26.0#n811
But I guess it's better to write a test program to check whether such an
effect exists.

-- 
Best regards,
Siarhei Siamashka