Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision
On Thu, Jul 5, 2012 at 10:22 AM, Siarhei Siamashka siarhei.siamas...@gmail.com wrote:

> Separable scaling is a good idea, but it is not a silver bullet.
> Downscaling is still a valid use case, and separable scaling would
> provide no reduction in the number of arithmetic operations for it.
> Also, x86 SSSE3 and ARM NEON add some extra challenges:
> * Using 8-bit multiplications for horizontal interpolation is
>   difficult, as the weight factors need to be updated for each pixel.
>   Single-pass scaling can easily use 8-bit multiplications for
>   vertical interpolation, as the weight factors are pre-calculated
>   before entering the loop.
> * Separable scaling needs extra load/store instructions to save
>   temporary data between passes.
> * When we are approaching the memory speed barrier, the separation of
>   operations into passes may result in uneven usage of the memory
>   subsystem.

About this last part and the uneven memory bandwidth usage: the
firefox-fishtank trace can be used as a good example. This trace uses
the bilinear over_8888_8_8888 operation, which currently has a special
single-pass fast path for ARM NEON.
But we can also decompose it into bilinear src_8888_8888 + unscaled
over_8888_8_8888 and get the same result in two passes with a tweaked
variant of http://hg.mozilla.org/mozilla-central/rev/87dbb95cde7d

=== ARM Cortex-A8 1GHz (DM3730), no monitor connected ===

lowlevel-blt-bench (the numbers are MPix/s):

    unscaled src_8888_8888           = L1: 648.29  L2: 371.80  M: 127.12
    bilinear src_8888_8888           = L1: 102.88  L2:  91.11  M:  80.57
    unscaled over_8888_8_8888        = L1: 167.87  L2: 157.61  M:  59.70
    1-pass bilinear over_8888_8_8888 = L1:  55.27  L2:  50.00  M:  42.42
    2-pass bilinear over_8888_8_8888 = L1:  63.50  L2:  58.10  M:  35.33

cairo-perf-trace (the numbers are seconds):

    1-pass: image firefox-fishtank  343.751  344.139  0.08%  6/6
    2-pass: image firefox-fishtank  362.394  364.273  0.31%  6/6

=== ARM Cortex-A8 1GHz (DM3730), 1280x1024-32@57Hz monitor ===

lowlevel-blt-bench (the numbers are MPix/s):

    unscaled src_8888_8888           = L1: 650.22  L2: 240.76  M:  84.80
    bilinear src_8888_8888           = L1: 102.11  L2:  90.44  M:  70.74
    unscaled over_8888_8_8888        = L1: 168.50  L2: 154.84  M:  46.92
    1-pass bilinear over_8888_8_8888 = L1:  55.17  L2:  49.95  M:  37.61
    2-pass bilinear over_8888_8_8888 = L1:  63.50  L2:  58.51  M:  31.35

cairo-perf-trace (the numbers are seconds):

    1-pass: image firefox-fishtank  404.228  405.542  0.16%  6/6
    2-pass: image firefox-fishtank  420.228  423.530  0.62%  6/6

==

The difference between no monitor and 1280x1024-32@57Hz is that in the
latter case, framebuffer scanout sending the picture 57 times per
second over DVI is draining some of the precious memory bandwidth.

As can be seen from the lowlevel-blt-bench numbers, 2-pass processing
is faster in terms of CPU cycles (~15% speedup) when all the data is
in the L1/L2 cache. Even though we need to save/reload data from the
temporary buffer, splitting the processing into two parts has other
performance benefits: more flexibility in selecting the level of
unrolling for each part separately, and easier register allocation.
However, when the working set exceeds the cache sizes (the M: number
in lowlevel-blt-bench), we see a ~20% performance loss.
Next, let's look at the cairo-perf-trace results as a more realistic
benchmark. The 2-pass processing is also a performance loss there (but
only ~4-5%). The root cause of the problem in this particular case is
that bilinear src_8888_8888 (reading from RAM, writing to a temporary
buffer in the L1 cache) is heavy on computations, while unscaled
over_8888_8_8888 is memory heavy (reading the scaled source pixels
from the L1 cache, but with mask and destination in RAM). If we were
using the pixman general pipeline for this operation (with a SIMD
optimized bilinear fetcher), we would also have mask expansion from
8-bit to 32-bit happening in a temporary buffer, which both adds extra
computations and makes the memory access pattern worse. But I have no
performance data to support this claim yet.

That said, caring about these implementation details can only give us
a few extra percent of performance. But hitting a slow path is always
a big performance disaster. Generalizing bilinear scaling support so
that there are no slow paths is more important.

-- 
Best regards,
Siarhei Siamashka
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision
On Sun, Jul 1, 2012 at 12:25 AM, Søren Sandmann sandm...@cs.au.dk wrote:

> Siarhei Siamashka siarhei.siamas...@gmail.com writes:
>> On Tue, Jun 26, 2012 at 6:13 AM, Jeff Muizelaar jmuizel...@mozilla.com wrote:
>>> We recently switched to using the same precision as Skia on
>>> Android. (4 bits for RGBA32 and 2 bits for RGB565)
>>
>> This is a good point. If using just 4 or even 2 bits of
>> interpolation precision is acceptable for Skia, then maybe the
>> current bilinear interpolation precision is really excessive in
>> pixman. It would be too generous to give a handicap to Skia :)
>> So should we also try 4-bit interpolation? Compared to 7-bit or
>> 8-bit interpolation, it has the advantage that all the calculations
>> need only 16-bit unsigned variables. This is faster for at least C,
>> x86 SSE2 and ARM NEON code.
>
> Four bits is also the precision that GdkPixbuf has used for many
> years with no complaints from users that I'm aware of. If you zoom
> more than 16x with GdkPixbuf, banding artefacts definitely start
> showing up, but zooming that far is unlikely to look good with
> bilinear filtering anyway. So dropping to four bits doesn't sound too
> bad to me.

OK, we'll see how much performance can be gained by going to lower
precision. There is only one way to find out: implement the different
variants and benchmark them against each other.

> That said, I'm a little concerned that nobody is trying separable
> scaling

Well, you are trying. So this does not qualify as nobody ;)

> instead of worrying about these microoptimizations.

It's not instead, but in addition to. All these optimizations are
independent from each other and can be used together with some minor
adaptation.

Separable scaling is a good idea, but it is not a silver bullet.
Downscaling is still a valid use case, and separable scaling would
provide no reduction in the number of arithmetic operations for it.
Also, x86 SSSE3 and ARM NEON add some extra challenges:

* Using 8-bit multiplications for horizontal interpolation is
  difficult, as the weight factors need to be updated for each pixel.
  Single-pass scaling can easily use 8-bit multiplications for
  vertical interpolation, as the weight factors are pre-calculated
  before entering the loop.
* Separable scaling needs extra load/store instructions to save
  temporary data between passes.
* When we are approaching the memory speed barrier, the separation of
  operations into passes may result in uneven usage of the memory
  subsystem.

Still, for example on ARMv6 without real SIMD, it seems to be
difficult to implement both vertical and horizontal bilinear
interpolation in one pass. There are just not enough general purpose
registers to sufficiently unroll the loop to get rid of pipeline
stalls and do this without spilling temporary data to memory or
reloading constants. So the support for separable scaling is a very
welcome feature for such microoptimizations. There is no need to put
all our eggs into one basket. Having multiple scaling methods
available should not be a problem, each tuned for different use cases
and different target architectures.

> As far as I know Skia does this as well, as does pretty much
> everybody who cares about fast scaling. That is, the two closest
> source lines are scaled horizontally, the result is cached, and then
> the intermediate destination lines can be generated by just
> interpolating vertically. This cuts down the amount of arithmetic
> required quite substantially, especially for scale factors larger
> than 2. Here is an example:
>
> http://cgit.freedesktop.org/~sandmann/pixman/log/?h=separable-bilinear
>
> Performance results for fishtank with SSE2 and MMX disabled:
>
> Before: [ 0]  image  firefox-fishtank  603.804  603.804  0.00%
> After:  [ 0]  image  firefox-fishtank  535.321  535.321  0.00%
>
> And this is with fishtank, which is exclusively downscalings so each
> line is reused at most once.

Looks good to me.
-- 
Best regards,
Siarhei Siamashka
Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision
On Thu, 05 Jul 2012 12:25:55 -0700 Bill Spitzak spit...@gmail.com wrote:

> On 07/05/2012 12:22 AM, Siarhei Siamashka wrote:
>> Separable scaling is a good idea, but it is not a silver bullet.
>> Downscaling is still a valid use case, and separable scaling would
>> provide no reduction in the number of arithmetic operations for it.
>
> This is not true. In fact 2-pass scaling is a much *bigger* win for
> downscaling if you use filters of size greater than 1. And filters
> larger than 1 are absolutely required to fix the artifacts that are
> in current downscaling.

Looks like you got it backwards. Let's do some calculations. For the
sake of simplicity, let's assume that an MxM image is scaled to NxN
size.

In the case of single-pass bilinear scaling, we fetch a 2x2 pixel
block from the source image for each destination pixel (N * N times in
total) and perform 1 vertical interpolation followed by 1 horizontal
interpolation for it. This is N * N + N * N interpolation operations
in total (let's ignore the fact that horizontal interpolation is a
little bit more expensive).

In the case of 2-pass scaling, typically all the pixels from the
source image participate in the calculations (unless M >= 2 * N, i.e.
2x+ downscaling). So the number of horizontal interpolation operations
is M * N (each source scanline gets horizontally scaled to size N and
cached). The number of vertical interpolation operations is N * N (we
do one for every destination pixel on the second pass). So the total
number of interpolation operations is M * N + N * N for 2-pass
bilinear scaling.

Now we only need to compare N * N + N * N with M * N + N * N. In the
case of downscaling (N < M), 2-pass scaling needs more arithmetic
operations. Now let's remember that horizontal interpolation is more
expensive (horizontal weights need to be recalculated every time) and
that storing/reloading data between passes adds more overhead, which
makes the outcome even worse for 2-pass scaling.
> Single-pass bilinear is *not* a big loss for upscaling, because as
> you noticed there is overhead of the intermediate values. It is
> pretty much equivalent to stretching the image in one direction, and
> then in the other. You need to store this intermediate stretched
> image, which is bigger than the original.

Increasing the scale factor for upscaling is eventually going to let
the 2-pass algorithm outperform its single-pass counterpart. There
must be a crossover point somewhere, and it can be used to decide
which algorithm to select for better performance in each particular
case.

> What I propose is that an integer down-scale be selected and used to
> box-filter, which can be computed extremely fast (there are no
> varying weights). This resulting image, which will always be smaller,
> and could be cached, can then be bilinear interpolated using the
> current code.

That's a totally different topic (image quality for downscaling),
which has already been discussed way too much.

-- 
Best regards,
Siarhei Siamashka
Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision
On Fri, 6 Jul 2012 00:08:46 +0300 Siarhei Siamashka siarhei.siamas...@gmail.com wrote:

> On Thu, 05 Jul 2012 12:25:55 -0700 Bill Spitzak spit...@gmail.com wrote:
>> On 07/05/2012 12:22 AM, Siarhei Siamashka wrote:
>>> Separable scaling is a good idea, but it is not a silver bullet.
>>> Downscaling is still a valid use case, and separable scaling would
>>> provide no reduction in the number of arithmetic operations for it.
>>
>> This is not true. In fact 2-pass scaling is a much *bigger* win for
>> downscaling if you use filters of size greater than 1. And filters
>> larger than 1 are absolutely required to fix the artifacts that are
>> in current downscaling.
>
> Looks like you got it backwards. Let's do some calculations. For the
> sake of simplicity, let's assume that an MxM image is scaled to NxN
> size. In the case of single-pass bilinear scaling, we fetch a 2x2
> pixel block from the source image for each destination pixel (N * N
> times in total) and perform 1 vertical interpolation followed by 1
> horizontal interpolation for it. This is N * N + N * N interpolation
> operations in total (let's ignore the fact that horizontal
> interpolation is a little bit more expensive).

Actually, I stand corrected. If we are fetching a 2x2 pixel block,
then we need to interpolate 2 pairs of pixels first in one direction,
and then do one more interpolation in the other direction. So it is
N * N + N * N + N * N interpolation operations in plain C. My mistake
was relying too much on SIMD: for at least 128-bit wide SIMD, it does
not matter whether we are interpolating one pair of pixels or two
pairs at once.

> In the case of 2-pass scaling, typically all the pixels from the
> source image participate in the calculations (unless M >= 2 * N,
> i.e. 2x+ downscaling). So the number of horizontal interpolation
> operations is M * N (each source scanline gets horizontally scaled
> to size N and cached). The number of vertical interpolation
> operations is N * N (we do one for every destination pixel on the
> second pass).
> So the total number of interpolation operations is M * N + N * N for
> 2-pass bilinear scaling. Now we only need to compare N * N + N * N
> with M * N + N * N.

In fact the comparison turns into 3 * N * N vs. (M + N) * N
operations, which is clearly more favourable for 2-pass processing in
plain C for a wide range of scaling factors.

> In the case of downscaling (N < M), 2-pass scaling needs more
> arithmetic operations. Now let's remember that horizontal
> interpolation is more expensive (horizontal weights need to be
> recalculated every time) and that storing/reloading data between
> passes adds more overhead, which makes the outcome even worse for
> 2-pass scaling.
>
>> Single-pass bilinear is *not* a big loss for upscaling, because as
>> you noticed there is overhead of the intermediate values. It is
>> pretty much equivalent to stretching the image in one direction,
>> and then in the other. You need to store this intermediate
>> stretched image, which is bigger than the original.
>
> Increasing the scale factor for upscaling is eventually going to let
> the 2-pass algorithm outperform its single-pass counterpart. There
> must be a crossover point somewhere, and it can be used to decide
> which algorithm to select for better performance in each particular
> case.

Still, the underlying instruction set does matter a lot. We can't
simply ignore the existence of wide SIMD instruction sets.

-- 
Best regards,
Siarhei Siamashka
Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision
Siarhei Siamashka siarhei.siamas...@gmail.com writes:

> In the case of 2-pass scaling, typically all the pixels from the
> source image participate in the calculations (unless M >= 2 * N,
> i.e. 2x+ downscaling). So the number of horizontal interpolation
> operations is M * N (each source scanline gets horizontally scaled
> to size N and cached). The number of vertical interpolation
> operations is N * N (we do one for every destination pixel on the
> second pass). So the total number of interpolation operations is
> M * N + N * N for 2-pass bilinear scaling. Now we only need to
> compare N * N + N * N with M * N + N * N.
>
> In fact the comparison turns into 3 * N * N vs. (M + N) * N
> operations, which is clearly more favourable for 2-pass processing
> in plain C for a wide range of scaling factors.

There may be some confusion here regarding 1-pass and 2-pass. The
algorithm I'm advocating makes only one pass, but keeps a cache of two
horizontally expanded scanlines. For big downscalings, this does at
most 2 * N horizontal interpolations, since in the worst case,
processing one destination scanline will require expanding two source
scanlines. So the real comparison is something like

    3 * N * N   vs.   MIN(2 * N, M) * N + N * N

which is never worse than the 1-pass algorithm in terms of arithmetic,
though I agree it has more overhead in other places.

A true 2-pass algorithm that first expands _all_ scanlines
horizontally would indeed always require M * N horizontal
interpolations, but I don't think such an approach is a good idea (at
least not for a simple bilinear filter).

> Single-pass bilinear is *not* a big loss for upscaling, because as
> you noticed there is overhead of the intermediate values. It is
> pretty much equivalent to stretching the image in one direction, and
> then in the other. You need to store this intermediate stretched
> image, which is bigger than the original.
> Increasing the scale factor for upscaling is eventually going to let
> the 2-pass algorithm outperform its single-pass counterpart. There
> must be a crossover point somewhere, and it can be used to decide
> which algorithm to select for better performance in each particular
> case.
>
> Still, the underlying instruction set does matter a lot. We can't
> simply ignore the existence of wide SIMD instruction sets.

There is no reason the horizontal interpolations couldn't be done with
wide SIMD as well. You would keep two fixed-point coordinates around
in a register and then compute the left/right weights based on them in
a format suitable for pmaddubsw.

(The issue with 128/0 weights is actually not as bad as I thought.
When the weights are 128 and 0, pmaddubsw will treat 128 as -128 and
compute -128 * l + 0 * r, which produces a negative number. Since this
is the *only* negative number that can be produced, and what we really
want is a multiplication by 128, a simple pabsw (also available in
SSSE3) will fix up the result.)

Søren
Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision
On Tue, Jun 26, 2012 at 6:13 AM, Jeff Muizelaar jmuizel...@mozilla.com wrote:

> On 2012-06-25, at 7:44 PM, Siarhei Siamashka wrote:
>> These are the test patches for switching to 7-bit bilinear
>> interpolation precision. The first patch makes the bilinear
>> precision configurable. The second patch tweaks the SSE2 bilinear
>> scaler for better performance using the PMADDWD instruction. Both
>> should be applied after:
>> http://lists.freedesktop.org/archives/pixman/2012-June/002074.html
>
> We recently switched to using the same precision as Skia on Android.
> (4 bits for RGBA32 and 2 bits for RGB565)

This is a good point. If using just 4 or even 2 bits of interpolation
precision is acceptable for Skia, then maybe the current bilinear
interpolation precision is really excessive in pixman. It would be too
generous to give a handicap to Skia :)

So should we also try 4-bit interpolation? Compared to 7-bit or 8-bit
interpolation, it has the advantage that all the calculations need
only 16-bit unsigned variables. This is faster for at least C, x86
SSE2 and ARM NEON code.

Reducing the interpolation precision is simple, because the existing
code can still work correctly with lower precision, assuming that it
uses the BILINEAR_INTERPOLATION_BITS constant for shifts and bitmasks.
But going back is difficult once the code has been optimized to take
advantage of the lower precision for more speed.

-- 
Best regards,
Siarhei Siamashka
Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision
Siarhei Siamashka siarhei.siamas...@gmail.com writes:

> On Tue, Jun 26, 2012 at 6:13 AM, Jeff Muizelaar jmuizel...@mozilla.com wrote:
>> On 2012-06-25, at 7:44 PM, Siarhei Siamashka wrote:
>>> These are the test patches for switching to 7-bit bilinear
>>> interpolation precision. The first patch makes the bilinear
>>> precision configurable. The second patch tweaks the SSE2 bilinear
>>> scaler for better performance using the PMADDWD instruction. Both
>>> should be applied after:
>>> http://lists.freedesktop.org/archives/pixman/2012-June/002074.html
>>
>> We recently switched to using the same precision as Skia on
>> Android. (4 bits for RGBA32 and 2 bits for RGB565)
>
> This is a good point. If using just 4 or even 2 bits of
> interpolation precision is acceptable for Skia, then maybe the
> current bilinear interpolation precision is really excessive in
> pixman. It would be too generous to give a handicap to Skia :)
> So should we also try 4-bit interpolation? Compared to 7-bit or
> 8-bit interpolation, it has the advantage that all the calculations
> need only 16-bit unsigned variables. This is faster for at least C,
> x86 SSE2 and ARM NEON code.

Four bits is also the precision that GdkPixbuf has used for many years
with no complaints from users that I'm aware of. If you zoom more than
16x with GdkPixbuf, banding artefacts definitely start showing up, but
zooming that far is unlikely to look good with bilinear filtering
anyway. So dropping to four bits doesn't sound too bad to me.

That said, I'm a little concerned that nobody is trying separable
scaling instead of worrying about these microoptimizations. As far as
I know Skia does this as well, as does pretty much everybody who cares
about fast scaling. That is, the two closest source lines are scaled
horizontally, the result is cached, and then the intermediate
destination lines can be generated by just interpolating vertically.
This cuts down the amount of arithmetic required quite substantially,
especially for scale factors larger than 2. Here is an example:
http://cgit.freedesktop.org/~sandmann/pixman/log/?h=separable-bilinear

Performance results for fishtank with SSE2 and MMX disabled:

Before: [ 0]  image  firefox-fishtank  603.804  603.804  0.00%
After:  [ 0]  image  firefox-fishtank  535.321  535.321  0.00%

And this is with fishtank, which is exclusively downscalings so each
line is reused at most once.

Søren
Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision
Siarhei Siamashka siarhei.siamas...@gmail.com writes:

> There is also one thing to investigate. It was likely also a problem
> earlier, but may become worse when going to lower interpolation
> precision. Right now the position between pixels is rounded down by
> throwing away the lowest bits when converting it to a bilinear
> weight. It may potentially cause the image to drift to the
> left/upper side if the image is scaled up/down repeatedly really a
> lot. Maybe the pixman_fixed_1 / 2 constant needs to be adjusted here
> (and in the other places too) to ensure more correct rounding:
> http://cgit.freedesktop.org/pixman/tree/pixman/pixman-inlines.h?id=pixman-0.26.0#n811
> But I guess it's better to write a test program to check whether
> such an effect exists.

The pixman_fixed_1 / 2 constant actually has to do with coordinate
systems. If pixels are considered squares, then pixman's coordinate
system is such that (0.5, 0.5) is the center of the top left pixel.
I.e., the asterisks below have coordinates at the half-integers:

    (0, 0)
      +---+---+--
      | * | * |
      +---+---+--
      | * | * |
      +---+---+
      |   |

When compositing, pixman is given a rectangle (dst_x, dst_y, width,
height) with integer coordinates. To compute the result for the top
left pixel, in principle, pixman should do this:

1. Add (0.5, 0.5) to (dst_x, dst_y) to get the coordinate of the
   first pixel
2. Transform those coordinates to get a source location
3. If a nearest filter is used, round to the closest half-integer
   (down if they are equidistant)
4. Subtract (0.5, 0.5) to get x and y integers that can be used to
   compute an index into the bits[] array

For the non-transformed blitters, all of this disappears and becomes
just an offset that can be computed up front. But when there is a
transformation, the full computation must be carried out. The
pixman_fixed_1 / 2 constant used all over the place is just the 0.5 in
step 1. For the nearest filter, steps 3 and 4 can be simplified to
simply subtracting pixman_fixed_e (to ensure we round down) and then
truncating.
For bilinear filtering, the question of rounding down is mostly
irrelevant because we need both the pixel below and the pixel above.
In this case steps 3 and 4 become simply a truncation.

The question of rounding the bilinear interpolation location is
separate. It's true that the truncation of the fraction can lead to
shifts of a 128th of a pixel, and maybe this should be fixed. A simple
solution would be to just add 0x80 or 0x40 before shifting, but it
might be possible to do that only once per scanline, or even once per
composite operation for scalings. I doubt that it can be done easily
by modifying the pixman_fixed_1 / 2 constant though.

Søren
Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision
Søren Sandmann sandm...@cs.au.dk writes:

> For bilinear filtering, the question of rounding down is mostly
> irrelevant because we need both the pixel below and the pixel above.
> In this case steps 3 and 4 become simply a truncation.

This is not true actually. To get the lower pixel, 0.5 is subtracted,
and then the position is truncated. But the question of rounding down
or up is irrelevant, since when the location is exactly on top of a
pixel, the two cases both end up with that pixel having a weight of 1
and the other pixel (whether below or above) a weight of 0.

Søren
Re: [Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision
On Tue, Jun 26, 2012 at 6:04 PM, Søren Sandmann sandm...@cs.au.dk wrote:

> Siarhei Siamashka siarhei.siamas...@gmail.com writes:
>> Also, 7-bit bilinear interpolation precision still allows a nice
>> SSSE3 optimization, which can be implemented later to get even more
>> speed.
>
> Presumably you mean pmaddubsw,

Yes.

> but isn't it a problem that the weights can go from 0 to 128
> inclusive? That is, if one weight is 0, the other is 128, which
> means it doesn't actually fit in seven bits. So unless I am missing
> something, pmaddubsw can't be used without a test for 0.

For 8-bit precision we had the same problem with weights going from 0
to 256 inclusive, and it is already handled here:
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-inlines.h?id=pixman-0.26.0#n881

>> But the patches need a lot more testing on different hardware.
>
> Running the test suite on ppc64, ARM, x86-64 with various
> implementations disabled revealed no problems for me.

Thanks. I'll still try to run some more tests and review the code
again to check for any possible leftover magic constants hardcoding 8
bits of bilinear interpolation precision.

There is also one thing to investigate. It was likely also a problem
earlier, but may become worse when going to lower interpolation
precision. Right now the position between pixels is rounded down by
throwing away the lowest bits when converting it to a bilinear weight.
It may potentially cause the image to drift to the left/upper side if
the image is scaled up/down repeatedly really a lot. Maybe the
pixman_fixed_1 / 2 constant needs to be adjusted here (and in the
other places too) to ensure more correct rounding:
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-inlines.h?id=pixman-0.26.0#n811
But I guess it's better to write a test program to check whether such
an effect exists.

-- 
Best regards,
Siarhei Siamashka