Hi, Francesco, Georg and other transcode developers!

Here is a patch set implementing several speed optimizations for the video stabilization plugin (filter/stabilize/filter_stabilize.c).
First, I changed the return type of the `compareImg()` and `compareSubImg()` functions from `double` to `unsigned long int` and added an additional parameter, `treshold`, of the same type. When you call one of these functions, you supply the current best (minimum) error value as `treshold`. Inside the compare function, the accumulated `sum` is checked against `treshold` once per line, and if it becomes bigger than `treshold` the function returns early. This cuts off obviously false search paths. This improvement gives a speedup of approximately three times, although the exact figure depends on the actual image data.

Second, I implemented SSE2 versions of compareSubImg() for both YUV and RGB, and of contrastSubImg() for YUV only. I decided to use SSE2 intrinsics instead of direct inline asm because:
1) they are more portable (you need not think about which registers to use: %rax on x86_64 or %eax on x86, etc.);
2) intrinsics are considered to give the compiler more freedom for additional optimizations.
Using SSE2 for compareSubImg() gives a speedup of approximately three times.

So the total speedup from these improvements is about 9-10 times. The speedup was measured in shakiness=10,accuracy=10 mode. If you are interested, I will provide all test cases, the test results and the configuration of my PC.

Use of SSE2 is controlled by the HAVE_ASM_SSE2 define in config.h. There are more defines in filter_stabilize.c for fine control of SSE2 usage; they are commented in the code itself and in the algorithm descriptions further on. To make the SSE2 implementation easier (and possibly faster) I had to put a limit on the `field_size` parameter: it must be a multiple of 16 when SSE2 is used.

The set consists of four patches. The first two are auxiliary. The third implements the first improvement (without SSE2), and the fourth implements SSE2. All patches are made on top of the `transcode-1_1` mercurial branch.

I will be glad if my work is accepted. I'm also ready to discuss it.
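To illustrate the `treshold` early exit from the first improvement, here is a minimal sketch. It is not the code from the patch: the function name `compare_block`, its signature and the single-plane byte layout are assumptions made only for this example.

```c
#include <stdlib.h>

/* Hypothetical helper (not the patch's compareSubImg()): sums absolute
 * differences of two width x height byte blocks whose rows are `stride`
 * bytes apart. The running sum is compared with `treshold` once per row,
 * and the scan stops as soon as the sum is already worse than the best
 * result known to the caller. */
static unsigned long compare_block(const unsigned char *p1,
                                   const unsigned char *p2,
                                   int width, int height, int stride,
                                   unsigned long treshold)
{
    unsigned long sum = 0;

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            sum += (unsigned long)abs((int)p1[x] - (int)p2[x]);

        if (sum > treshold)
            break;              /* cannot beat the current best any more */

        p1 += stride;
        p2 += stride;
    }
    return sum;
}
```

A caller would keep the smallest value returned so far and pass it as `treshold` on the next call, so that most candidate shifts are rejected after only a few rows.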
--------------------------------------------------------------------------------
Bits on algorithm used in compareSubImg() SSE2 edition.

The main idea is to sum 16 bytes simultaneously. SSE2 uses 128-bit registers, each of which holds exactly 16 unsigned bytes. First, I load 16 bytes of data from the first image (p1) into the first SSE2 register (xmm0), and 16 bytes of data from the second image (p2) into the second SSE2 register (xmm1).

xmm0 = p1 p1 p1 p1 p1 p1 p1 p1 p1 p1 p1 p1 p1 p1 p1 p1
xmm1 = p2 p2 p2 p2 p2 p2 p2 p2 p2 p2 p2 p2 p2 p2 p2 p2

Now I need to determine the absolute difference for each of the 16 byte pairs. This is done in three steps, each performed byte-wise in saturated arithmetic (i.e. 255 + 2 = 255, not 1, and 0 - 4 = 0, not 252):

1) xmm2 = xmm0 - xmm1
2) xmm0 = xmm1 - xmm0
3) xmm0 = xmm0 + xmm2

For each byte, one of the two subtractions saturates to 0 and the other gives the true distance, so their sum is the absolute difference. Now xmm0 holds the absolute differences for all 16 bytes:

xmm0 = d01 d02 d03 d04 d05 d06 d07 d08 d09 d10 d11 d12 d13 d14 d15 d16

We need to sum them, but we can't sum them directly, because the sum will not fit in one byte. So I split xmm0 into two registers: the bytes at odd positions, counting from zero (d02, d04, ...), are copied into xmm1, and the bytes at even positions (d01, d03, ...) are shifted by one byte and stay in xmm0. So now we have:

xmm0 = 00 d01 00 d03 00 d05 00 d07 00 d09 00 d11 00 d13 00 d15
xmm1 = 00 d02 00 d04 00 d06 00 d08 00 d10 00 d12 00 d14 00 d16

There is also a register that holds the partial sums, xmmsum. It is treated as eight 16-bit words. Next I add the 16-bit formatted diffs from xmm0 and xmm1 to xmmsum:

1) xmmsum = xmmsum + xmm0
2) xmmsum = xmmsum + xmm1

This finishes one cycle. Then I go to the next 16 bytes of image data.

Because the total sum does not fit in 16-bit words, I need to flush the `xmmsum` register to the regular C variable `sum` from time to time. How often the flush is done is determined by the SSE2_CMP_SUM_ROWS define. Flushing means that the eight 16-bit words in `xmmsum` have to be summed together before they are added to the C `sum`. This can be done either by regular C code or by SSE2 (controlled by the USE_SSE2_CMP_HOR define).
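The per-16-byte cycle described above can be sketched with SSE2 intrinsics roughly as follows. This is only an illustration, not the code from the patch: the function name `sad16_sse2`, the particular mask-and-shift way of splitting the diffs, and the plain C flush at the end are assumptions made for the example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: sum of absolute differences over n bytes.
 * Assumes n is a multiple of 16 and n <= 2048, so that each 16-bit
 * lane of xmmsum (at most 2 * 255 added per cycle) cannot overflow. */
static unsigned long sad16_sse2(const unsigned char *p1,
                                const unsigned char *p2, int n)
{
    const __m128i low_bytes = _mm_set1_epi16(0x00FF);
    __m128i xmmsum = _mm_setzero_si128();

    for (int i = 0; i < n; i += 16) {
        __m128i xmm0 = _mm_loadu_si128((const __m128i *)(p1 + i));
        __m128i xmm1 = _mm_loadu_si128((const __m128i *)(p2 + i));

        /* Saturated byte arithmetic: one difference clamps to 0, the
         * other is the true distance, so the sum is |p1 - p2|. */
        __m128i xmm2 = _mm_subs_epu8(xmm0, xmm1);
        xmm0 = _mm_subs_epu8(xmm1, xmm0);
        xmm0 = _mm_adds_epu8(xmm0, xmm2);

        /* Spread the 16 byte diffs over two registers of 16-bit words. */
        xmm1 = _mm_srli_epi16(xmm0, 8);         /* bytes at odd positions  */
        xmm0 = _mm_and_si128(xmm0, low_bytes);  /* bytes at even positions */

        xmmsum = _mm_add_epi16(xmmsum, xmm0);
        xmmsum = _mm_add_epi16(xmmsum, xmm1);
    }

    /* Flush: add the eight 16-bit partial sums together in plain C. */
    unsigned short words[8];
    _mm_storeu_si128((__m128i *)words, xmmsum);

    unsigned long sum = 0;
    for (int k = 0; k < 8; k++)
        sum += words[k];
    return sum;
}
```

The unaligned loads are used here only to keep the sketch free of alignment assumptions about the image buffers.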
The SSE2 flush is faster, but it is limited by the 16-bit intermediate sum, which is guaranteed NOT to overflow ONLY if SSE2_CMP_SUM_ROWS <= 8. The C code is a bit slower, but it allows SSE2_CMP_SUM_ROWS up to 128, so flushing is done rarely. In my tests, enabling USE_SSE2_CMP_HOR and setting SSE2_CMP_SUM_ROWS to 8 gives better results than using SSE2_CMP_SUM_ROWS = 128 without USE_SSE2_CMP_HOR.

--------------------------------------------------------------------------------
Bits on algorithm used in contrastSubImgYUV() SSE2 edition.

The beginning is the same: I load 16 bytes of image data into xmm0. Before that, I prepare two registers, `mmin` and `mmax`, initialized to the maximum (FF) and minimum (00) byte values respectively. Then I apply the SSE2 min and max operations to this data:

1) mmin = min(mmin, xmm0);
2) mmax = max(mmax, xmm0);

Then I go to the next 16 bytes. When finished, I output the `mmin` and `mmax` data to the regular C variables `mini` and `maxi` by a method similar to the `xmmsum` flushing in the compareSubImg() algorithm.

--------------------------------------------------------------------------------
Best regards, Alexey.