Hi, Francesco, Georg and other transcode developers!

Here is a patch set implementing several speed optimizations
for the video stabilization plugin (filter/stabilize/filter_stabilize.c).

First, I changed the return type of the `compareImg()` and `compareSubImg()`
functions from `double` to `unsigned long int` and added an extra
parameter `treshold` of the same type. When you call one of these functions,
you pass the current best (minimum) error value as `treshold`. Inside the
compare function the accumulated `sum` is checked against `treshold` once per
line, and if it becomes bigger than `treshold` the function returns early.
This way we cut off obviously false search paths. This improvement gives a
speedup of approximately three times, though this depends on the actual image
data.
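The early exit can be sketched roughly as follows. This is only an
illustration, not the code from the patch: the function name `compare_block`
and its `width`/`height`/`stride` parameters are assumptions, and only the
`treshold` and `sum` names come from the description above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for compareSubImg(): sums absolute
 * byte differences of two width x height blocks and gives up as soon as
 * the running sum exceeds `treshold` (spelling follows the patch). */
unsigned long int compare_block(const unsigned char *p1,
                                const unsigned char *p2,
                                int width, int height, int stride,
                                unsigned long int treshold)
{
    unsigned long int sum = 0;
    assert(width > 0 && height > 0 && stride >= width);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            sum += (unsigned long int)abs((int)p1[x] - (int)p2[x]);
        /* Checked once per line: this candidate can no longer win. */
        if (sum > treshold)
            break;
        p1 += stride;
        p2 += stride;
    }
    return sum;
}
```

The caller would keep the smallest value returned so far and pass it as
`treshold` for the next candidate, so bad candidates are rejected after a few
lines.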

Second, I implemented SSE2 versions of compareSubImg() for both YUV and RGB,
and of contrastSubImg() for YUV only. I decided to use SSE2 intrinsics instead
of inline asm because: 1) they are more portable (you need not think about
which registers to use, %rax on x86_64 or %eax on x86, etc.); 2) intrinsics
are generally considered to give the compiler more freedom for additional
optimizations.
Using SSE2 for compareSubImg() gives a speedup of approximately three times.

So the total speedup from these improvements is about 9-10 times. The speedup
was measured in shakiness=10,accuracy=10 mode. If you are interested, I can
provide all test cases, the test results and the configuration of my PC.

Use of SSE2 is controlled by the HAVE_ASM_SSE2 define in config.h.
There are more defines in filter_stabilize.c for fine control of SSE2 usage;
they are commented in the code itself and explained in the algorithm
descriptions below.

To make the SSE2 implementation easier (and possibly faster) I had to put a
limit on the `field_size` parameter: it must be a multiple of 16 when SSE2 is
used.

There is a set of four patches. The first two are auxiliary. The third
implements the first improvement (without SSE2), and the fourth implements
SSE2.

All patches are made on top of the `transcode-1_1` Mercurial branch.

I will be glad if my work is accepted. I'm also ready to discuss it.

--------------------------------------------------------------------------------

Notes on the algorithm used in the SSE2 version of compareSubImg().

The main idea is to sum 16 bytes simultaneously. SSE2 uses 128-bit registers,
each of which holds exactly 16 unsigned bytes. First, I load 16 bytes of data
from the first image (p1) into the first SSE2 register (xmm0), and 16 bytes of
data from the second image (p2) into the second SSE2 register (xmm1).

xmm0 = p1 p1 p1 p1 p1 p1 p1 p1 p1 p1 p1 p1 p1 p1 p1 p1
xmm1 = p2 p2 p2 p2 p2 p2 p2 p2 p2 p2 p2 p2 p2 p2 p2 p2

Now I need to determine the absolute difference for each of the 16 byte pairs.
It is done in three steps, each performed byte-wise in saturated arithmetic
(i.e. 255 + 2 = 255, not 1, and 0 - 4 = 0, not 252):
1) xmm2 = xmm0 - xmm1
2) xmm0 = xmm1 - xmm0
3) xmm0 = xmm0 + xmm2

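Now xmm0 holds the absolute differences for the 16 bytes: for every byte one
of the two saturated subtractions gives 0, and the other gives the absolute
difference.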

xmm0 = d01 d02 d03 d04 d05 d06 d07 d08 d09 d10 d11 d12 d13 d14 d15 d16
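
In terms of SSE2 intrinsics, the three steps could look roughly like the
sketch below. The helper name `absdiff16` and the unaligned loads and stores
are assumptions made for the illustration; the patch itself may be organized
differently.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Byte-wise |p1[i] - p2[i]| for 16 bytes, written to out[0..15]. */
void absdiff16(const unsigned char *p1, const unsigned char *p2,
               unsigned char *out)
{
    assert(p1 && p2 && out);
    __m128i xmm0 = _mm_loadu_si128((const __m128i *)p1);
    __m128i xmm1 = _mm_loadu_si128((const __m128i *)p2);
    __m128i xmm2 = _mm_subs_epu8(xmm0, xmm1); /* 1) p1 - p2, clamped at 0 */
    xmm0 = _mm_subs_epu8(xmm1, xmm0);         /* 2) p2 - p1, clamped at 0 */
    xmm0 = _mm_adds_epu8(xmm0, xmm2);         /* 3) one of the terms is 0 */
    _mm_storeu_si128((__m128i *)out, xmm0);
}
```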

We need to sum them, but we can't sum them directly as bytes, because the sum
will not fit in one byte. So I split xmm0 into two registers: the odd bytes
are copied into xmm1, and the even bytes are shifted by one byte and stay in
xmm0. So now we have:

xmm0 = 00 d01 00 d03 00 d05 00 d07 00 d09 00 d11 00 d13 00 d15
xmm1 = 00 d02 00 d04 00 d06 00 d08 00 d10 00 d12 00 d14 00 d16

There is also a register holding the partial sums, xmmsum. It is treated as
eight 16-bit words.
Next I add the diffs from xmm0 and xmm1, now in 16-bit format, to xmmsum:

1) xmmsum = xmmsum + xmm0
2) xmmsum = xmmsum + xmm1

This finishes one cycle. Then I move on to the next 16 bytes of image data.
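
A possible way to write the split and the accumulation with intrinsics is
sketched below. It assumes the differences are already available as bytes in
memory; the function name `accumulate_diffs` and the mask-and-shift way of
splitting the bytes are choices made for this illustration, not necessarily
what the patch does.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Adds n difference bytes (n a multiple of 16) into eight 16-bit partial
 * sums. The caller must keep n small enough that no word exceeds 65535. */
void accumulate_diffs(const unsigned char *diffs, int n,
                      unsigned short sums[8])
{
    const __m128i low_byte = _mm_set1_epi16(0x00FF);
    __m128i xmmsum = _mm_setzero_si128();
    assert(n % 16 == 0);
    for (int i = 0; i < n; i += 16) {
        __m128i xmm0 = _mm_loadu_si128((const __m128i *)(diffs + i));
        /* Keep one byte of every 16-bit word, zero the other. */
        __m128i xmm1 = _mm_and_si128(xmm0, low_byte);
        /* Move the other byte of every word down by 8 bits. */
        xmm0 = _mm_srli_epi16(xmm0, 8);
        xmmsum = _mm_add_epi16(xmmsum, xmm0);
        xmmsum = _mm_add_epi16(xmmsum, xmm1);
    }
    _mm_storeu_si128((__m128i *)sums, xmmsum);
}
```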

Because the total sum doesn't fit in 16-bit words, I need to flush the
`xmmsum` register to the regular C variable `sum` from time to time. How often
the flush is done is determined by the SSE2_CMP_SUM_ROWS define.

Flushing means that I need to sum the eight 16-bit words in `xmmsum` together
before adding them to the C `sum`. This can be done either by regular C code
or by SSE2 (controlled by the USE_SSE2_CMP_HOR define). SSE2 is faster, but we
are limited by the 16-bit intermediate sum, which is guaranteed NOT to
overflow ONLY if SSE2_CMP_SUM_ROWS <= 8. The C code is a bit slower, but we
can use SSE2_CMP_SUM_ROWS up to 128, so flushing is done rarely. In my tests,
enabling USE_SSE2_CMP_HOR and setting SSE2_CMP_SUM_ROWS to 8 gives better
results than using SSE2_CMP_SUM_ROWS = 128 without USE_SSE2_CMP_HOR.
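
The two flushing variants might look like the sketch below. The names
`flush_c` and `flush_sse2` are hypothetical, and the folding sequence in the
SSE2 variant is just one common way to do a horizontal sum; the patch may use
a different instruction sequence.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Plain C flush: spill the register and add the eight words as integers. */
unsigned long int flush_c(__m128i xmmsum)
{
    unsigned short w[8];
    unsigned long int total = 0;
    _mm_storeu_si128((__m128i *)w, xmmsum);
    for (int i = 0; i < 8; i++)
        total += w[i];
    return total;
}

/* SSE2 flush: fold the register in half three times with 16-bit adds.
 * Every intermediate value is a 16-bit word, so the grand total must
 * not exceed 65535. */
unsigned long int flush_sse2(__m128i xmmsum)
{
    /* Debug-only check of the precondition stated above. */
    assert(flush_c(xmmsum) <= 0xFFFFUL);
    xmmsum = _mm_add_epi16(xmmsum, _mm_srli_si128(xmmsum, 8));
    xmmsum = _mm_add_epi16(xmmsum, _mm_srli_si128(xmmsum, 4));
    xmmsum = _mm_add_epi16(xmmsum, _mm_srli_si128(xmmsum, 2));
    /* The lowest word now holds the sum of all eight words. */
    return (unsigned long int)(_mm_extract_epi16(xmmsum, 0) & 0xFFFF);
}
```

The C variant tolerates larger per-word values because it widens each word
before adding, which is why it allows less frequent flushing.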

--------------------------------------------------------------------------------

Notes on the algorithm used in the SSE2 version of contrastSubImgYUV().

The beginning is the same: I load 16 bytes of image data into xmm0. Before
that, I prepare two registers, `mmin` and `mmax`, initialized to the maximum
(FF) and minimum (00) byte values respectively. Then I apply the SSE2 min and
max operations to this data:

1) mmin = min(mmin, xmm0);
2) mmax = max(mmax, xmm0);

Then I go to the next 16 bytes.

When finished, I output the `mmin` and `mmax` data to the regular C variables
`mini` and `maxi` by a method similar to the `xmmsum` flushing in the
compareSubImg() algorithm.
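
As a rough sketch of this min/max scan: the function name `minmax_bytes`, the
flat buffer layout and the final reduction in plain C are assumptions made
for the illustration; only `mmin`, `mmax`, `mini` and `maxi` are names from
the description above.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Finds the smallest and largest byte among n bytes
 * (n must be a positive multiple of 16). */
void minmax_bytes(const unsigned char *p, int n,
                  unsigned char *mini, unsigned char *maxi)
{
    unsigned char lo[16], hi[16];
    __m128i mmin = _mm_set1_epi8(-1);   /* all bits set: FF in every byte */
    __m128i mmax = _mm_setzero_si128(); /* 00 in every byte */
    assert(n > 0 && n % 16 == 0);
    for (int i = 0; i < n; i += 16) {
        __m128i xmm0 = _mm_loadu_si128((const __m128i *)(p + i));
        mmin = _mm_min_epu8(mmin, xmm0);
        mmax = _mm_max_epu8(mmax, xmm0);
    }
    /* Reduce the 16 per-byte results in plain C. */
    _mm_storeu_si128((__m128i *)lo, mmin);
    _mm_storeu_si128((__m128i *)hi, mmax);
    *mini = lo[0];
    *maxi = hi[0];
    for (int i = 1; i < 16; i++) {
        if (lo[i] < *mini) *mini = lo[i];
        if (hi[i] > *maxi) *maxi = hi[i];
    }
}
```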

--------------------------------------------------------------------------------

Best regards,
Alexey.

