[f265 dev team] Cache-efficient motion estimation

Laurent Birtz Mon, 28 Jul 2014 09:08:01 -0700

Our current motion estimation design relies on precomputing the halfpel
planes for each reference frame. Thus, halfpel motion estimation is
almost as efficient as fullpel motion estimation. The quarterpel motion
estimation is also cheap since we simply average the two nearest halfpel
positions like in H.264. The accuracy loss is limited when SAD is used.


However, this method quadruples the memory requirements per reference
frame. Furthermore, since the quarterpel interpolation is not accurate,
SATD doesn't seem to improve quality.

Since the quality boost from SATD is desirable in a high-quality setup,
we must support efficient non-averaged quarterpel motion estimation.
Cache efficiency for 64x64 CTBs and multiple reference frames is also a
concern in its own right. It may well be that trading computations for
cache is worth it.

If we get rid of the precomputed planes, then we have to filter for both
the halfpel and the quarterpel motion estimation. Let us explore the
situation in both cases.

Halfpel:

  F h F h F h     F: fullpel
  v d v d v d     h: horizontal halfpel.
  F h F h F h     v: vertical halfpel.
  v d v d v d     d: diagonal halfpel (horizontal & vertical).
  F h F h F h     q: quarterpel position.
  v d v d v d

Quarterpel:

  F q h q F q h q F
  q q q q q q q q q
  v q d q v q d q v
  q q q q q q q q q
  F q h q F q h q F
  q q q q q q q q q
  v q d q v q d q v
  q q q q q q q q q
  F q h q F q h q F

There are 3 interpolations: horizontal, vertical and diagonal. The
horizontal and vertical interpolations interpolate the fullpels into
intermediate 16-bit values, which are then clamped to 8-bit. The
diagonal interpolation interpolates the fullpels into intermediate
16-bit values. Then, the 16-bit values are interpolated into
intermediate 32-bit values. Finally, the 32-bit values are clamped to
8-bit. Processing 32-bit values take twice the time of processing 16-bit
values. Hence the diagonal interpolation requires 3x more time than the
horizontal or vertical interpolation.

The proposed design is custom-made for the X-diamond algorithm.
X-diamond moves for 0.5*2=1 pixel in halfpel ME, and 0.25*2=0.5 pixel in
quarterpel ME, for a total of 1.5 pixels.

For the halfpel interpolation, if we start from a fullpel position, then
we can move at most 1 pixel on each side, and so the horizontal halfpel
on either side of a fullpel can be visited. Thus, for a NxN block, we
must interpolate (N+1)x(N+1) pixels minimally.

For the quarterpel interpolation, *if we start from a halfpel position*,
then we can move at most 0.5 pixel on each side. For the horizontal
interpolation, we visit the following quarterpel positions horizontally:
- Initial position is fullpel: -0.5, -0.25, 0, 0.25, 0.5
- Initial position is halfpel: 0, 0.25, 0.5, 0.75, 1

Hence, the 1/4 and 3/4 quarterpel fractions are visited only once
horizontally, but the 2/4 fraction can be visited twice horizontally.
The 16-bit result of the horizontal interpolation is used for the
interpolation of the 3 vertical quarterpel positions below that
horizontal position. Hence, for the quarterpel interpolation, we must
interpolate (N+1)x(N+1) pixels for the 2/4 fraction, and Nx(N+1) pixels
for the 1/4 and 3/4 fractions. Interpolating (N+1) pixels horizontally
instead of N pixels is expensive because the extra pixel requires the
use of an extra register.

If we combine the halfpel and quarterpel cases, we arrive to the
following design.

Interpolate (N+1)x(N+1) pixels as 8-bit for the vertical halfpel plane.
Interpolate (N+2)x(N+2+4+4) pixels horizontally as 16-bit for the
horizontal and diagonal cases. Pack to 8-bit for the horizontal halfpel
plane. Interpolate (N+1)x(N+1+4+4) 16-bit values as 8-bit for the
diagonal halfpel plane. At this point, we have computed the 3 halfpel
planes and we can perform the halfpel motion estimation.

The halfpel memory requirements are as follow:
- 1 interpolation result (1).
- 1 source plane (1).
- 1 fullpel plane (1).
- 3 halfpel planes (3).
- 1 intermediate 16-bit horizontal plane kept around for the quarterpel
  interpolation (2).
Total: about 8x the original block size.

For the quarterpel ME, we already have the 2/4 fraction interpolated
horizontally. We interpolate the 1/4 and 3/4 fractions horizontally as
two Nx(N+1) blocks. Then we perform the quarterpel ME. The positions
aligned vertically on a fullpel use the 16-bit vertical interpolation.
The positions aligned horizontally on a fullpel clip the 3 precomputed
horizontal 16-bit planes. The other positions use the 32-bit vertical
interpolation from the precomputed 16-bit planes.

The quarterpel memory requirements are as follow:
- 1 interpolation result (1).
- 1 source plane (1).
- 1 fullpel plane (1).
- 3 16-bit halfpel planes (6).
Total: about 9x the original block size.

The amount of computations required for quarterpel ME is much higher
than for halfpel ME. For halfpel, we have 2 16-bit interpolations (cost
2*1), one 32-bit interpolation (cost 2), total cost 4. For quarterpel,
in the worst case all interpolations are diagonal. There are 3 16-bit
interpolations (cost 3*1) and 8 32-bit interpolations (cost 8*2), total
cost 19.

Nothing precludes us from using our current averaging algorithm to speed
up the quarterpel interpolation at low quality. It's probably cheaper to
test 8 averaged positions with X-diamond rather than 4 correctly
interpolated positions with diamond, and it might even yield better
quality since we test more positions. We'll have to test. If we go that
route, then we can drop the intermediate 16-bit horizontal plane used
for the quarterpel interpolation, bringing down the memory requirements
down to 6x the original block size.

If we assume zero movement, the local memory requirements for a 64x64
block for both designs are as follow. Source + interpolation = 64*64*2 =
8K. Each block line in a reference plane requires 2 64-byte cache lines,
thus 2*64*64 = 8K per reference plane.
Old design: 64*64*2 + 4*64*64*2 = 40K (1 ref).
New design: 64*64*2 + 1*64*64*2 + 3*2*64*64 = 40K (1 ref with full qpel).
New design: 64*64*2 + 1*64*64*2 + 3*64*64 = 28K (1 ref with fake qpel).

We still trash the 32K L1 data cache in any case. But the cache usage
grows much less quickly when multiple reference frames are used. In
practice, there is going to be some vertical movement, so the old design
will actually consume more cache than what we computed here.

We can expect to cram more data in L2 with the new design and reduce the
number of costly L3 look-ups. We don't have a choice in we want SATD.
Whether this will improve or lower performance in practice remains to be
seen.

Laurent
--
To unsubscribe visit http://f265.org
or send a mail to [email protected].

[f265 dev team] Cache-efficient motion estimation

Reply via email to