Our current motion estimation design relies on precomputing the halfpel planes for each reference frame. Thus, halfpel motion estimation is almost as efficient as fullpel motion estimation. The quarterpel motion estimation is also cheap since we simply average the two nearest halfpel positions like in H.264. The accuracy loss is limited when SAD is used.
However, this method quadruples the memory requirements per reference frame. Furthermore, since the quarterpel interpolation is not accurate, SATD doesn't seem to improve quality. Since the quality boost from SATD is desirable in a high-quality setup, we must support efficient non-averaged quarterpel motion estimation. Cache efficiency for 64x64 CTBs and multiple reference frames is also a concern in its own right. It may well be that trading computations for cache is worth it. If we get rid of the precomputed planes, then we have to filter for both the halfpel and the quarterpel motion estimation. Let us explore the situation in both cases. Halfpel: F h F h F h F: fullpel v d v d v d h: horizontal halfpel. F h F h F h v: vertical halfpel. v d v d v d d: diagonal halfpel (horizontal & vertical). F h F h F h q: quarterpel position. v d v d v d Quarterpel: F q h q F q h q F q q q q q q q q q v q d q v q d q v q q q q q q q q q F q h q F q h q F q q q q q q q q q v q d q v q d q v q q q q q q q q q F q h q F q h q F There are 3 interpolations: horizontal, vertical and diagonal. The horizontal and vertical interpolations interpolate the fullpels into intermediate 16-bit values, which are then clamped to 8-bit. The diagonal interpolation interpolates the fullpels into intermediate 16-bit values. Then, the 16-bit values are interpolated into intermediate 32-bit values. Finally, the 32-bit values are clamped to 8-bit. Processing 32-bit values take twice the time of processing 16-bit values. Hence the diagonal interpolation requires 3x more time than the horizontal or vertical interpolation. The proposed design is custom-made for the X-diamond algorithm. X-diamond moves for 0.5*2=1 pixel in halfpel ME, and 0.25*2=0.5 pixel in quarterpel ME, for a total of 1.5 pixels. For the halfpel interpolation, if we start from a fullpel position, then we can move at most 1 pixel on each side, and so the horizontal halfpel on either side of a fullpel can be visited. Thus, for a NxN block, we must interpolate (N+1)x(N+1) pixels minimally. For the quarterpel interpolation, *if we start from a halfpel position*, then we can move at most 0.5 pixel on each side. For the horizontal interpolation, we visit the following quarterpel positions horizontally: - Initial position is fullpel: -0.5, -0.25, 0, 0.25, 0.5 - Initial position is halfpel: 0, 0.25, 0.5, 0.75, 1 Hence, the 1/4 and 3/4 quarterpel fractions are visited only once horizontally, but the 2/4 fraction can be visited twice horizontally. The 16-bit result of the horizontal interpolation is used for the interpolation of the 3 vertical quarterpel positions below that horizontal position. Hence, for the quarterpel interpolation, we must interpolate (N+1)x(N+1) pixels for the 2/4 fraction, and Nx(N+1) pixels for the 1/4 and 3/4 fractions. Interpolating (N+1) pixels horizontally instead of N pixels is expensive because the extra pixel requires the use of an extra register. If we combine the halfpel and quarterpel cases, we arrive to the following design. Interpolate (N+1)x(N+1) pixels as 8-bit for the vertical halfpel plane. Interpolate (N+2)x(N+2+4+4) pixels horizontally as 16-bit for the horizontal and diagonal cases. Pack to 8-bit for the horizontal halfpel plane. Interpolate (N+1)x(N+1+4+4) 16-bit values as 8-bit for the diagonal halfpel plane. At this point, we have computed the 3 halfpel planes and we can perform the halfpel motion estimation. The halfpel memory requirements are as follow: - 1 interpolation result (1). - 1 source plane (1). - 1 fullpel plane (1). - 3 halfpel planes (3). - 1 intermediate 16-bit horizontal plane kept around for the quarterpel interpolation (2). Total: about 8x the original block size. For the quarterpel ME, we already have the 2/4 fraction interpolated horizontally. We interpolate the 1/4 and 3/4 fractions horizontally as two Nx(N+1) blocks. Then we perform the quarterpel ME. The positions aligned vertically on a fullpel use the 16-bit vertical interpolation. The positions aligned horizontally on a fullpel clip the 3 precomputed horizontal 16-bit planes. The other positions use the 32-bit vertical interpolation from the precomputed 16-bit planes. The quarterpel memory requirements are as follow: - 1 interpolation result (1). - 1 source plane (1). - 1 fullpel plane (1). - 3 16-bit halfpel planes (6). Total: about 9x the original block size. The amount of computations required for quarterpel ME is much higher than for halfpel ME. For halfpel, we have 2 16-bit interpolations (cost 2*1), one 32-bit interpolation (cost 2), total cost 4. For quarterpel, in the worst case all interpolations are diagonal. There are 3 16-bit interpolations (cost 3*1) and 8 32-bit interpolations (cost 8*2), total cost 19. Nothing precludes us from using our current averaging algorithm to speed up the quarterpel interpolation at low quality. It's probably cheaper to test 8 averaged positions with X-diamond rather than 4 correctly interpolated positions with diamond, and it might even yield better quality since we test more positions. We'll have to test. If we go that route, then we can drop the intermediate 16-bit horizontal plane used for the quarterpel interpolation, bringing down the memory requirements down to 6x the original block size. If we assume zero movement, the local memory requirements for a 64x64 block for both designs are as follow. Source + interpolation = 64*64*2 = 8K. Each block line in a reference plane requires 2 64-byte cache lines, thus 2*64*64 = 8K per reference plane. Old design: 64*64*2 + 4*64*64*2 = 40K (1 ref). New design: 64*64*2 + 1*64*64*2 + 3*2*64*64 = 40K (1 ref with full qpel). New design: 64*64*2 + 1*64*64*2 + 3*64*64 = 28K (1 ref with fake qpel). We still trash the 32K L1 data cache in any case. But the cache usage grows much less quickly when multiple reference frames are used. In practice, there is going to be some vertical movement, so the old design will actually consume more cache than what we computed here. We can expect to cram more data in L2 with the new design and reduce the number of costly L3 look-ups. We don't have a choice in we want SATD. Whether this will improve or lower performance in practice remains to be seen. Laurent -- To unsubscribe visit http://f265.org or send a mail to [email protected].
