Hi, I have an idea for caching the motion vectors found in the inter search, and reusing them as initial candidates in the fullpel search of other blocks.
The central component is the motion vector field (MVF). The MVF keeps track of the best motion vector found for each reference list and reference index in a block (f265_mv mvf[2][8]). To save space and to cater for the spec limitations, we allow up to 8 reference indices per reference list. We define the following MVFs (832 bytes in 13 cache lines): - f265_mv mvf_16x16[4][2][8]; - f265_mv mvf_32x32[4][2][8]; - f265_mv mvf_64x64[4][2][8]; - f265_mv mvf_part[2][8]; The 16x16 MVF contains a MVF for each 8x8 block in a 16x16 CB. The 32x32/64x64 MVF contains a MVF for each 16x16/32x32 block in a 32x32/64x64 CB. There is only one 16x16 MVF instance per CTB. It is shared by all 16x16 blocks in the CTB (same for the 32x32 MVF). The partition MVF contains the motion vectors found in the current inter partition being analyzed. If an inter block contains two partitions (e.g. 2 16x8 partitions in a 16x16 inter block), then the MVF is shared by these partitions. The analysis of the second partition trashes the information found in the first partition. However, since the processing is symmetrical for both partitions, the second partition can safely read the MVF data of the first partition before it clobbers it in place. The MVFs are zeroed prior to the analysis of a CB. Hence, if a reference index is not tested due to early termination, the corresponding MV in the MVF will be zero and it won't be used as a candidate in the following inter blocks. The CB initialization function obtains the 6 MVFs corresponding to the current CB. The pointers to these MVFs are stored in an array, as follow. <part> <parent> <sub0> <sub1> <sub2> <sub3> "part" is the partition MVF. For 8x8 CBs and 16x16 CBs, "sub0-3" are the 4 slots of the 16x16 MVF, and "parent" is the slot of the 32x32 MVF that corresponds to the location of the 16x16 CB in the 32x32 CB. For 32x32 CBs, "sub0-3" are the 4 slots of the 32x32 MVF, and "parent" is the slot of the 64x64 MVF that corresponds to the location of the 32x32 CB in the 64x64 CB. For the 64x64 CB, "sub0-3" are the 4 slots of the 64x64 MVF and "parent" is null. Initially, the MVFs are empty. The 16x16/8x8 multi-level search initializes the parent slot with the 16x16 data and the 8x8 slots with the 8x8 data. Since the ML search is final for the 16x16 partitions, the 16x16 partitions do not use the MVF candidates at all. Otherwise, the analysis of the current inter partition uses the MV from the MVFs as follow. For a 8x8 partition, use the MV from the 16x16 level, the MVs from the left/top 8x8 blocks if present, and the MV from the current 8x8 block since it was prepopulated by the ML search. For a 8x4/4x8 partition, use the MV from the current 8x8 block, the MV of the first partition if present, and perhaps the MVs from the two neighbour 8x8 CBs if present and if it actually helps in practice. For the other hybrid partitions (16x8/8x16, 32x16/16x32, 64x32/32x64, and their AMP equilvalents), use the MVs from the parent level, the MVs from the two underlying sub blocks, and the MVs of the first partition if present. For the 32x32 and 64x64 partitions, use the MVs from the four sub slots. All that logic is implemented efficiently with precomputed tables and an assembly function. To simplify, there are always four candidates per partition. In the case where the logic calls for less, the entries are padded with duplicate partition MVF entries (which was zeroed for the first partition). Hence, the same MVFs are used for both partitions of an inter block. The pointers to the four chosen MVFs are stored in the mvf_array at the beginning of the analysis of a partition mode. This is done by looking up the four offsets in the table corresponding to the block size and the partition mode. This can be a two-level lookup to save space. The assembly function takes as input the two PMVs, the mvf_array, the MVF offset corresponding to the current list and partition index, and the clipping bounds. The output is two clipped PMVs, the unique candidate count, and up to 5 clipped candidate MVs. The function adds the (0,0) candidate to the four candidates passed as parameter and it eliminates all candidates that are duplicate of another candidate or PMV (vectorized branch-free exhaustive comparisons). When the 8x8/16x16/32x32 partition mode has been analyzed, the partition MVs are copied in the corresponding 16x16/32x32/64x64 MVF slot (64 byte copy). The branch can be avoided by choosing the partition MVF as the destination for a non-UN mode (i.e. always store by copying 2 ymm somewhere). The only unpredictable branch in the implementation is the loop over the unique candidates (a variable number) to get the best initial fullpel position. Laurent -- To unsubscribe visit http://f265.org or send a mail to [email protected].
