This is my latest attempt to design the global mode decision logic in f265
(yes, one of many -- this is a difficult problem). It incorporates the
knowledge I gained during the ongoing refactoring of the analysis code.
The most important part of the analysis is to choose the size of the blocks
(4x4, 8x8, 16x16, 32x32, 64x64) and their mode (intra or inter). This document
describes heuristics to achieve that result quickly.
* Acronym soup.
** Rate-distortion models.
RDM: (standard) rate-distortion model. The distortion is computed from the
difference between the source pixels and the predicted pixels (SAD or SATD).
The rate is the total number of bins, excluding the transform bins. All bins
are treated as bypass bins.
RDO: rate-distortion optimization. The distortion is computed by the difference
between the source pixels and the reconstructed pixels (SSD). The rate is
computed by the precise entropy loss in the actual encoding process.
RDA: rate-distortion approximation. Like RDO, but the rate is computed
approximately.
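All three models reduce to minimizing a Lagrangian cost J = D + lambda*R; they
differ only in how D and R are measured. A minimal sketch in C, assuming a
fixed-point lambda derived from the QP (the struct and function names are
hypothetical, not taken from the f265 code):

```c
#include <stdint.h>

/* Hypothetical sketch: the three models minimize J = D + lambda*R and
 * differ only in how D and R are measured. Lambda is derived from the
 * QP and kept in fixed point (8 fractional bits) to avoid floats. */
typedef struct {
    int64_t dist;   /* SAD/SATD (RDM) or SSD (RDA/RDO). */
    int64_t rate;   /* Approximate bin count (RDM/RDA) or exact bits (RDO). */
} rd_stats;

static int64_t rd_cost(const rd_stats *s, int64_t lambda_q8)
{
    return s->dist + ((s->rate * lambda_q8) >> 8);
}
```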
** Distortion metrics.
SAD: sum of absolute differences, a fast metric used in the fullpel search.
SATD: sum of absolute transformed differences, a slow metric sometimes used in
the subpel search.
SSD: sum of squared differences, a fast metric used in the reconstructed
domain.
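As a reference for the definitions above, here are scalar versions of SAD and
SSD (the encoder uses vectorized assembly; these only illustrate the math):

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference implementations of two of the distortion metrics. */
static uint32_t sad(const uint8_t *a, const uint8_t *b, int n)
{
    uint32_t s = 0;
    for (int i = 0; i < n; i++)
        s += (uint32_t)abs(a[i] - b[i]);
    return s;
}

static uint64_t ssd(const uint8_t *a, const uint8_t *b, int n)
{
    uint64_t s = 0;
    for (int i = 0; i < n; i++) {
        int d = a[i] - b[i];
        s += (uint64_t)(d * d);
    }
    return s;
}
```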
** Ambiguity zone.
I refer to the set of block sizes and modes that are potentially interesting as
the ambiguity zone. It is abbreviated to "the zone" for the rest of the text.
Initially the zone includes the whole CTB. The zone shrinks down to a single
choice as the analysis progresses.
* Summary of my R&D findings.
Small blocks matter more than large blocks. H.264 usually does fine without
32x32 and 64x64 blocks. It is much better to incorrectly skip the analysis of a
large block than to incorrectly skip the analysis of a small block. However,
small blocks take more time to process than large blocks.
Inter is chosen more often than intra, especially at high QP.
Using the reconstructed pixels is important for intra prediction, even in RDM.
Quality suffers significantly when the source pixels are used for prediction.
In RDO, 40% of the time is spent in (I)DCT and (de)quantization, and 60% is
spent in the 2 assembly preprocessing functions and the C-only transform block
encoding function. The latter function cannot be vectorized and it is quite
slow.
The residual error depends directly on the quantization error. For the
quantization error to be accurate, the DCT coefficients and the predicted
pixels must also be accurate. I could not find a fast approximation. AFAICT the
reconstruction and the SSD must be computed exactly for RDO.
Experimentally, the rest of the RDO process requires much less accuracy. I
tried replacing all context bin costs by a fixed value equal to 50% of the cost
of a bypass bin for the entire encoding at QP 30. The quality loss is
surprisingly low. This suggests that the RDO rate cost can be approximated by
an assembly function that counts the approximate number of context and bypass
bins in the transform block. I call this method RDA.
The distinction between context bins and bypass bins does not seem important in
the RDM domain. I had no success estimating the cost of context bins with
either a fixed cost model or the costs estimated from the current CABAC
contexts. The MV costs are an exception. There seems to be a minor benefit in
modeling the cost of the "greater-than-0" and "greater-than-1" context bins
according to the QP in the precomputed MV cost tables at high QP.
Some of the intra 4x4 and 8x8 partitions can be eliminated with negligible
quality loss by computing the variance of the pixels. I did a test using the
variance computed at the 16x16 CB level at QP 30. By comparing the variance
against precomputed thresholds, 50% of the 4x4 partitions and 33% of the 8x8
partitions were rejected for about 0.01 and 0.04 dB PSNR loss respectively. The
method seems to work for many videos. Perhaps the method can be applied in
reverse to reject improbable 32x32 and 64x64 partitions. The 4x4 blocks can
probably be skipped more aggressively by computing the 8x8 variance.
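A minimal sketch of the variance test, assuming an integer variance (scaled by
the pixel count squared to avoid a division) compared against a QP-dependent
threshold. The function names and the threshold are placeholders, not the
tuned values:

```c
#include <stdint.h>

/* Variance scaled by n^2, i.e. n*sum(x^2) - sum(x)^2, to stay in
 * integers. A small intra partition is rejected when the variance of
 * the covering block falls below a precomputed QP-dependent threshold. */
static uint64_t block_var_sq(const uint8_t *p, int stride, int size)
{
    uint64_t n = (uint64_t)size * size, sum = 0, sq = 0;
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++) {
            uint64_t v = p[y * stride + x];
            sum += v;
            sq += v * v;
        }
    return n * sq - sum * sum;
}

/* Hypothetical rejection test at the 16x16 CB level. */
static int reject_intra_4x4(const uint8_t *cb, int stride, uint64_t qp_thresh)
{
    return block_var_sq(cb, stride, 16) < qp_thresh;
}
```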
I know of no algorithms that can safely rule out improbable 8x8 inter blocks.
Testing them seems to be the only method.
The lookahead does not seem to be a useful source of information for the mode
decisions of a CTB in the main encoding. In general, Compression = Prediction +
Correction. High quality is obtained by minimizing the correction cost reliably
in the context of the current prediction. The lookahead does not have access to
the QP, the neighbour motion vectors and intra modes, or the reconstructed
pixels in either the source or the reference frames. It typically operates in
the subsampled domain, further reducing its accuracy.
In summary, the lookahead does not have access to a good prediction. Hence
the decisions from the lookahead are potentially far from optimal. Doing a
lookahead pass immediately before the main encoding would give it access to the
QP, but that model is not compatible with frame multithreading and it doesn't
solve the other problems.
The lookahead is very useful for rate control, frame type decision, scene cut
detection, AQ, and block flow, especially if the processing can be offloaded to
a GPU. Beyond that, at this time, I see no further use except very fast
encoding at sub-H.264 quality. Nevertheless, I will add hooks to control the
encoding using information from external sources to future-proof the design.
* Building blocks.
** Multi-level inter 16x16 search.
The inter 16x16 and 8x8 searches are always performed in the planned
implementation. The idea of the multi-level (ML) search is to do the 16x16
search while performing some of the processing of the 8x8 search, whose results
are then used for the 16x8 and 8x16 search. The ML search is final for the
16x16 block itself. A regular 8x8 search still needs to be done afterwards.
There are three goals:
- Speed up the upcoming 8x8 fullpel search by doing the brunt of the ME at the
16x16 level, so that the "diamond" algorithm can be used without losing much
quality at the 8x8 level.
- Find far positions that wouldn't be found with the "hexagon" algorithm at the
8x8 level.
- Potentially use the information to skip inter 8x8 at low quality.
The implementation works by computing the SAD of each 8x8 block in the 16x16
block. Conveniently, the AVX2 SAD function stores the SAD per 8 pixels, so we
end up with 4 sums for one 16x16 block. There is some moderate overhead in
updating the best cost and position of the 8x8 blocks (I would say around 30%
of the total cost).
The search function works entirely in assembly. The source is stored in 8 ymm
registers once. The high-level logic provides a list of MV positions to test.
The low-level logic loops on the MVs, computes the MV and distortion costs, and
returns the best position found. Potentially both PMVs can be tested conjointly
at each position to allow a greater degree of freedom in the search.
The initial list of positions to test includes a "cross" pattern to capture
fast camera panning movements and other fixed positions which are not
influenced by the best location found so far. Refinement is performed
afterwards with a strong ME algorithm like UMH or HM expanding/contracting
diamonds.
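A sketch of how the fixed initial candidate list might be generated. The
function name and the step/reach parameters are illustrative, not the tuned
values:

```c
#include <stdint.h>

typedef struct { int16_t x, y; } mv;

/* Generate the centre plus a "cross" of positions along both MV axes,
 * spaced 'step' apart up to 'reach' (quarter-pel units), to capture
 * fast panning. Returns the number of candidates written to 'out'. */
static int gen_cross_candidates(mv centre, int step, int reach, mv *out)
{
    int n = 0;
    out[n++] = centre;
    for (int d = step; d <= reach; d += step) {
        out[n++] = (mv){ (int16_t)(centre.x - d), centre.y };
        out[n++] = (mv){ (int16_t)(centre.x + d), centre.y };
        out[n++] = (mv){ centre.x, (int16_t)(centre.y - d) };
        out[n++] = (mv){ centre.x, (int16_t)(centre.y + d) };
    }
    return n;
}
```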
The 16x16 MV costs may be ignored entirely for the 8x8 blocks, or downscaled
with the expectation that some MVs are going to be correlated between the 8x8
blocks. What really matters is to find a good spot during the ML search if one
exists. The proper 8x8 search can correct inaccuracies.
The halfpel and quarterpel refinement could also be done with a variant of this
algorithm. I'm not sure if it makes sense. It's probably only useful to
estimate more reliably whether the 8x8 blocks are well correlated with the
16x16 block in order to skip the 8x8 search entirely.
** 11-base-mode intra search.
D M V M D H: horizontal.
M ....... V: vertical.
H ....... D: diagonals (3).
M ....... M: midpoints (4).
D ....... .: intra block.
There are 35 intra modes that can be selected. We test 11 of these modes
unconditionally in one assembly call: DC, planar, vertical, horizontal, 3
diagonals, 4 midpoints. There is one function per block size (from 4x4 to
32x32).
As an optimization, the midpoints do not correspond to any angle defined by the
spec. The spec angles are as follows (fractions of 32): 0, 2, 5, 9, 13, 17, 21,
26, 32. The closest legal angle is 17/32; we cheat and use 16/32. This allows
us to test the midpoint positions faster since a predicted pixel corresponds
either to a single neighbour or the average of two neighbours. This is at least
two times faster than doing the proper irregular weighting.
An optional refinement is done afterwards near the best angular position. The
refinement can be skipped if the DC or planar mode was selected. The refinement
algorithm depends on the block size.
For 4x4 and 8x8 blocks, unfiltered neighbours are always used for the irregular
angles. We simply define 9 assembly "dispatcher" functions to refine around the
best angular position. A dispatcher tests 2 or 3 positions on either side of
the best position. If that position is a midpoint, we discard its cost prior to
the refinement. The dispatcher can preload the source and the neighbours once,
which is more efficient than redoing it 6 times in the individual calls.
For 16x16 and 32x32 blocks, the prediction is more costly per-mode and the
logic is more complex since the neighbours can be filtered depending on the
angle. For these block sizes, we probably want to use our current breadth-first
algorithm with individual function calls from C.
For chroma, we need a 4-base-mode search function (DC, planar, horizontal,
vertical) for 4x4, 8x8 and 16x16 blocks.
We need 4 assembly functions to both pick up and filter the neighbours (from
4x4 to 32x32). These functions must consider the availability range, so they
are not trivial. Filtering is not needed for chroma but there's no point in
optimizing
for this. We probably want to drop support for non-bilinear filtering of 32x32
neighbours to simplify things.
* Exploration.
The exploration of the CTB starts at the 16x16 CB level. This is the
intermediate level. There are two block sizes below (4x4 and 8x8) and two block
sizes above (32x32 and 64x64).
Initially we try to shrink the zone quickly. We rule out improbable intra
blocks using the variance and we run the ML search. We potentially rule out
inter 8x8 based on a correlation criterion from the ML search at very low
quality. We may try an early inter skip at the 16x16 level, possibly by
checking if the neighbours skipped or by comparing the inter 16x16 SAD against
a QP-dependent threshold.
If both the inter 8x8 and the intra 8x8 blocks were eliminated, then we rule
out the sub 16x16 level, including inter 16x8 and 8x16. Otherwise, we perform a
preliminary search at the 8x8 CB level. For each 8x8 CB, we test intra 8x8 and
inter 8x8 (if not ruled out) in RDM mode. The test for intra 8x8 may be limited
to the 11-base function at low quality. The intra 8x8 refinement may be
conditional to intra 8x8 11-base being competitive with inter 8x8.
Intra 4x4 may be skipped or tested only if intra 8x8 is competitive/better than
inter 8x8. Possibly only the 11-base function may be used for each
unreconstructed 4x4 partition. In that case, 50% of the neighbours are
reconstructed pixels from the 8x8 boundary, and 50% of the neighbours are
source pixels. Inter 8x4/4x8 is never tested in the preliminary search since it
is both very expensive and very improbable.
After the 8x8 CB is analyzed, the best mode is optionally reconstructed to
improve intra prediction in the other 8x8 CBs. This can be skipped at very low
quality or if 8x8 intra was ruled out by the variance test. If skipped, 50% of
the reconstructed neighbours are still available from the 16x16 boundary. If we
do choose to reconstruct, then it's probably worthwhile to go the extra mile
and get the RDA cost of the 8x8 CB for a later comparison against the 16x16 RDA
cost. The reconstruction uses the 8x8 transform unless intra 4x4 was chosen. If
intra 4x4 was chosen, then either the modes previously found are used as-is, or
the full refinement is performed from scratch for each partition.
Back to the 16x16 CB level, inter 16x8 and 8x16 are tested unconditionally, or
if 8x8 inter RDM is better/competitive with inter 16x16 and if inter 8x8/16x16
is competitive with intra 8x8. The 16x8/8x16 partitions reuse the MVs found in
the 8x8 and 16x16 search and possibly reuse their reference indices too.
Intra 16x16 is tested unless intra 4x4 was selected for an intra 8x8 block, the
modes of the intra 8x8 blocks are decorrelated, or intra 8x8 is not competitive
with inter (if it was skipped by variance, then it's competitive). The intra
refinement may be conditional to the base-11 mode being competitive.
The AMP modes may then be tested based on criteria that we still need to
establish, e.g. if the related inter 16x8/8x16 mode was selected.
At this point, a final decision may be made between the 8x8 CBs and the 16x16
CB based on the RDM cost. At very low quality, we simply choose the mode having
the best RDM cost and we move on to the next 16x16 CB. At low quality, if the
16x16 RDM cost is not competitive with the 8x8 RDM cost, then we choose the 8x8
CBs and perform the final 8x8 exploration pass. That pass behaves much like the
early pass but it doesn't re-test the modes that were excluded in the early
pass. The final pass may test intra 4x4 and inter 8x4/4x8. The refinement logic
is similar to the logic described below for the 16x16 CB.
Otherwise, if the best 16x16 RDM cost is competitive with the 8x8 RDM cost,
then we use RDA to resolve the ambiguity. The early 8x8 RDA cost may be
unavailable because intra 8x8 was ruled out by variance and only inter 8x8 was
tested. In that case, we may rule out inter 8x8 directly if its RDM cost is
worse than that of the best 16x16 mode; otherwise, we obtain the early 8x8 RDA
cost without redoing the analysis. We test the 16x16 mode having the best RDM
cost in RDA. That cost is
the early 16x16 RDA cost.
If the early 16x16 RDA cost is not competitive with the early 8x8 RDA cost,
then we rule out the 16x16 CB and perform only the 8x8 refinement. Otherwise,
we refine opportunistically.
If the early 8x8 RDA cost is better (not just competitive, but actually better)
than the early 16x16 RDA cost, then we perform the 8x8 refinement. If the early
16x16 RDA cost is not competitive with the late 8x8 RDA/RDO cost, then we rule
out the 16x16 CB.
Otherwise, if the early 16x16 RDA cost is better than the early 8x8 CB RDA cost
or if it is competitive with the late 8x8 CB RDA cost, then we perform the
16x16 refinement. The 16x16 modes whose RDM cost is competitive with the best
RDM cost (having the 8x8 RDM cost is helpful here) are tested with RDA. After
the 16x16 RDA refinement, the final exploration of the 8x8 CBs is done if it
was not already done and the early 8x8 RDA cost is still competitive with the
late 16x16 RDA cost.
When the unsplit mode of a CB is selected, optionally the RDO refinement can be
done to optimize the transform tree. This can be optimized, e.g. if there is
only one mode to test under RDA, then it can be tested under RDO immediately.
The RDOQ can occur at this step. In fact, perhaps it doesn't make sense to use
RDO without RDOQ if RDA is as good as I hope it is. The RDO refinement can be
done on several of the best RDA modes at very high quality.
The 32x32 search is conditional to passing a correlation test on the 16x16 CBs
(I guess it's helpful even at high quality). If intra 16x16 is selected for all
four 16x16 CBs, then the modes of those CBs are tested (up to 4 modes, and only
these modes). If inter 16x16 is selected for all four 16x16 CBs, then the 32x32
search is performed, and possibly 16x8 and 8x16 too, following the same logic
as the 16x16 CB. The 64x64 CB follows the same logic as the 32x32 CB.
The algorithm described above focuses on early terminations at low to mid
quality. It's easy to tune the algorithm for higher quality: make the early
terminations less aggressive and put more time in the early 8x8 exploration
pass (or make it a final pass instead).