Hi,
included is a design document for the analysis code refactoring.
Laurent
Analysis code refactoring.
* Goals.
- Reduce the growing software debt in the analysis code.
- Make it easier to experiment with different algorithms.
- Improve the quality/performance trade-offs available to the user.
Optimizing the analysis is a difficult task. I'm missing critical information
about which methods work and which do not. But I cannot experiment easily in
the current code. Thus I need to refactor now based on some hypotheses.
* Theory.
** Errors.
The exploration strategy must balance between three types of errors:
- A: work leading to no useful results.
- B: work done twice because the environment has changed (dependencies,
metrics, etc).
- C: wrong decision due to imprecise costs.
A full search eliminates type B and C errors but maximizes type A errors. The
type A errors are unavoidable. Prediction can help to prune the search space,
but it can only find the solutions that get predicted. When the solution lies
elsewhere, the search algorithm must test additional modes beyond the predicted
modes, either unconditionally or when it detects that the prediction is wrong.
The optimal mode of a block depends heavily on the mode of the neighbour
blocks. The effect is more pronounced for smaller blocks. A preliminary
exploration of a block can yield a mode that differs from the mode found with a
more thorough exploration. The discrepancy between the two modes may cause type
C errors, but more significantly, the discrepancy propagates to the sibling
blocks, possibly in a positive feedback loop. Thus the modes obtained during a
preliminary exploration pass may be decorrelated with the modes found in the
final exploration pass. This problem can be reduced by performing mode
refinement alongside the exploration rather than after the exploration.
The intra mode exploration is sensitive to changes in the environment. Changing
the neighbour modes affect the signaling costs (intra mode + transform tree)
and the neighbour pixels. In the final exploration pass, the preliminary costs
can be discarded (type B error) or they can be used regardless (type C error).
The inter mode exploration is very sensitive to changes in the environment.
Changing the neighbour modes affect the predicted motion vectors, the merge
candidates and the transform tree. The results of the preliminary search seems
to be of limited value in the final exploration. The motion vectors can be
cached for each reference frame to speed up the fullpel search. Beyond that,
the search must be done again (type B error).
** Exploration.
Bottom-up seems to be a natural way to proceed for the exploration. If we start
at the bottom, we can avoid testing bigger blocks when the split cost is lower
than the unsplit cost. We don't know when to stop splitting if we explore
top-down. Also, a block only gets correct neighbour information if its
neighbours have been fully explored.
** RDM/RDO trade-offs.
The RDM costs are not well correlated with the RDO costs. The best mode found
with RDM is often different than the best mode found with RDO. The RDM costs
can be used to prune improbable modes before the RDO process, for example by
ignoring modes whose cost is greater than 4/3 of the best cost. When the best
RDM costs are close to each other, then only the RDO process can really tell
them apart. Experimentally, testing only a few of the best RDM modes result in
a significant quality penalty. RDO seems to matter more for H.265 than H.264.
RDO is quite slow and I could not find a way to cheat. The residual error
depends directly on the quantization error. For the quantization error to be
accurate, the DCT coefficients and the predicted pixels must also be accurate.
That doesn't leave a lot of room for fast approximations. Whether a coefficient
quantizes with a large or a small error is random. The average error can be
predicted but that doesn't help us. We care only about the actual results. The
experiments to estimate the transform tree layout using RDM have failed thus
far.
The codec needs to fit into different performance profiles. In the analysis,
the natural progression seems to be pure RDM, hybrid RDM-RDO, and pure RDO.
The pure RDM search can benefit a lot from the lookahead and fast multi-level
inter search. The benefit is marginal for the pure RDO search since
approximations are not desired and the RDO overhead dwarfs the other overheads.
** Small blocks are inefficient to process.
Small blocks like 8x8 and 4x4 are responsible for much of the quality of the
final encoding. Inter 8x8 is not selected very often, but it matters and the
only way to know whether 8x8 is useful seems to test it. Unfortunately, small
blocks are not efficient to process.
Indeed, it is hard to make use of all the available processor resources with
AVX2. A vector instruction can perform 8/16/32/64/128 integer operations during
the time its GPR counterpart processes one. Hence, we want to operate in vector
mode as much as possible.
A typical inter search is inefficient in terms of hardware utilization. With a
complete assembly implementation, much of the time is spent executing GPR
instructions:
- Merge candidate / MVP generation.
- Setting up the environment / initialization.
- Argument passing.
- MV cost computations and cost comparisons.
The effect becomes more severe as the block size decreases. As a rule of
thumbs, ProcessTime(4*NxN) >> ProcessTime(2Nx2N). There are two reasons why
this happens. First, the overhead of processing a block is mostly constant in
terms of GPR instructions, independently of its size. Second, the vector
instructions cannot operate efficiently on small block sizes because the load
operations do not fully load registers.
For SAD, if we reduce the block size to 16x16, then we need 4 loads per cycle
to load two registers fully. The hardware can deliver at most 2 loads per
cycle. Hence the hardware operates at 50% efficiency. SAD 8x8/4x4 operates at
25%/12.5% hardware efficiency.
* Implementation.
** Exploration strategy.
We use bottom-up CB processing. The split case is explored before the unsplit
case, except for 8x8 CBs.
The 16x16 CB is chosen as the sweet spot to balance between errors of type A, B
and C. Errors due to a preliminary exploration are not allowed to propagate
beyond a 16x16 boundary. This means that by the time the next 16x16 CB is
analyzed, all previous CBs have been fully analyzed, with RDO if enabled.
Backtracking to a prior CB is not allowed. Thus, a 16x16 CB and its first 8x8
CB are always explored with correct neighbour information. These properties
hold for the 32x32 and 64x64 CBs as well.
The optimal CB size is often 16x16 or below. Consequently those block sizes are
analyzed extensively by default. The exploration of a 32x32 CB may be
conditional to the 16x16 CBs all selecting an unsplit mode. This behavior may
be modified by information from the lookahead or a fast multi-level inter
search. In that case, the 16x16 CB or some of the sub-16x16 modes can be
disabled if they are deemed improbable.
The exploration of the current 16x16 CB and its four 8x8 CBs may happen in any
order and with multiple passes. Information flows both ways between the 16x16
CB and the 8x8 CBs.
The exploration of a 16x16 CB uses both the costs of the 8x8 CBs and the costs
of the 16x16 CB that have been obtained so far. Some of the 16x16 CB modes may
be tested prior to the 8x8 CB analysis and some after. The exploration of inter
16x8/8x16 may be skipped if inter 16x16 seems to be the better mode.
The four 8x8 CB children of a 16x16 CB may undergo preliminary and final
exploration. The preliminary exploration may skip intra 4x4 and inter 8x4/4x8,
avoid refining the 8x8 intra modes, skip intra altogether based on the intra
16x16 result, explore only a subset of the reference frames, and use faster
metrics.
The final exploration can be skipped if the 8x8 CBs are not deemed interesting.
The final exploration may reuse the data obtained from the preliminary
exploration, depending on the trade-off desired between errors of type B and C.
Some of the modes skipped during the preliminary exploration may be tested
depending on the relative intra/inter costs in the current 8x8 CB and the best
mode found so far in the 16x16 CB. If the best mode is currently a 16x16 mode,
then the sub-8x8 modes may be skipped, with the assumption that they are
improbable.
** RDO.
When RDO refinement is enabled, it is done after the RDM analysis in the final
exploration pass of a CB. It is hoped that doing RDO in one batch will reduce
the cache footprint for both code and data.
The RDO analysis requires the RDM data to be accurate. For non-8x8 CBs, the
neighbour information cannot change during the analysis so the RDM data is
never stale. For 8x8 CBs, the RDM data is current unless a previous 8x8 CB has
changed its selected mode. The implementation passes a flag indicating whether
the RDM data of the current 8x8 CB is current. If the data is not current, the
data is rebased against the new context by redoing the search entirely or in
part.
The RDO refinement may reject modes that are indeed improbable based on the
relative RDM costs. The thresholds are configurable to allow control over the
rejection rate.
** Intra RDO refinement.
The intra RDO refinement occurs per partition. A partition can be analyzed by
RDM to filter the modes prior to the RDO refinement. The best mode or all
likely modes can be kept. The intra search can also be done entirely in RDO.
For intra UN, there is only one partition so the RDM cost of the intra modes
are final when they are obtained in the same exploration pass.
For intra HV, there are dependencies, so the second partition may choose a
different mode in RDM after the first partition was analyzed in RDO. There are
several possible trade-offs. An initial RDM search can be done over the four
partitions to get the estimated cost of the HV mode, in the prospect of
skipping RDO refinement entirely if not competitive. The RDO refinement can be
limited to encoding the CB in RDO directly using the modes found in the initial
RDM search. The RDO refinement can also discard the initial RDM results and do
the search per partition similarly to the intra UN case.
** Intra rough mode decision for 64x64 CBs.
Rough mode decision is probably not a good idea for the 64x64 CB. Not all
neighbour pixels are reconstructed since there are 4 TBs. Also, all the
information required is probably best obtained through the 32x32 CBs.
One way to proceed is to ignore the 64x64 intra unless the four 32x32 CBs all
select the same intra mode. In that case the intra costs from the 32x32 CBs can
be precisely rebased for the 64x64 CB.
Another way to proceed is to do the RDO search on the modes selected by the
32x32 CBs (up to four different modes).
** Transform tree exploration and reconstruction.
The transform tree exploration is done by RDO. For a pure RDM pass, the
transform tree exploration occurs after the CB analysis is completed. In that
case, RDO is only used if there is ambiguity. There cannot be ambiguity if
tb-depth == 0 for both intra and inter. Otherwise the implementation must be
smart enough to detect precisely whether there is ambiguity to avoid losing
performance.
The former implementation does not save the reconstructed pixels during the RDO
process of an inter CB. Instead the inter CB is fully reconstructed after the
analysis if required. This has the advantage of not trashing the inter
prediction in the reconstructed frame (which can be a 64x64 block) while
exploring the transform tree. The pixels on the CB edges could be saved on the
stash to avoid the final reconstruction, but it's not clear that doing so is
beneficial.
There are up to 12 inter modes that may be reconstructed. Each inter mode must
save the boundary pixels multiple times in the transform tree exploration. The
operation cannot be vectorized for the vertical edge. The overhead adds up
quickly. The effort is wasted if split or intra is selected, or if RDOQ is run
only on the final CB transform tree.
Thus it is probably preferable to reconstruct inter TBs conditionally after the
CB analysis. However the algorithm should be improved. Only the TBs that touch
a CB edge and do not touch a CTB edge should be reconstructed.
For intra, in the former implementation, all the TB pixels are saved
unconditionally in the transform tree exploration. This is inefficient. Only
the boundary pixels should be saved.
The intra 4x4 mode is a special case requiring custom code to handle the intra
mode discovery and RDO process conjointly.
For 8x8, 16x16 and 32x32 intra TBs, it is often not necessary to save the
reconstructed pixels for the transform tree exploration. In general, quality is
not improved much by tb-depth > 1. When there are only two possible TT splits,
UN and HV, then both splits can be explored without saving the pixels. The
exception is a CB that allows the intra HV PB split e.g. a 16x16 CB when 8x8
CBs are disallowed. In that case, the edges of the current TB are used by the
next TBs. The boundary pixels must be saved unless there is only one possible
TT split for the TB.
Inter is chosen more often than intra on average. Therefore it seems preferable
to avoid saving the intra reconstructed pixels as much as possible during the
RDO process and reconstruct the intra CB after the analysis if it is chosen and
it is not the last CB of the CTB.
In RDM search, the intra prediction can use either the source pixels or the
reconstructed pixels of the neighbour blocks. When using the source pixels, it
is not necessary to reconstruct the CB or the intra partitions after the RDM
search. If the reconstructed pixels are desired, then for intra HV, the
partitions must be reconstructed (except the last) even in RDM mode.
The pixels on the CB edges are saved when the child CBs are analyzed and
restored if the children are selected after the analysis of the current CB.
** Interest flags and control data.
Both the RDM and the RDO analysis can operate standalone or in tandem. To unify
the code, both types of analysis assume that a prior pass has already occurred.
The exploration mechanisms are based on 3 elements:
- Interest flags stored in the CB structures (f265_cb).
- RDM costs stored in the analysis data of the local CB (f265_cb_analysis).
- Global control data (f265_analysis).
The modes that must be analyzed in a CB are specified in the CB interest flags.
Both the RDM and the RDO analysis query the interest flags prior to testing a
mode. A pure RDO search is done by setting all interest flags and exploring
every CB in order (exhaustive search).
In RDM, there are preliminary and final exploration phases. Beyond tracking
whether a mode is interesting, we must also track whether the mode has a cached
result. This is done through the RDM costs stored in f265_cb_analysis. If the
cost is set to the maximum, the mode was not tested yet. Otherwise the cached
result is available.
The global control data at the thread level specifies the exploration strategy.
That data can be modified on-the-fly, e.g. a high-level algorithm can tweak the
settings for a lower-level algorithm.
The analysis of a CB is performed by one "dumb" function (fenc_analyze_cb) that
is told which modes to test by the interest flags, the RDM costs and the
control data. The modes are tested mostly in a fixed order, with some hooks to
optionally change the order. The policy is implemented by helper functions that
are called before and after a mode is analyzed. Those helper functions are free
to implement whatever byzantine logic is desired. The resulting spaghetti does
not creep outside those functions.
The interest flags are cached in the frame so that their initial value can be
restored if the CTB or the frame is re-encoded. The lookahead and the
multi-level inter search may add information to the interest flags.
The lookahead can identify the probable luma modes for intra 8x8 and greater
blocks and add that information to the interest flags. For 4x4 intra blocks, it
doesn't seem useful to store the partition modes unless the lookahead
information is used as-is. The lookahead cannot predict the MPM set reliably
and the intra 4x4 costs depend heavily on the MPM set. However the lookahead
could note whether the intra 4x4 mode is worthy of investigation.
** Data structures.
struct f265_cb
{
...
// Bitfield of interesting modes to analyze, as follow.
// 0: CB children.
// 1-5: inter UN merge modes (5).
// 6-13: inter split modes in standard order (8).
// 14-15: intra PCM/HV modes.
// 16-50: intra UN modes (35).
uint64_t interest;
};
// Summary of the best modes of a CB. The information in this object is
// preserved for two levels of the CB hierarchy. A parent knows about its
// children but it doesn't know about the children of its children. This is
// done to reduce the cache footprint.
struct f265_cb_modes
{
// RDO cost of the best mode.
int64_t rdo_cost;
// RDM costs of the modes: best, intra, inter.
int rdm_costs[3];
// Best modes: RDM best/intra/inter, RDO. Modes are numbered according to
// the interest bitfield.
uint8_t best_modes[4];
};
// Local data of the current CB being analyzed. There is one instance of this
// object per level of the CB hierarchy. A child knows about its parent but a
// parent doesn't preserve the data of its children. This is done to reduce the
// cache footprint.
//
// The implementation allows the exploration of the children to occur at any
// point during the exploration of the current CB, with the caveat that only
// the CB modes of the immediate children are preserved.
struct f265_cb_analysis
{
// Local mode data.
// RDM mode costs, same order as the interest bitfield. Initialized to
// F265_MAX_SAD.
int mode_costs[51];
// Intra luma modes, 1 UN, 4 HV.
int8_t luma_modes[1+4];
// Intra chroma modes, 1 UN, 1 HV.
int8_t chroma_modes[1+1];
// UN merge candidates.
f265_inter_neighbour_mv un_cands[5];
// Motion vectors. The first index is the partitioning mode. The second is
// the partition number. The third is the reference list.
f265_mv mv[8][2][2];
// Reference indices. Same indices as above. The reference index is -1 if
// the list is unused.
int8_t ref_idx[8][2][2];
// Inter partition bitfields. The two indices are the same as above.
uint8_t inter_bf[8][2];
// Local bitstream element costs.
};
// Global data of the analysis.
struct f265_analysis
{
// Control flags go here.
};
struct f265_enc_thread
{
...
// Analysis data.
// Pointer to the CB modes.
// - cbm[0] corresponds to the current CB.
// - cbm[1-4] corresponds to the immediate children.
f265_cb_modes *cbm;
// Pointer to the current CB analysis object.
f265_cb_analysis *cba;
// Global data of the analysis.
f265_analysis an;
// Analysis stash.
f265_stash stash;
// Motion estimation context for the current block.
f265_me_ctx me;
// Data about the current inter block.
f265_inter_block ib;
// Array of summary/local CB data. Allocated in a stack-like fashion.
f265_cb_modes cbm_array[1+3*4];
f265_cb_analysis cba_array[4];
// YUV edges of the children of the current CB.
f265_pix child_cb_edges[3][64*2];
};
The data in f265_cb_analysis for a 8x8 CB is lost between the preliminary and
the final exploration passes. However most of the important information is
preserved in the other structures. The mode data of the 8x8 UN inter mode and
the best UN intra mode is preserved inside f265_cb. The interest bitfield
tracks the UN intra modes and the other modes that are interesting. The intra
and inter costs are preserved in f265_cb_modes.
** CB exploration pseudo-code.
fenc_analyze_cb():
- Initialize/restore costs.
- Keep track of the initial CB mode.
- Only for a 8x8 CB doing final RDO.
- Skipped if the previous 8x8 CBs are known to have changed.
- Use the interest bitfield to determine what happens if the previous 8x8 CBs
have changed.
- Option 1:
- Reset to the interest flags to retest everything.
- Option 2:
- Retest only the good modes found so far.
- For intra UN, the interest flags can be updated according to the pure
distortion costs without the mode cost in the preliminary pass.
- Stash init if RDO is enabled (CABAC/TN).
- Initial child search if enabled. This is either RDM or RDO.
- Inter UN RDM search. Control mask check and RDM cost check.
- Intra UN RDM search. Control mask check and RDM cost check.
- Early termination check for the preliminary search.
- Intra HV RDM search.
- Preliminary child search if enabled. This is RDM ony.
- Inter H2/V2 RDM search.
- Late termination check for the preliminary search.
- Inter AMP RDM search.
- Final child search if enabled. This is either RDM or RDO.
- Inter RDO refinement for each mode.
- Intra UN RDO search. Algorithm-dependent.
- Intra HV RDO search. Algorithm-dependent.
- Termination:
- Restore CABAC+TN if RDO.
- Reconstruct intra/inter as required.
- Restore child edges as required.
- Track whether the CB mode has changed for 8x8 CB.
Child exploration logic:
- Push a stash frame.
- t->cba++.
- For each child:
- t->cbm++;
- fenc_analyze_cb().
- t->cbm -= 4;
- t->cba--.
- Pop a stash frame.
- If the children have reconstructed the CB and it isn't the last CB of the
CTB:
- Save the CB boundary pixels.
- If this was a RDO analysis:
- Stash save (CABAC/TN) as needed.
- Stash reset.