On 18/06/14 07:52, Vittorio Giovara wrote: > On Tue, Apr 22, 2014 at 8:52 PM, Ben Avison <[email protected]> wrote: >> The previous implementation of the parser made four passes over each input >> buffer (reduced to two if the container format already guaranteed the input >> buffer corresponded to frames, such as with MKV). But these buffers are >> often 200K in size, certainly enough to flush the data out of L1 cache, and >> for many CPUs, all the way out to main memory. The passes were: >> >> 1) locate frame boundaries (not needed for MKV etc) >> 2) copy the data into a contiguous block (not needed for MKV etc) >> 3) locate the start codes within each frame >> 4) unescape the data between start codes >> >> After this, the unescaped data was parsed to extract certain header fields, >> but because the unescape operation was so large, this was usually also >> effectively operating on uncached memory. Most of the unescaped data was >> simply thrown away and never processed further. Only step 2 - because it >> used memcpy - was using prefetch, making things even worse. >> >> This patch reorganises these steps so that, aside from the copying, the >> operations are performed in parallel, maximising cache utilisation. No more >> than the worst-case number of bytes needed for header parsing is unescaped. >> Most of the data is, in practice, only read in order to search for a start >> code, for which optimised implementations already existed in the H264 codec >> (notably the ARM version uses prefetch, so we end up doing both remaining >> passes at maximum speed). For MKV files, we know when we've found the last >> start code of interest in a given frame, so we are able to avoid doing even >> that one remaining pass for most of the buffer. >> >> In some use-cases (such as the Raspberry Pi) video decode is handled by the >> GPU, but the entire elementary stream is still fed through the parser to >> pick out certain elements of the header which are necessary to manage the >> decode process. As you might expect, in these cases, the performance of the >> parser is significant. >> >> To measure parser performance, I used the same VC-1 elementary stream in >> either an MPEG-2 transport stream or a MKV file, and fed it through avconv >> with -c:v copy -c:a copy -f null. These are the gperftools counts for >> those streams, both filtered to only include vc1_parse() and its callees, >> and unfiltered (to include the whole binary). Lower numbers are better: >> >> Before After >> File Filtered Mean StdDev Mean StdDev Confidence Change >> M2TS No 861.7 8.2 650.5 8.1 100.0% +32.5% >> MKV No 868.9 7.4 731.7 9.0 100.0% +18.8% >> M2TS Yes 250.0 11.2 27.2 3.4 100.0% +817.9% >> MKV Yes 149.0 12.8 1.7 0.8 100.0% +8526.3% >> >> Yes, that last case shows vc1_parse() running 86 times faster! The M2TS >> case does show a larger absolute improvement though, since it was worse >> to begin with. >> >> This patch has been tested with the FATE suite (albeit on x86 for speed). >> --- >> libavcodec/vc1_parser.c | 276 >> ++++++++++++++++++++++++++++++----------------- >> 1 files changed, 175 insertions(+), 101 deletions(-) >> >> diff --git a/libavcodec/vc1_parser.c b/libavcodec/vc1_parser.c >> index 1bedd98..713ffff 100644 >> --- a/libavcodec/vc1_parser.c >> +++ b/libavcodec/vc1_parser.c >> @@ -30,117 +30,84 @@ >> #include "vc1.h" >> #include "get_bits.h" >> >> +/** The maximum number of bytes of a sequence, entry point or >> + * frame header whose values we pay any attention to */ >> +#define UNESCAPED_THRESHOLD 37 >> + >> +/** The maximum number of bytes of a sequence, entry point or >> + * frame header which must be valid memory (because they are >> + * used to update the bitstream cache in skip_bits() calls) >> + */ >> +#define UNESCAPED_LIMIT 144 >> + >> +typedef enum { >> + NO_MATCH, >> + ONE_ZERO, >> + TWO_ZEROS, >> + ONE >> +} VC1ParseSearchState; >> + >> typedef struct { >> ParseContext pc; >> VC1Context v; >> + uint8_t prev_start_code; >> + size_t bytes_to_skip; >> + uint8_t unesc_buffer[UNESCAPED_LIMIT]; >> + size_t unesc_index; >> + VC1ParseSearchState search_state; >> } VC1ParseContext; >> >> -static void vc1_extract_headers(AVCodecParserContext *s, AVCodecContext >> *avctx, >> - const uint8_t *buf, int buf_size) >> +static void vc1_extract_header(AVCodecParserContext *s, AVCodecContext >> *avctx, >> + const uint8_t *buf, int buf_size) >> { >> + /* Parse the header we just finished unescaping */ >> VC1ParseContext *vpc = s->priv_data; >> GetBitContext gb; >> - const uint8_t *start, *end, *next; >> - uint8_t *buf2 = av_mallocz(buf_size + FF_INPUT_BUFFER_PADDING_SIZE); >> - >> vpc->v.s.avctx = avctx; >> vpc->v.parse_only = 1; >> - next = buf; >> - s->repeat_pict = 0; >> - >> - for(start = buf, end = buf + buf_size; next < end; start = next){ >> - int buf2_size, size; >> - >> - next = find_next_marker(start + 4, end); >> - size = next - start - 4; >> - buf2_size = vc1_unescape_buffer(start + 4, size, buf2); >> - init_get_bits(&gb, buf2, buf2_size * 8); >> - if(size <= 0) continue; >> - switch(AV_RB32(start)){ >> - case VC1_CODE_SEQHDR: >> - ff_vc1_decode_sequence_header(avctx, &vpc->v, &gb); >> - break; >> - case VC1_CODE_ENTRYPOINT: >> - ff_vc1_decode_entry_point(avctx, &vpc->v, &gb); >> - break; >> - case VC1_CODE_FRAME: >> - if(vpc->v.profile < PROFILE_ADVANCED) >> - ff_vc1_parse_frame_header (&vpc->v, &gb); >> - else >> - ff_vc1_parse_frame_header_adv(&vpc->v, &gb); >> - >> - /* keep AV_PICTURE_TYPE_BI internal to VC1 */ >> - if (vpc->v.s.pict_type == AV_PICTURE_TYPE_BI) >> - s->pict_type = AV_PICTURE_TYPE_B; >> - else >> - s->pict_type = vpc->v.s.pict_type; >> - >> - if (avctx->ticks_per_frame > 1){ >> - // process pulldown flags >> - s->repeat_pict = 1; >> - // Pulldown flags are only valid when 'broadcast' has been >> set. >> - // So ticks_per_frame will be 2 >> - if (vpc->v.rff){ >> - // repeat field >> - s->repeat_pict = 2; >> - }else if (vpc->v.rptfrm){ >> - // repeat frames >> - s->repeat_pict = vpc->v.rptfrm * 2 + 1; >> - } >> - } >> - >> - if (vpc->v.broadcast && vpc->v.interlace && !vpc->v.psf) >> - s->field_order = vpc->v.tff ? AV_FIELD_TT : AV_FIELD_BB; >> - else >> - s->field_order = AV_FIELD_PROGRESSIVE; >> - >> - break; >> - } >> - } >> + init_get_bits(&gb, buf, buf_size * 8); >> + switch (vpc->prev_start_code) { >> + case VC1_CODE_SEQHDR & 0xFF: >> + ff_vc1_decode_sequence_header(avctx, &vpc->v, &gb); >> + break; >> + case VC1_CODE_ENTRYPOINT & 0xFF: >> + ff_vc1_decode_entry_point(avctx, &vpc->v, &gb); >> + break; >> + case VC1_CODE_FRAME & 0xFF: >> + if(vpc->v.profile < PROFILE_ADVANCED) >> + ff_vc1_parse_frame_header (&vpc->v, &gb); >> + else >> + ff_vc1_parse_frame_header_adv(&vpc->v, &gb); >> >> - av_free(buf2); >> -} >> + /* keep AV_PICTURE_TYPE_BI internal to VC1 */ >> + if (vpc->v.s.pict_type == AV_PICTURE_TYPE_BI) >> + s->pict_type = AV_PICTURE_TYPE_B; >> + else >> + s->pict_type = vpc->v.s.pict_type; >> >> -/** >> - * Find the end of the current frame in the bitstream. >> - * @return the position of the first byte of the next frame, or -1 >> - */ >> -static int vc1_find_frame_end(ParseContext *pc, const uint8_t *buf, >> - int buf_size) { >> - int pic_found, i; >> - uint32_t state; >> - >> - pic_found= pc->frame_start_found; >> - state= pc->state; >> - >> - i=0; >> - if(!pic_found){ >> - for(i=0; i<buf_size; i++){ >> - state= (state<<8) | buf[i]; >> - if(state == VC1_CODE_FRAME || state == VC1_CODE_FIELD){ >> - i++; >> - pic_found=1; >> - break; >> + if (avctx->ticks_per_frame > 1){ >> + // process pulldown flags >> + s->repeat_pict = 1; >> + // Pulldown flags are only valid when 'broadcast' has been set. >> + // So ticks_per_frame will be 2 >> + if (vpc->v.rff){ >> + // repeat field >> + s->repeat_pict = 2; >> + }else if (vpc->v.rptfrm){ >> + // repeat frames >> + s->repeat_pict = vpc->v.rptfrm * 2 + 1; >> } >> + }else{ >> + s->repeat_pict = 0; >> } >> - } >> >> - if(pic_found){ >> - /* EOF considered as end of frame */ >> - if (buf_size == 0) >> - return 0; >> - for(; i<buf_size; i++){ >> - state= (state<<8) | buf[i]; >> - if(IS_MARKER(state) && state != VC1_CODE_FIELD && state != >> VC1_CODE_SLICE){ >> - pc->frame_start_found=0; >> - pc->state=-1; >> - return i-3; >> - } >> - } >> + if (vpc->v.broadcast && vpc->v.interlace && !vpc->v.psf) >> + s->field_order = vpc->v.tff ? AV_FIELD_TT : AV_FIELD_BB; >> + else >> + s->field_order = AV_FIELD_PROGRESSIVE; >> + >> + break; >> } >> - pc->frame_start_found= pic_found; >> - pc->state= state; >> - return END_NOT_FOUND; >> } >> >> static int vc1_parse(AVCodecParserContext *s, >> @@ -148,22 +115,125 @@ static int vc1_parse(AVCodecParserContext *s, >> const uint8_t **poutbuf, int *poutbuf_size, >> const uint8_t *buf, int buf_size) >> { >> + /* Here we do the searching for frame boundaries and headers at >> + * the same time. Only a minimal amount at the start of each >> + * header is unescaped. */ >> VC1ParseContext *vpc = s->priv_data; >> - int next; >> + int pic_found = vpc->pc.frame_start_found; >> + uint8_t *unesc_buffer = vpc->unesc_buffer; >> + size_t unesc_index = vpc->unesc_index; >> + VC1ParseSearchState search_state = vpc->search_state; >> + int next = END_NOT_FOUND; >> + int i = vpc->bytes_to_skip; >> + >> + if (pic_found && buf_size == 0) { >> + /* EOF considered as end of frame */ >> + memset(unesc_buffer + unesc_index, 0, UNESCAPED_THRESHOLD - >> unesc_index); >> + vc1_extract_header(s, avctx, unesc_buffer, unesc_index); >> + next = 0; >> + } >> + while (i < buf_size) { >> + int start_code_found = 0; >> + uint8_t b; >> + while (i < buf_size && unesc_index < UNESCAPED_THRESHOLD) { >> + b = buf[i++]; >> + unesc_buffer[unesc_index++] = b; >> + if (search_state <= ONE_ZERO) >> + search_state = b ? NO_MATCH : search_state + 1; >> + else if (search_state == TWO_ZEROS) { >> + if (b == 1) >> + search_state = ONE; >> + else if (b > 1) { >> + if (b == 3) >> + unesc_index--; // swallow emulation prevention byte >> + search_state = NO_MATCH; >> + } >> + } >> + else { // search_state == ONE >> + // Header unescaping terminates early due to detection of >> next start code >> + search_state = NO_MATCH; >> + start_code_found = 1; >> + break; >> + } >> + } >> + if ((s->flags & PARSER_FLAG_COMPLETE_FRAMES) && >> + unesc_index >= UNESCAPED_THRESHOLD && >> + vpc->prev_start_code == (VC1_CODE_FRAME & 0xFF)) >> + { >> + // No need to keep scanning the rest of the buffer for >> + // start codes if we know it contains a complete frame and >> + // we've already unescaped all we need of the frame header >> + vc1_extract_header(s, avctx, unesc_buffer, unesc_index); >> + break; >> + } >> + if (unesc_index >= UNESCAPED_THRESHOLD && !start_code_found) { >> + while (i < buf_size) { >> + if (search_state == NO_MATCH) { >> + i += vpc->v.vc1dsp.vc1_find_start_code_candidate(buf + >> i, buf_size - i); >> + if (i < buf_size) { >> + search_state = ONE_ZERO; >> + } >> + i++; >> + } else { >> + b = buf[i++]; >> + if (search_state == ONE_ZERO) >> + search_state = b ? NO_MATCH : TWO_ZEROS; >> + else if (search_state == TWO_ZEROS) { >> + if (b >= 1) >> + search_state = b == 1 ? ONE : NO_MATCH; >> + } >> + else { // search_state == ONE >> + search_state = NO_MATCH; >> + start_code_found = 1; >> + break; >> + } >> + } >> + } >> + } >> + if (start_code_found) { >> + vc1_extract_header(s, avctx, unesc_buffer, unesc_index); >> + >> + vpc->prev_start_code = b; >> + unesc_index = 0; >> + >> + if (!(s->flags & PARSER_FLAG_COMPLETE_FRAMES)) { >> + if (!pic_found && (b == (VC1_CODE_FRAME & 0xFF) || b == >> (VC1_CODE_FIELD & 0xFF))) { >> + pic_found = 1; >> + } >> + else if (pic_found && b != (VC1_CODE_FIELD & 0xFF) && b != >> (VC1_CODE_SLICE & 0xFF)) { >> + next = i - 4; >> + pic_found = b == (VC1_CODE_FRAME & 0xFF); >> + break; >> + } >> + } >> + } >> + } >> >> - if(s->flags & PARSER_FLAG_COMPLETE_FRAMES){ >> - next= buf_size; >> - }else{ >> - next= vc1_find_frame_end(&vpc->pc, buf, buf_size); >> + vpc->pc.frame_start_found = pic_found; >> + vpc->unesc_index = unesc_index; >> + vpc->search_state = search_state; >> >> + if (s->flags & PARSER_FLAG_COMPLETE_FRAMES) { >> + next = buf_size; >> + } else { >> if (ff_combine_frame(&vpc->pc, next, &buf, &buf_size) < 0) { >> + vpc->bytes_to_skip = 0; >> *poutbuf = NULL; >> *poutbuf_size = 0; >> return buf_size; >> } >> } >> >> - vc1_extract_headers(s, avctx, buf, buf_size); >> + /* If we return with a valid pointer to a combined frame buffer >> + * then on the next call then we'll have been unhelpfully rewound >> + * by up to 4 bytes (depending upon whether the start code >> + * overlapped the input buffer, and if so by how much). We don't >> + * want this: it will either cause spurious second detections of >> + * the start code we've already seen, or cause extra bytes to be >> + * inserted at the start of the unescaped buffer. */ >> + vpc->bytes_to_skip = 4; >> + if (next < 0) >> + vpc->bytes_to_skip += next; >> >> *poutbuf = buf; >> *poutbuf_size = buf_size; >> @@ -194,6 +264,10 @@ static av_cold int vc1_parse_init(AVCodecParserContext >> *s) >> { >> VC1ParseContext *vpc = s->priv_data; >> vpc->v.s.slice_context_count = 1; >> + vpc->prev_start_code = 0; >> + vpc->bytes_to_skip = 0; >> + vpc->unesc_index = 0; >> + vpc->search_state = NO_MATCH; >> return ff_vc1_init_common(&vpc->v); >> } > > So since the results are good and fate passes, can we merge it? >
probably. lu _______________________________________________ libav-devel mailing list [email protected] https://lists.libav.org/mailman/listinfo/libav-devel
