Hi James,
Thank you for your review. I'd like to clarify the difference between the two
approaches:
**Clarification:**
My patch optimizes `ff_nal_find_startcode` in libavformat/nal.c, which is
different from the `ff_startcode_find_candidate` hook you mentioned under
libavcodec/h264dsp.c.
- `ff_startcode_find_candidate`: Returns the offset of the first zero byte; the
caller must still validate the full startcode
- `ff_nal_find_startcode`: Returns a pointer to a complete startcode (00 00 01);
used by the H.264 demuxer
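To make the difference between the two contracts concrete, here is a minimal scalar sketch. The names `find_candidate_ref` and `find_startcode_ref` are hypothetical and are not FFmpeg functions; they only model the behavior described in the two bullets above, and the real implementations may differ in details such as how the 4-byte form (00 00 00 01) is reported.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of the "candidate" contract: return the offset of
// the first zero byte, or size if there is none. The caller still has to check
// whether a full startcode begins at that offset.
static size_t find_candidate_ref(const uint8_t* buf, size_t size) {
    size_t i = 0;
    while (i < size && buf[i] != 0) i++;
    return i;
}

// Hypothetical scalar model of the "complete startcode" contract: return a
// pointer to the first 00 00 01 sequence in [p, end), or end if there is none.
static const uint8_t* find_startcode_ref(const uint8_t* p, const uint8_t* end) {
    assert(p <= end);
    for (; end - p >= 3; p++) {
        if (p[0] == 0 && p[1] == 0 && p[2] == 1) return p;
    }
    return end;
}
```

For the bytes `11 00 22 00 00 01 65`, the first model stops at offset 1 (an isolated zero that is not a startcode), while the second returns a pointer to offset 3.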
**Test Environment:**
- Platform: Raspberry Pi 5 (ARM Cortex-A76, AArch64)
- Compiler: GCC 14.2.0 with -O3 -march=armv8-a
- Test file: 1080p H.264 video, 22.88 MB
- Total NALU startcodes found: 1,224
**Test Methodology:**
I compared two approaches:
**Method 1 (baseline):** Use `ff_startcode_find_candidate` + C validation
(current FFmpeg approach)
```cpp
// Simplified pseudo-code
std::vector<size_t> find_all_startcode_positions(const uint8_t* data, size_t size) {
    std::vector<size_t> positions;
    size_t i = 0;
    while (i < size) {
        // Step 1: Fast search for zero byte
        int offset = ff_startcode_find_candidate(data + i, static_cast<int>(size - i));
        // No zero byte left in the remaining buffer
        if (offset < 0 || static_cast<size_t>(offset) >= size - i) break;
        i += offset;
        // Step 2: Validate if it's a complete startcode (00 00 01)
        if (i + 2 < size && data[i] == 0 && data[i+1] == 0) {
            if (data[i+2] == 1) {
                // 3-byte startcode: 00 00 01
                positions.push_back(i);
                i += 3;
                continue;
            } else if (i + 3 < size && data[i+2] == 0 && data[i+3] == 1) {
                // 4-byte startcode: 00 00 00 01
                positions.push_back(i);
                i += 4;
                continue;
            }
        }
        // False positive: step past this zero byte and search again
        i++;
    }
    return positions;
}
```
**Method 2 (NEON optimized):** Use `ff_nal_find_startcode_neon` directly
```cpp
std::vector<size_t> find_all_startcode_positions_neon(const uint8_t* data, size_t size) {
    std::vector<size_t> positions;
    const uint8_t* p = data;
    const uint8_t* end = data + size;
    while (p < end) {
        // Directly find complete startcode
        const uint8_t* start = ff_nal_find_startcode_neon(p, end);
        // Skip the zero bytes of the startcode; start then points at its 0x01 byte
        while (start < end && *start == 0) start++;
        if (start >= end) break;
        positions.push_back(start - data);
        // Resume the search from the 0x01 byte of the startcode just found
        p = start;
    }
    return positions;
}
```
**Performance Results (1000 iterations):**
- Method 1 (find zero + validate): 5,454,680 μs
- Method 2 (NEON direct search): 1,741,280 μs
- Speedup: 3.13x
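For reference, the totals above work out to roughly 5,455 μs versus 1,741 μs per pass over the file. The small sketch below reproduces that arithmetic; the constant and function names are hypothetical and are not part of the benchmark code.

```cpp
#include <cassert>
#include <cstdint>

// Totals reported above, in microseconds, for 1000 iterations each.
constexpr int64_t kIterations = 1000;
constexpr int64_t kBaselineTotalUs = 5454680;  // Method 1
constexpr int64_t kNeonTotalUs = 1741280;      // Method 2

// Average time of a single pass over the test file, in microseconds.
static double per_iteration_us(int64_t total_us, int64_t iterations) {
    assert(iterations > 0);
    return static_cast<double>(total_us) / static_cast<double>(iterations);
}

// Ratio of baseline time to optimized time.
static double speedup(int64_t baseline_us, int64_t optimized_us) {
    assert(optimized_us > 0);
    return static_cast<double>(baseline_us) / static_cast<double>(optimized_us);
}
```

Calling `speedup(kBaselineTotalUs, kNeonTotalUs)` gives a value between 3.13 and 3.14, which matches the figure quoted above.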
**Why this optimization is effective:**
The NEON version searches for the "00 00" pattern (two consecutive zero bytes)
instead of single zero bytes, so far fewer candidates reach the detailed check.
Test file analysis (22.88 MB 1080p H.264):
- Single zero bytes: 95,673 (98.1% false positive rate)
- Valid startcodes: 1,224
- With "00" pattern: Only 22.8% of 64-byte blocks need detailed checking
- 77.2% of blocks can be skipped entirely
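The block statistic can be illustrated with a portable scalar sketch. This is not the NEON code from the patch, and it is only one plausible way to measure such a figure: the helper `count_zero_pair_blocks` and the `BlockStats` struct are hypothetical, and the choice to attribute a pair that straddles a block boundary to the block holding its first byte is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct BlockStats {
    size_t total;    // number of blocks examined
    size_t flagged;  // blocks containing at least one 00 00 pair
};

// Hypothetical helper: classify fixed-size blocks by whether they contain two
// consecutive zero bytes. Only flagged blocks would need a detailed startcode
// check; all other blocks could be skipped.
static BlockStats count_zero_pair_blocks(const uint8_t* data, size_t size,
                                         size_t block_size = 64) {
    assert(block_size > 0);
    BlockStats stats = {0, 0};
    for (size_t base = 0; base < size; base += block_size) {
        size_t limit = base + block_size < size ? base + block_size : size;
        bool has_pair = false;
        for (size_t i = base; i < limit && !has_pair; i++) {
            // The second byte of the pair may lie in the next block.
            has_pair = data[i] == 0 && i + 1 < size && data[i + 1] == 0;
        }
        stats.total++;
        stats.flagged += has_pair ? 1 : 0;
    }
    return stats;
}
```

The fraction `flagged / total` then corresponds to the share of blocks that need detailed checking. A block with only isolated zero bytes is not flagged, which is where the saving over a single-zero search would come from.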
This optimization specifically improves H.264 demuxing performance on ARM
platforms.
Should I modify the commit message to better clarify this distinction?
Best regards,
He Zuoqiang
Original Message
From: Rémi Denis-Courmont via ffmpeg-devel <[email protected]>
Sent: January 13, 2026, 18:26
To: hezuoqiang--- via ffmpeg-devel <[email protected]>
Cc: Zuoqiang He <[email protected]>, Rémi Denis-Courmont
<[email protected]>
Subject: [FFmpeg-devel] Re: [PATCH] libavformat/nal: add ARM NEON optimization
for ff_nal_find_startcode
Nihao,
There already is a hook for this purpose under h264dsp, and it's already used on some other ISAs. So there should be no need to add a new one.
It's also probably faster to just look for a nul byte in assembler and let the C code manually check for the full 32-bit start code. This is basically just `strnlen()`.
Br,
_______________________________________________
ffmpeg-devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]