Hi James,

Thank you for your review. I'd like to clarify the difference between the two 
approaches:

**Clarification:**

My patch optimizes `ff_nal_find_startcode` in libavformat/nal.c, which is 
different from the `ff_startcode_find_candidate` hook you mentioned under 
libavcodec/h264dsp.c.

- `ff_startcode_find_candidate`: Returns offset to first zero byte, requires 
upper layer validation
- `ff_nal_find_startcode`: Returns pointer to complete startcode (00 00 01), 
used by H.264 demuxer
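
To make the two contracts concrete, here is a minimal scalar sketch. The names `find_zero_candidate` and `find_startcode_scalar` are hypothetical stand-ins written for this email, not the FFmpeg implementations, and they only model the behaviour described in the two bullets above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a "candidate" search: returns the offset of the
// first zero byte, or size if there is none. The caller still has to check
// whether a startcode actually begins at that offset.
size_t find_zero_candidate(const uint8_t* buf, size_t size) {
    size_t i = 0;
    while (i < size && buf[i] != 0) i++;
    return i;
}

// Hypothetical stand-in for a "complete" search: returns a pointer to the
// first 00 00 01 sequence in [p, end), or end if there is none.
const uint8_t* find_startcode_scalar(const uint8_t* p, const uint8_t* end) {
    for (; end - p >= 3; p++) {
        if (p[0] == 0 && p[1] == 0 && p[2] == 1) return p;
    }
    return end;
}
```

With the first helper, every isolated zero byte in the payload sends the caller back into the validation code; with the second, the caller only sees positions that are already confirmed startcodes.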

**Test Environment:**
- Platform: Raspberry Pi 5 (ARM Cortex-A76, AArch64)
- Compiler: GCC 14.2.0 with -O3 -march=armv8-a
- Test file: 1080p H.264 video, 22.88 MB
- Total NALU startcodes found: 1,224

**Test Methodology:**

I compared two approaches:

**Method 1 (baseline):** Use `ff_startcode_find_candidate` + C validation 
(current FFmpeg approach)

```cpp
// Simplified pseudo-code
std::vector<size_t> find_all_startcode_positions(const uint8_t* data, size_t size) {
    std::vector<size_t> positions;
    size_t i = 0;

    while (i < size) {
        // Step 1: Fast search for zero byte
        int offset = ff_startcode_find_candidate(data + i, static_cast<int>(size - i));
        // Cast avoids a signed/unsigned comparison; offset == remaining size means "not found"
        if (static_cast<size_t>(offset) >= size - i) break;
        i += offset;

        // Step 2: Validate if it's a complete startcode (00 00 01)
        if (i + 2 < size && data[i] == 0 && data[i+1] == 0) {
            if (data[i+2] == 1) {
                positions.push_back(i);
                i += 3;
                continue;
            } else if (i + 3 < size && data[i+2] == 0 && data[i+3] == 1) {
                positions.push_back(i);
                i += 4;
                continue;
            }
        }
        i++;
    }
    return positions;
}
```

**Method 2 (NEON optimized):** Use `ff_nal_find_startcode_neon` directly

```cpp
std::vector<size_t> find_all_startcode_positions_neon(const uint8_t* data, size_t size) {
    std::vector<size_t> positions;
    const uint8_t* p = data;
    const uint8_t* end = data + size;

    while (p < end) {
        // Directly find complete startcode
        const uint8_t* start = ff_nal_find_startcode_neon(p, end);

        // Skip zero bytes before NALU header
        while (start < end && *start == 0) start++;
        if (start >= end) break;

        positions.push_back(start - data);
        p = start;
    }
    return positions;
}
```

**Performance Results (1000 iterations):**
- Method 1 (find zero + validate): 5,454,680 μs
- Method 2 (NEON direct search): 1,741,280 μs
- Speedup: 3.13x
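
For anyone who wants to reproduce the comparison, a timing loop along the following lines should be sufficient. This is only a sketch: `time_iterations_us` and `speedup` are hypothetical helper names, and the use of `std::chrono::steady_clock` is an assumption rather than a description of the harness behind the numbers above.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: runs fn() `iterations` times and returns the total
// elapsed wall-clock time in microseconds.
template <typename Fn>
int64_t time_iterations_us(Fn&& fn, int iterations) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++) fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

// Speedup of the optimized run relative to the baseline run.
double speedup(int64_t baseline_us, int64_t optimized_us) {
    return static_cast<double>(baseline_us) / static_cast<double>(optimized_us);
}
```

Plugging in the totals above, `speedup(5454680, 1741280)` gives roughly 3.13, which corresponds to about 5.45 ms versus 1.74 ms per pass over the file. When timing real code, the result of each `fn()` call should be consumed in some way so that the compiler cannot discard the work.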

**Why this optimization is effective:**

The NEON version looks for the "00" pattern (two consecutive zero bytes) rather 
than for single zero bytes:

**Test file analysis (22.88 MB 1080p H.264):**
- Single zero bytes: 95,673 (98.1% false positive rate)
- Valid startcodes: 1,224
- With "00" pattern: Only 22.8% of 64-byte blocks need detailed checking
- 77.2% of blocks can be skipped entirely
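
The block-level figures can be estimated with a simple scalar pass over the file. The sketch below is an assumption about how such a count could be done, not the exact tool used for the numbers above: `count_zero_pair_blocks` is a hypothetical helper, and a "00 00" pair that straddles a block boundary is attributed to the block holding its first byte.

```cpp
#include <cstddef>
#include <cstdint>

struct BlockStats {
    size_t total_blocks;      // number of 64-byte blocks examined
    size_t blocks_with_pair;  // blocks containing two consecutive zero bytes
};

// Hypothetical analysis helper: counts how many 64-byte blocks contain at
// least one "00 00" pair and would therefore need a detailed check.
BlockStats count_zero_pair_blocks(const uint8_t* data, size_t size) {
    constexpr size_t kBlock = 64;
    BlockStats stats{0, 0};
    for (size_t b = 0; b < size; b += kBlock) {
        const size_t block_end = (size - b < kBlock) ? size : b + kBlock;
        bool has_pair = false;
        for (size_t i = b; i < block_end && !has_pair; i++) {
            has_pair = data[i] == 0 && i + 1 < size && data[i + 1] == 0;
        }
        stats.total_blocks++;
        if (has_pair) stats.blocks_with_pair++;
    }
    return stats;
}
```

The ratio `blocks_with_pair / total_blocks` is the share of blocks that cannot be skipped; a block that only contains isolated zero bytes does not count.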

This optimization specifically improves H.264 demuxing performance on ARM 
platforms.

Should I modify the commit message to better clarify this distinction?

Best regards,
He Zuoqiang





Original message

From: Rémi Denis-Courmont via ffmpeg-devel <[email protected]>
Sent: January 13, 2026, 18:26
To: hezuoqiang--- via ffmpeg-devel <[email protected]>
Cc: Zuoqiang He <[email protected]>, Rémi Denis-Courmont <[email protected]>
Subject: [FFmpeg-devel] Re: [PATCH] libavformat/nal: add ARM NEON optimization for ff_nal_find_startcode



Nihao,

There already is a hook for this purpose under h264dsp, and it's already used on some other ISAs. So there should be no need to add a new one.

It's also probably faster to just look for a nul byte in assembler and let the C code manually check for the full 32-bit start code. This is basically just `strnlen()`.

Br,
_______________________________________________
ffmpeg-devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]