On 2026/1/13 16:46, Andy Shevchenko wrote:
> On Tue, Jan 13, 2026 at 04:27:42PM +0800, Feng Jiang wrote:
>> Introduce a benchmark to compare the architecture-optimized strlen()
>> implementation against the generic C version (__generic_strlen).
>>
>> The benchmark uses a table-driven approach to evaluate performance
>> across different string lengths (short, medium, and long). It employs
>> ktime_get() for timing and get_random_bytes() followed by null-byte
>> filtering to generate test data that prevents early termination.
>>
>> This helps in quantifying the performance gains of architecture-specific
>> optimizations on various platforms.
> 
> ...
> 
>> +static void string_test_strlen_bench(struct kunit *test)
>> +{
>> +    char *buf;
>> +    size_t buf_len, iters;
>> +    ktime_t start, end;
>> +    u64 time_arch, time_generic;
>> +
>> +    buf_len = get_max_bench_len(bench_cases, ARRAY_SIZE(bench_cases)) + 1;
>> +
>> +    buf = kunit_kzalloc(test, buf_len, GFP_KERNEL);
>> +    KUNIT_ASSERT_NOT_ERR_OR_NULL(test, buf);
>> +
>> +    for (size_t i = 0; i < ARRAY_SIZE(bench_cases); i++) {
>> +            get_random_nonzero_bytes(buf, bench_cases[i].len);
>> +            buf[bench_cases[i].len] = '\0';
>> +
>> +            iters = bench_cases[i].iterations;
>> +
>> +            /* 1. Benchmark the architecture-optimized version */
>> +            start = ktime_get();
>> +            for (unsigned int j = 0; j < iters; j++) {
>> +                    OPTIMIZER_HIDE_VAR(buf);
>> +                    (void)strlen(buf);
> 
> First Q: Are you sure the compiler doesn't replace this with 
> __builtin_strlen() ?
> 
>> +            }
>> +            end = ktime_get();
>> +            time_arch = ktime_to_ns(ktime_sub(end, start));
>> +
>> +            /* 2. Benchmark the generic C version */
>> +            start = ktime_get();
>> +            for (unsigned int j = 0; j < iters; j++) {
>> +                    OPTIMIZER_HIDE_VAR(buf);
>> +                    (void)__generic_strlen(buf);
>> +            }
> 
> Are you sure the warmed up caches do not affect the benchmark? I think you
> need to flush / make caches dirty or so on each iteration.
> 
>> +            end = ktime_get();
>> +            time_generic = ktime_to_ns(ktime_sub(end, start));
>> +
>> +            string_bench_report(test, "strlen", &bench_cases[i],
>> +                            time_arch, time_generic);
>> +    }
>> +}
> 
> 

Thank you for the catch. You are absolutely correct: the 2500x figure is heavily
distorted and does not reflect real-world performance.

I've found that by using a volatile function pointer to call the implementations
(instead of direct calls), the results returned to a realistic range. It appears
the previous benchmark logic allowed the compiler to over-optimize the test loop
in ways that skewed the data.

I will refactor the benchmark logic in v3, following the crc32 KUnit
implementation (e.g., using warm-up loops and adding preempt_disable() to
eliminate context-switch interference), so that the data is robust and
accurate.

-- 
With Best Regards,
Feng Jiang

