Re: [FFmpeg-devel] [PATCH v4 1/2] swscale: rgb_to_yuv neon optimizations

Martin Storsjö Fri, 30 May 2025 02:07:58 -0700

On Fri, 30 May 2025, Dmitriy Kovalenko wrote:

If by "non-performant mobile" you mean small in-order cores, most of them can handle repeated
accumulations like these even faster if you sequence them so that all accumulations to one register
are done sequentially. E.g. first all "smlal \u_dst1\().4s", followed by all "smlal \u_dst2\().4s",
followed by \v_dst1, followed by \v_dst2. It's worth benchmarking if you do have access to such
cores (e.g. Cortex-A53/A55; perhaps that's also the case on the Cortex-R you mentioned in the
commit message).


I mean generally mobile-first CPUs. But I just verified even on a MacBook
Pro, interleaving instructions per component does not enable IRL


What does "does not enable IRL" mean?

but having a "hot register" being multiplied several times in parallel gives a difference. Here are checkasm results from a MacBook with my version and the version interleaved by r/g/b component

I'm sorry, but it is very hard to interpret what you're saying here; what is the first and second measurement?

In any case, now with this version of the patchset, which actually does compile and pass checkasm on Linux, I tested reordering rgb_to_uv_interleaved_product in the way I suggested, like this:


        smlal           \u_dst1\().4s
        smlal           \u_dst1\().4s
        smlal           \u_dst1\().4s
        smlal2          \u_dst2\().4s
        smlal2          \u_dst2\().4s
        smlal2          \u_dst2\().4s
        smlal           \v_dst1\().4s
        smlal           \v_dst1\().4s
        smlal           \v_dst1\().4s
        smlal2          \v_dst2\().4s
        smlal2          \v_dst2\().4s
        smlal2          \v_dst2\().4s

Such accumulation orders can sometimes give significant speedups on in-order cores like Cortex-A53 and A55. In this case it didn't make any difference, so there's no need to investigate it further.
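
As a rough illustration of why such a reordering is purely a scheduling question: each accumulator only sees integer multiply-adds, so the order in which the accumulations are issued does not change the result, and only benchmarks can tell the orderings apart. The sketch below is a scalar C model, not the code from the patch. The coefficient values, the helper names (mla_long, uv_acc, orders_match) and the exact "per component" order are assumptions made up for the example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical coefficients, chosen only for illustration. */
enum { RU = -1214, GU = -2384, BU = 3598, RV = 3598, GV = -3013, BV = -585 };

/* Four 32-bit accumulators per destination register, as with ".4s". */
typedef struct { int32_t u1[4], u2[4], v1[4], v2[4]; } uv_acc;

/* Scalar model of SMLAL (base = 0, low four lanes) and SMLAL2 (base = 4,
 * high four lanes) with a broadcast 16-bit coefficient. */
static void mla_long(int32_t acc[4], const int16_t src[8], int16_t coef, int base)
{
    for (int i = 0; i < 4; i++)
        acc[i] += (int32_t)coef * src[base + i];
}

/* Order A (assumed): step through r, g, b and touch every accumulator. */
static void accumulate_per_component(uv_acc *o, const int16_t r[8],
                                     const int16_t g[8], const int16_t b[8])
{
    mla_long(o->u1, r, RU, 0); mla_long(o->u2, r, RU, 4);
    mla_long(o->v1, r, RV, 0); mla_long(o->v2, r, RV, 4);
    mla_long(o->u1, g, GU, 0); mla_long(o->u2, g, GU, 4);
    mla_long(o->v1, g, GV, 0); mla_long(o->v2, g, GV, 4);
    mla_long(o->u1, b, BU, 0); mla_long(o->u2, b, BU, 4);
    mla_long(o->v1, b, BV, 0); mla_long(o->v2, b, BV, 4);
}

/* Order B: all accumulations into one destination back to back,
 * matching the u_dst1, u_dst2, v_dst1, v_dst2 sequence shown above. */
static void accumulate_per_register(uv_acc *o, const int16_t r[8],
                                    const int16_t g[8], const int16_t b[8])
{
    mla_long(o->u1, r, RU, 0); mla_long(o->u1, g, GU, 0); mla_long(o->u1, b, BU, 0);
    mla_long(o->u2, r, RU, 4); mla_long(o->u2, g, GU, 4); mla_long(o->u2, b, BU, 4);
    mla_long(o->v1, r, RV, 0); mla_long(o->v1, g, GV, 0); mla_long(o->v1, b, BV, 0);
    mla_long(o->v2, r, RV, 4); mla_long(o->v2, g, GV, 4); mla_long(o->v2, b, BV, 4);
}

/* Returns 1 if both orders produce identical accumulators. */
int orders_match(void)
{
    const int16_t r[8] = {0, 16, 64, 128, 200, 235, 255, 1};
    const int16_t g[8] = {255, 0, 33, 77, 128, 19, 240, 2};
    const int16_t b[8] = {7, 255, 128, 0, 64, 99, 180, 3};
    uv_acc x, y;

    memset(&x, 0, sizeof x);
    memset(&y, 0, sizeof y);
    accumulate_per_component(&x, r, g, b);
    accumulate_per_register(&y, r, g, b);
    return memcmp(&x, &y, sizeof x) == 0;
}
```

Calling orders_match() from a small main() is enough to check the equivalence. This mirrors what checkasm verifies for the real code, while the timing difference, if any, has to be measured on the target core.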

Does this make any practical difference, as we're just storing the lower 32 bits anyway?

Not really, but I found it quite confusing at first because it looks like
this instruction will imply narrowing, but looking at w13 / w13 it is
much clearer what is going on.

If it doesn't make any difference, then don't change it. The fewer changes in a patch, the easier it is to accept the patch. Especially if you are optimizing code, don't include unrelated changes in the same patch. If you feel strongly that it should be changed for readability/understandability reasons, then factor out that change to a separate patch.


// Martin
