On Sun, 09 Mar 2025 17:57:48 -0700 Rémi Denis-Courmont <r...@remlab.net> wrote:
>
>
> On 9 March 2025 12:57:47 GMT-07:00, Niklas Haas <ffm...@haasn.xyz> wrote:
> >On Sun, 09 Mar 2025 11:18:04 -0700 Rémi Denis-Courmont <r...@remlab.net> wrote:
> >> Hi,
> >>
> >> On 8 March 2025 14:53:42 GMT-08:00, Niklas Haas <ffm...@haasn.xyz> wrote:
> >> >https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt
> >>
> >> >I have spent the past week or so ironing out the details. I wanted to
> >> >post it here to gather some feedback on the approach. Where does it fall
> >> >on the "madness" scale? Is the new operations and optimizer design
> >> >comprehensible? Am I trying too hard to reinvent compilers? Are there any
> >> >platforms where the high number of function calls per frame would be
> >> >prohibitively expensive? What are the thoughts on the float-first
> >> >approach? See also the list of limitations and improvement ideas at the
> >> >bottom of my design document.
> >>
> >> Using floats internally may be fine if there's (almost) never any
> >> spillage, but that necessarily implies custom calling conventions. And it
> >> won't work with as many as 32 pixels. On RVV 128-bit, you'd have only 4
> >> vectors. On Arm NEON, it would be even worse, as scalars/constants need
> >> to be stored in vectors as well.
> >
> >I think that a custom calling convention is not as unreasonable as it may
> >sound, and will actually be easier to implement than the standard calling
> >convention, since functions will not have to deal with pixel load/store,
> >nor will there be any need for "fused" versions of operations (whose only
> >purpose is to avoid the roundtrip through L1).
> >
> >The pixel chunk size is easily changed; it is a compile-time constant and
> >there are no strict requirements on it. If RISC-V (or any other platform)
> >struggles with storing 32 floats in vector registers, we could go down to
> >16 (or even 8); the number 32 was merely chosen by benchmarking and not
> >through any careful design consideration.
>
> It can't be a compile-time constant on RVV, nor on SVE (if that's ever
> introduced), because they are scalable. I doubt that a compile-time
> constant will work well across all variants of x86 either, not that I'd
> know.

It's my understanding that on existing RVV implementations, the number of
cycles needed to execute an m4/m2 operation is roughly 4x/2x the cost of
an equivalent m1 operation.

If this continues to be the case, the underlying VLEN of the implementation
should not matter much, even with a compile-time constant chunk size, as long
as VLEN does not greatly exceed 512 bits.
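
For illustration, here is a rough sketch of what I mean (not actual swscale
code; SWS_CHUNK, the op name and the LMUL choice are placeholders), showing
how an op written against a fixed 32-float chunk strip-mines onto scalable
vectors via vsetvl:

    #include <riscv_vector.h>

    #define SWS_CHUNK 32   /* hypothetical compile-time chunk size */

    /* Scale a fixed-size chunk of float pixels by a constant. A real op
     * under the custom calling convention would keep the chunk in registers;
     * this standalone version loads/stores for clarity. */
    static void op_scale_chunk(float *dst, const float *src, float k)
    {
        size_t n = SWS_CHUNK;
        while (n > 0) {
            /* LMUL=4: one pass covers 16 floats at VLEN=128, all 32 at VLEN>=256 */
            size_t vl = __riscv_vsetvl_e32m4(n);
            vfloat32m4_t v = __riscv_vle32_v_f32m4(src, vl);
            v = __riscv_vfmul_vf_f32m4(v, k, vl);
            __riscv_vse32_v_f32m4(dst, v, vl);
            src += vl; dst += vl; n -= vl;
        }
    }

At VLEN=128 this runs two m4 passes and at VLEN=256 a single one; only once
VLEN exceeds the chunk's total width (32 x 32 = 1024 bits) does even an m1
register group go partially filled.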

That said, it's indeed possible that on RISC-V we may be better off with a
dynamic chunk size. I will hold off on that judgement until we have numbers.

> >Do you have access to anything with decent RVV F32 support that we could
> >use for testing? It's my understanding that existing RVV implementations
> >have been rather primitive.
>
> Float is quite okay on RVV. It is faster than integers on some lavc audio 
> loops already.
>
> That said, I only have access to TH-C908 (128-bit) and ST-X60 (256-bit), as
> before, and I haven't been contacted to get access to anything better. The
> X60 is used on FATE.

I saw that recent versions of both GCC and Clang are quite capable of
generating autovectorized RVV code, so maybe we could simply try the existing
C templates and see where the performance figures land.
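
To make that concrete, the kind of per-chunk C loop I mean is roughly the
following shape (a simplified sketch; SWS_CHUNK and the function name are
placeholders, not the actual template code), which recent GCC and Clang
should already vectorize for RVV with something like -O3 -march=rv64gcv:

    #define SWS_CHUNK 32   /* hypothetical compile-time chunk size */

    /* Apply a linear transform to one chunk of float pixels. */
    static void op_linear_c(float *dst, const float *src,
                            float scale, float offset)
    {
        for (int i = 0; i < SWS_CHUNK; i++)
            dst[i] = src[i] * scale + offset;
    }

The trip count is a small compile-time constant, there are no cross-iteration
dependencies, and everything is float, so the autovectorizer should reduce it
to a few vector multiply/add strips; that would give us performance numbers
on real RVV hardware without writing any assembly first.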
