On Sun, 09 Mar 2025 17:57:48 -0700 Rémi Denis-Courmont <r...@remlab.net> wrote:
> On 9 March 2025 12:57:47 GMT-07:00, Niklas Haas <ffm...@haasn.xyz> wrote:
> >On Sun, 09 Mar 2025 11:18:04 -0700 Rémi Denis-Courmont <r...@remlab.net>
> >wrote:
> >> Hi,
> >>
> >> On 8 March 2025 14:53:42 GMT-08:00, Niklas Haas <ffm...@haasn.xyz> wrote:
> >> >https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt
> >>
> >> >I have spent the past week or so ironing
> >> >I wanted to post it here to gather some feedback on the approach. Where
> >> >does it fall on the "madness" scale? Is the new operations and optimizer
> >> >design comprehensible? Am I trying too hard to reinvent compilers? Are
> >> >there any platforms where the high number of function calls per frame
> >> >would be prohibitively expensive? What are the thoughts on the
> >> >float-first approach? See also the list of limitations and improvement
> >> >ideas at the bottom of my design document.
> >>
> >> Using floats internally may be fine if there's (almost) never any
> >> spillage, but that necessarily implies custom calling conventions. And it
> >> won't work with as many as 32 pixels. On RVV 128-bit, you'd have only 4
> >> vectors. On Arm NEON, it would be even worse, as scalars/constants need
> >> to be stored in vectors as well.
> >
> >I think that a custom calling convention is not as unreasonable as it may
> >sound, and will actually be easier to implement than the standard calling
> >convention, since functions will not have to deal with pixel load/store,
> >nor will there be any need for "fused" versions of operations (whose only
> >purpose is to avoid the roundtrip through L1).
> >
> >The pixel chunk size is easily changed; it is a compile-time constant and
> >there are no strict requirements on it. If RISC-V (or any other platform)
> >struggles with storing 32 floats in vector registers, we could go down to
> >16 (or even 8); the number 32 was merely chosen by benchmarking and not
> >through any careful design consideration.
>
> It can't be a compile-time constant on RVV nor (if it's ever introduced)
> SVE, because they are scalable. I doubt that a compile-time constant will
> work well across all variants of x86 either, but that's not something I'd
> know.
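For concreteness, the setup being discussed is roughly the following; this is
only an illustrative sketch with made-up names, not the actual swscale-v2
code:

    /* Illustrative sketch only; the names here are made up and do not
     * reflect the actual swscale-v2 code. Each operation processes a
     * fixed-size chunk of pixels, with the chunk size as a compile-time
     * constant that could be lowered (to 16 or even 8) on platforms with
     * fewer or narrower vector registers. */
    #define SWS_CHUNK_SIZE 32

    typedef struct SwsPixelChunk {
        float x[SWS_CHUNK_SIZE];   /* one array per pixel component */
        float y[SWS_CHUNK_SIZE];
        float z[SWS_CHUNK_SIZE];
        float w[SWS_CHUNK_SIZE];
    } SwsPixelChunk;

    /* With the standard ABI, every op has to load its inputs and store its
     * outputs, i.e. a roundtrip through L1 between consecutive ops. The
     * custom calling convention under discussion instead keeps the chunk
     * pinned in vector registers across the whole chain of ops, which is
     * also what makes dedicated "fused" op variants unnecessary. */
    typedef void (*SwsOpFunc)(SwsPixelChunk *chunk, const void *priv);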
It's my understanding that on existing RVV implementations, the number of
cycles needed to execute an m4/m2 operation is roughly 4x/2x the cost of an
equivalent m1 operation. If this continues to be the case, the underlying
VLEN of the implementation should not matter much, even with a compile-time
constant chunk size, as long as it does not greatly exceed 512.

That said, it's indeed possible that on RISC-V we may be better off with a
dynamic chunk size. I will hold off on that judgement until we have numbers.

> >Do you have access to anything with decent RVV F32 support that we could
> >use for testing? It's my understanding that existing RVV implementations
> >have been rather primitive.
>
> Float is quite okay on RVV. It is faster than integers on some lavc audio
> loops already.
>
> That said, I only have access to TH-C908 (128-bit) and ST-X60 (256-bit), as
> before, and I haven't been contacted to get access to anything better. The
> X60 is used on FATE.

I saw that recent versions of both GCC and clang are quite capable of
generating autovectorized RVV code, so maybe we could just give it a try to
see where the performance figures land with the existing C templates.
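To give an idea of the kind of code I mean, the loops in question are simple
per-pixel float loops of roughly the following shape; this is a stand-in
example, not the actual swscale template, but it is the sort of code recent
GCC/clang can autovectorize for RVV (e.g. with -O3 -march=rv64gcv):

    /* Stand-in example only, not the actual swscale C template: a trivial
     * per-pixel float loop that the compiler can autovectorize into
     * VLEN-agnostic RVV code. */
    static void op_scale_bias(float *restrict dst, const float *restrict src,
                              float scale, float bias, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * scale + bias;
    }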