On Wed, Oct 26, 2016 at 04:21:14PM +0200, Hendrik Leppkes wrote:
> On Wed, Oct 26, 2016 at 3:54 PM, Michael Niedermayer
> <mich...@niedermayer.cc> wrote:
> > On Tue, Oct 25, 2016 at 12:00:01AM +0200, Hendrik Leppkes wrote:
> >> On Mon, Oct 24, 2016 at 10:31 PM, Ronald S. Bultje <rsbul...@gmail.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > On Mon, Oct 24, 2016 at 4:26 PM, Henrik Gramner <hen...@gramner.com>
> >> > wrote:
> >> >
> >> >> On Mon, Oct 24, 2016 at 9:59 PM, Ronald S. Bultje <rsbul...@gmail.com>
> >> >> wrote:
> >> >> > Good idea to reference Henrik Gramner here, who keeps insisting we
> >> >> > get rid of all MMX code in ffmpeg (at least as an option) for
> >> >> > future Intel CPUs in which MMX will be deprecated.
> >> >>
> >> >> Replacing MMX with SSE2 is indeed the most "proper" fix in my opinion,
> >> >> but it's a fair amount of work and not done in an evening.
> >> >>
> >> >> The fact that a lot of assembly lacks unit tests is certainly not
> >> >> helping in that regard.
> >> >>
> >> >> Some MMX instructions are slower than the equivalent SSE2 code on
> >> >> Skylake. Intel hasn't officially commented on (as far as I know at
> >> >> least) if we should expect this trend to continue, but they certainly
> >> >> seem to treat MMX as legacy.
> >> >>
> >> >> I doubt they would completely remove support for it though, backwards
> >> >> compatibility is a big selling-point for x86.
> >> >
> >> >
> >> > Well, it gives us another way of fixing this issue (on x86-64 only): have
> >> > sse2 implementations for all code that has a mmx (register) path right
> >> > now.
> >> >
> >>
> >> I don't think the argument for pre-sse2 CPUs is that strong on 32-bit
> >> systems, either.
> >
> > SSE2 was initially not faster than MMX as CPUs implemented it as 2
> > MMX operations internally not having a full width SIMD unit for SSE*
> > so there would be a performance loss on some x86-32 CPUs if MMX was
> > replaced by "half-width SSE2" there
> >
>
> You can add "not caring about first-gen sse2 CPUs" to the list as

It's more like 3 or 4 generations than 1, according to the instruction
tables from Agner Fog:
Core 2 (Merom) seems to be the first that has partial full-width support;
shift/pack/unpack/shuffle are still faster as MMX.
PM, P4 and P4E all seem to run SSE* at half the speed of MMX.

> well, if you want. Those are way old as well.
> There is going to be a performance loss either way, except that emms
> slows it down everywhere, while using sse2 is likely to be fine on

A minor detail being that there is a factor of around ten thousand in the
speed loss between the 2 cases you compare (0.001% vs maybe 50%).

Dropping MMX will cause pre-SSE2 CPUs to be a lot slower, maybe half speed
overall or less, as they lose all SIMD optimizations. On older SSE2 CPUs
it's still going to be a hefty hit too.

Adding emms at a video frame or slice level, which is what the patches
posted do, pretty much has no real effect. But don't believe me, look at
the timings: the worst case I see in Agner's tables is 18 clock cycles,
which at 60 fps, 1 slice and a slow 100 MHz CPU is 0.001%. Even if there
are 100 times more emms (due to slice-level EMMS) it's still at the edge
of being hard to measure. Doing EMMS per function call is of course not
practical.

There's an additional penalty for the first float instruction after emms
on some CPUs, 58 clock cycles (on P4), but that's still just 0.003% in
the example above.

Anyway, I wanted to stay out of this and I'll do that, just wanted to
comment on the technical details.

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Those who are too smart to engage in politics are punished by being
governed by those who are dumber. -- Plato
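
The overhead percentages quoted in the message above can be reproduced with a short calculation. This is only a sketch of that arithmetic, not a measurement. The function name `emms_overhead_percent` is hypothetical, and it assumes one emms per frame (or per slice), with "100 times more emms" read as 100 slices per frame.

```python
def emms_overhead_percent(cycles_per_emms, emms_per_second, cpu_hz):
    """Return the share of CPU cycles spent on emms, in percent."""
    return 100.0 * cycles_per_emms * emms_per_second / cpu_hz


FPS = 60                  # frames per second, from the message
CPU_HZ = 100e6            # the "slow 100 MHz CPU" from the message
EMMS_CYCLES = 18          # worst-case emms cost quoted from Agner Fog's tables
FIRST_FLOAT_CYCLES = 58   # quoted P4 penalty for the first float op after emms
SLICES_PER_FRAME = 100    # assumption: "100 times more emms" as 100 slices

# One emms per frame.
per_frame = emms_overhead_percent(EMMS_CYCLES, FPS, CPU_HZ)

# One emms per slice, with the assumed 100 slices per frame.
per_slice = emms_overhead_percent(EMMS_CYCLES, FPS * SLICES_PER_FRAME, CPU_HZ)

# First-float-instruction penalty, paid once per frame.
float_penalty = emms_overhead_percent(FIRST_FLOAT_CYCLES, FPS, CPU_HZ)

print(f"emms per frame:      {per_frame:.5f} %")
print(f"emms per slice:      {per_slice:.5f} %")
print(f"first-float penalty: {float_penalty:.5f} %")
```

Under these assumptions the per-frame cost comes out at roughly 0.001%, the per-slice cost at roughly 0.1%, and the first-float penalty at roughly 0.0035%, which the message gives as 0.003%.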
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel