On Sunday, 7 January 2024, 3:33:39 EET, flow gg wrote:
> I tested it, and indeed using vwsub is faster. Updated it in the reply.
>
> ---
>
> I have a question: if I tweak the load order a bit, using one less vset, it
> leads to being slower (the patch I submitted is 13.2, if I make the
> following change, the time would be 15.2).
> But I thought it would be faster.

I would guess that the internal implementation of vwsub needs v0 before v8.
That would make sense, as the narrow elements still need to be sign-extended.
In the reordered version, vle8.v is issued immediately before vwsub.wv, so
vwsub ends up stalling the pipeline while it waits for vle8 to complete.
That's just a guess though, as I don't have internal cycle timing
documentation.

> -    vsetvli t0, a2, e8, m2, tu, ma
> -    vle8.v  v0, (a0)
> -    sub     a2, a2, t0
> -    vsetvli zero, t0, e16, m4, tu, ma
> -    vle16.v v8, (a1)
> -    vsetvli zero, t0, e8, m2, tu, ma
> -    vwsub.wv v16, v8, v0
>
> +    vsetvli t0, a2, e16, m4, tu, ma
> +    vle16.v v8, (a1)
> +    sub     a2, a2, t0
> +    vsetvli zero, t0, e8, m2, tu, ma
> +    vle8.v  v0, (a0)
> +    vwsub.wv v16, v8, v0

-- 
雷米‧德尼-库尔蒙
http://www.remlab.net/
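
For reference, the sign extension mentioned above can be illustrated with a scalar model of what `vwsub.wv v16, v8, v0` computes per element: the 16-bit element from v8 minus the sign-extended 8-bit element from v0, wrapped to 16 bits. This is only a sketch of the architectural result. The helper names `sign_extend` and `vwsub_wv` are made up for illustration, and the model says nothing about cycle timing or how any particular core orders its operand reads.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def vwsub_wv(wide, narrow, sew=8):
    """Hypothetical scalar model of vwsub.wv.

    Each result is wide[i] - sext(narrow[i]), wrapped to 2*SEW bits.
    `wide` holds the 2*SEW-bit operands (v8 in the thread) and `narrow`
    holds the SEW-bit operands (v0 in the thread).
    """
    return [
        sign_extend(w - sign_extend(n, sew), 2 * sew)
        for w, n in zip(wide, narrow)
    ]


if __name__ == "__main__":
    # 0xFF is -1 and 0x80 is -128 once sign-extended from 8 bits.
    print(vwsub_wv([100, -5, 0], [0xFF, 0x7F, 0x80]))
```

The narrow operand has to be widened before the subtraction can happen, which is the step the guess above relies on.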