On Tue, Nov 24, 2015 at 8:32 PM, Ganesh Ajjanagadde <gajja...@mit.edu> wrote: > On Tue, Nov 24, 2015 at 8:19 PM, Michael Niedermayer <michae...@gmx.at> wrote: >> On Mon, Nov 23, 2015 at 08:00:12PM -0500, Ganesh Ajjanagadde wrote: >>> This is a trivial rewrite of the loops that results in better cache >>> efficiency. Essentially, iterating backwards over an array is a bad >>> idea, since it leads to many cache misses: forward iteration fetches a >>> new cache line that gets subsequently used, while backwards iteration >>> fetches the >>> cache line thinking that the indexing would move forwards (the common >>> behavior), >>> resulting in many cache misses. >>> >>> Speedup is nontrivial. Benchmarks obtained by 10^6 iterations within >>> solve_lls, with START/STOP_TIMER. File is >>> tests/data/fate/flac-16-lpc-cholesky.err. >>> Hardware: x86-64, Haswell, GNU/Linux. >>> >>> new: >>> 12598 decicycles in solve_lls, 2096871 runs, 281 skips >>> 12393 decicycles in solve_lls, 4193889 runs, 415 skips >>> 12250 decicycles in solve_lls, 8387253 runs, 1355 skips >>> 12585 decicycles in solve_lls,16775089 runs, 2127 skips >>> 12785 decicycles in solve_lls,33550272 runs, 4160 skips >>> 12483 decicycles in solve_lls,67101734 runs, 7130 skips >>> 12610 decicycles in solve_lls,134202614 runs, 15114 skips >>> >>> old: >>> 18101 decicycles in solve_lls, 2096659 runs, 493 skips >>> 17850 decicycles in solve_lls, 4193609 runs, 695 skips >>> 17757 decicycles in solve_lls, 8387458 runs, 1150 skips >>> 17746 decicycles in solve_lls,16775207 runs, 2009 skips >>> 17906 decicycles in solve_lls,33550820 runs, 3612 skips >>> 17931 decicycles in solve_lls,67102891 runs, 5973 skips >>> 17907 decicycles in solve_lls,134206693 runs, 11035 skips >>> >>> ------------------------------------------------------------------------------- >>> Barring asm for this function, there are two (more interesting) potential >>> optimizations for the Cholesky decomposition here: >>> 1. Notice that update_lls is doing a rank one update of the matrix via the >>> var >>> vector. Rank one update of the Cholesky decomposition can be done much more >>> efficiently than a full blown decomposition, see e.g Wikipedia. This will >>> almost >>> certainly require some API thought. >>> >>> 2. LDL' form of the Cholesky factorization offers 2 main advantages: >>> a) Avoiding sqrt and its associated cost, trading it off with extra >>> multiplications. >>> This may or may not be worth the computational cost, though it seems like in >>> FFmpeg, the Cholesky operates on small matrices of dim ~ 3-50, resulting in >>> not too many extra multiplies. >>> b) It provides benefits especially in the poorly conditioned >>> case since one does not have to worry about nan's and associated tainting >>> due >>> to the sqrt. >>> This may or may not require API thought. >>> >>> I personally view 1 as being more unambiguously worthy of exploration at >>> this stage. >>> >>> Signed-off-by: Ganesh Ajjanagadde <gajjanaga...@gmail.com> >>> --- >>> libavutil/lls.c | 6 +++--- >>> 1 file changed, 3 insertions(+), 3 deletions(-) >> >> this significantly changes the output for >> libavutil/lls-test >> >> is that intended ? > > Wait, I thought there was only one line changed in a diff between the > old and new output, and it was trivial involving a nan. Maybe I posted > the wrong patch, in which case sorry.
I should have been suspicious of the huge perf change, there was an extra stray line from the old patch I was working with. Reposted as v2, updated perf numbers; it is far more modest as expected. Explanation also improved. > >> >> [...] >> -- >> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB >> >> I have often repented speaking, but never of holding my tongue. >> -- Xenocrates >> >> _______________________________________________ >> ffmpeg-devel mailing list >> ffmpeg-devel@ffmpeg.org >> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel >> _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel