On Fri, Nov 6, 2015 at 12:47 PM, Yichao Yu <[email protected]> wrote:
>
> On Fri, Nov 6, 2015 at 12:32 PM, Lionel du Peloux <[email protected]> wrote:
>
>> Yichao, thank you for this meaningful answer.
>>
>> I understand points 1-4 to improve my coding.
>>
>> - I’ve redrawn a single graph with your 2 benchmarking methods.
>> - I’ve added a broadcast! version of sqrt, and two versions from the
>>   MKL/VML library (with VML.jl).
>> - I’ve finally added my first custom benchmark function to the plot (in
>>   black).
>>
>> => My custom benchmark function is clearly out of scope for small n and
>> it seems to come from point 5.
>> => With your method you’re also measuring the inner for loops: is the
>> cost of this loop negligible compared to the cost of sqrt?
>
> Well, I would imagine that the cost of a loop is much smaller than the
> cost of measuring the time. On the machine where I ran the benchmark, the
> overhead of calling an empty non-inlined function in a loop is ~1.17ns, and
> it is certainly negligible compared to the GC allocation cost (which is
> ~2-3ns per 64 bits).
>
>> => On my machine, I get quite different results: allocation is 2.5x
>> faster than sqrt and there is still a huge loss of performance for n<1e2.
>
> Which LLVM version are you using? IIRC we are not using the sqrt intrinsic
> on LLVM 3.3 (the default one). I'm using LLVM 3.7 and the sqrt function on
> my machine is using the `vsqrtsd` instruction rather than calling the libm
> function, and this makes a big difference. (It doesn't seem to be
> vectorized (SIMD) and I'm not sure why.)

Actually I think we are using the intrinsic on LLVM 3.3 and it is generating a single instruction for this operation, so please ignore this point.

>> => Using broadcast!, allocation should not be part of the measurement,
>> right? But there’s still a gap in performance ...
>
> There's also the cost of the anonymous function. I'm not sure how you wrote
> the broadcast! version, and this could be a problem.
>
>> So, do you think your explanation for point 6 is valid?
>
> I think it is certainly valid on my machine. Not sure about other setups.
>
>> Is it just a matter of measuring, or is the performance of vectorized
>> operations penalized (by what?) for small n?
>>
>> Thanks,
>> Lionel
>>
>> Note: I’m going to implement a nonlinear solver which deals with about
>> 100 beam elements, and each element has about 10 to 100 nodes.
>> I want to evaluate what could be the impact of modeling my problem with
>> one big DOF vector (1e3 to 1e4 nodes) versus a nested vector (a vector of
>> 100 vectors, each of 10 to 100 elements).
>>
>> <https://lh3.googleusercontent.com/-SY5BcG0XvaQ/Vjzj5-tT8cI/AAAAAAAAEgs/dl1Rr34VcbA/s1600/sqrt_yichao.png>
>>
>> <https://lh3.googleusercontent.com/-HJDgvDskYWo/Vjzjs3D7cHI/AAAAAAAAEgk/Ha-eqUsA3r8/s1600/sqrt_bench.png>
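
For reference, the measurement issue discussed in this thread (timer overhead dominating for small n, and allocation being part of what is timed) can be illustrated with a minimal sketch. This is not the code that either poster ran. It is written in current Julia syntax rather than the 2015 release, and the names `time_per_call`, `sqrt_alloc`, and `sqrt_inplace!` are hypothetical helpers introduced only for illustration. The idea is to repeat each call many times between two timer reads, so the cost of reading the clock is amortized, and to compare an allocating sqrt with an in-place `broadcast!` into a preallocated buffer.

```julia
# Hypothetical helper: average seconds per call of f(args...).
# The call is repeated `reps` times between two timer reads so that
# the cost of reading the clock is spread over many calls.
function time_per_call(f, args...; reps::Int = 10_000)
    f(args...)                      # warm-up call so compilation is not timed
    t0 = time_ns()
    for _ in 1:reps
        f(args...)
    end
    return (time_ns() - t0) / reps / 1e9
end

# Allocating version: creates a new output array on every call.
sqrt_alloc(x) = map(sqrt, x)

# In-place version: writes into a preallocated buffer, so no output
# array is allocated inside the timed region.
sqrt_inplace!(out, x) = broadcast!(sqrt, out, x)

for n in (10, 100, 1_000, 10_000)
    x = rand(n)
    out = similar(x)
    t_alloc = time_per_call(sqrt_alloc, x)
    t_inpl = time_per_call(sqrt_inplace!, out, x)
    println("n = $n: alloc ", 1e9 * t_alloc / n, " ns/elt, in-place ",
            1e9 * t_inpl / n, " ns/elt")
end
```

Passing `sqrt` directly instead of an anonymous function sidesteps the anonymous-function cost mentioned above. For serious measurements a dedicated benchmarking package is likely a better choice than a hand-rolled loop like this one.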
