Yichao, thank you for this meaningful answer.

I understand points 1-4 and will use them to improve my code.

- I’ve redrawn a single graph with your 2 benchmarking methods.
- I’ve added a broadcast! version of sqrt, and two versions from the MKL/VML 
library (via VML.jl).
- Finally, I’ve added my first custom benchmark function to the plot (in 
black).

=> My custom benchmark function is clearly off for small n, and this 
seems to come from point 5.
=> With your method you’re also measuring the inner for loop: is the cost 
of this loop negligible compared to the cost of sqrt?
=> On my machine, I get quite different results: allocation is about 2.5x 
faster than sqrt, and there is still a huge loss of performance for n<1e2.
=> Using broadcast!, allocation should not be part of the measurement, 
right? But there’s still a gap in performance ...
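
To make the last two points concrete, here is a minimal sketch of the kind of comparison I mean. It assumes Julia 1.x syntax (`sqrt.(x)`), and `time_per_element` is a hypothetical helper written only for this illustration, not the function used for the plots. It times an allocating sqrt against an in-place `broadcast!` and reports the average time per element, so the small-n overhead shows up directly.

```julia
# Sketch only: assumes Julia 1.x; `time_per_element` is a hypothetical helper.

# Average time per element when calling f() `reps` times on n elements.
function time_per_element(f, n, reps)
    f()                              # warm-up call so compilation is not timed
    t = @elapsed for _ in 1:reps
        f()
    end
    return t / (reps * n)
end

for n in (10, 100, 1_000, 10_000)
    x = rand(n)
    y = similar(x)                   # preallocated output for the in-place version
    reps = cld(10^6, n)              # keep the total work roughly constant across n
    # Allocating version: a new result array is created on every call.
    t_alloc = time_per_element(() -> sqrt.(x), n, reps)
    # In-place version: results are written into y, no result array is allocated.
    t_inplace = time_per_element(() -> broadcast!(sqrt, y, x), n, reps)
    println("n = ", n, "  allocating: ", t_alloc, " s/elt  in-place: ", t_inplace, " s/elt")
end
```

The per-call overhead (function call, loop setup) is divided by n here, so it should weigh more for small n whether or not allocation is involved.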

So, do you think your explanation for point 6 is valid?
Is it just a matter of measurement, or is the performance of vectorized 
operations really penalized (by what?) for small n?

Thanks,
Lionel

Note: I’m going to implement a nonlinear solver which deals with about 
100 beam elements, where each element has about 10 to 100 nodes.
I want to evaluate the impact of modeling my problem with one big DOF 
vector (1e3 to 1e4 nodes) versus a nested vector (a vector of 100 
vectors, each of 10 to 100 elements).
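
A rough sketch of the two layouts, assuming Julia 1.x and using the sizes from the note above (50 nodes per element is an arbitrary pick in the 10 to 100 range). All names here (`sqrt_flat!`, `sqrt_nested!`, `time_calls`) are hypothetical and only serve the illustration; sqrt stands in for the real per-node work.

```julia
# Sketch only: assumes Julia 1.x; all names are hypothetical.
n_elements = 100                     # number of beam elements
n_nodes = 50                         # nodes per element (assumed, within 10 to 100)

flat = rand(n_elements * n_nodes)                    # one big DOF vector
nested = [rand(n_nodes) for _ in 1:n_elements]       # a vector of small vectors

flat_out = similar(flat)
nested_out = [similar(v) for v in nested]

# Flat layout: a single in-place broadcast over all nodes.
sqrt_flat!(out, x) = broadcast!(sqrt, out, x)

# Nested layout: one small in-place broadcast per element.
function sqrt_nested!(out, x)
    for i in eachindex(x)
        broadcast!(sqrt, out[i], x[i])
    end
    return out
end

# Total time of `reps` calls, after one warm-up call to exclude compilation.
function time_calls(f!, out, x, reps)
    f!(out, x)
    return @elapsed for _ in 1:reps
        f!(out, x)
    end
end

t_flat = time_calls(sqrt_flat!, flat_out, flat, 1_000)
t_nested = time_calls(sqrt_nested!, nested_out, nested, 1_000)
println("flat: ", t_flat, " s   nested: ", t_nested, " s")
```

Both versions do the same number of sqrt calls, so any difference between the two timings would come from the per-vector overhead of the 100 small broadcasts.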



<https://lh3.googleusercontent.com/-SY5BcG0XvaQ/Vjzj5-tT8cI/AAAAAAAAEgs/dl1Rr34VcbA/s1600/sqrt_yichao.png>

<https://lh3.googleusercontent.com/-HJDgvDskYWo/Vjzjs3D7cHI/AAAAAAAAEgk/Ha-eqUsA3r8/s1600/sqrt_bench.png>



