Yichao, thank you for this meaningful answer.
I understand points 1-4 and will use them to improve my code.
- I’ve redrawn a single graph with your two benchmarking methods.
- I’ve added a broadcast! version of sqrt, and two versions from the
MKL/VML library (via VML.jl).
- Finally, I’ve added my first custom benchmark function to the plot (in
black).
=> My custom benchmark function is clearly off for small n, and that
seems to come from point 5.
=> With your method you’re also measuring the inner for loops: is the
cost of these loops negligible compared to the cost of sqrt?
=> On my machine I get quite different results: allocation is about 2.5x
faster than sqrt, and there is still a huge loss of performance for n<1e2.
=> Using broadcast!, allocation should not be part of the measurement,
right? But there’s still a gap in performance...
So, do you think your explanation for point 6 is valid?
Is it just a matter of measurement, or is the performance of vectorized
operations penalized (by what?) for small n?
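To make the question concrete, here is a minimal sketch of one way to measure the per-element cost of an in-place sqrt for several n. It is not the code behind the plots: the helper name `time_per_element` and the repetition counts are hypothetical, and the syntax assumes a current Julia 1.x release. The call is repeated many times so that the fixed cost of the timer itself is spread over the whole run.

```julia
# Hypothetical helper (not from the original benchmarks): average time per
# element of an in-place sqrt, repeated `reps` times to amortize timer cost.
function time_per_element(n::Int, reps::Int)
    x = rand(n)
    out = similar(x)                  # output buffer allocated once, up front
    broadcast!(sqrt, out, x)          # warm-up call, keeps compilation out of the timing
    t = @elapsed for _ in 1:reps
        broadcast!(sqrt, out, x)      # writes into `out`, no new array is created
    end
    return t / (reps * n)             # seconds per element
end

for n in (10, 100, 1_000, 10_000)
    reps = max(1, div(10^6, n))       # roughly constant total work for each n
    println("n = ", n, "  ", time_per_element(n, reps) * 1e9, " ns/element")
end
```

If the ns/element figure still rises for small n with this kind of loop, the gap is more likely a fixed per-call cost than an allocation effect, though that is only a guess until it is measured.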
Thanks,
Lionel
Note: I’m going to implement a nonlinear solver that deals with about
100 beam elements, each with about 10 to 100 nodes.
I want to evaluate the impact of modeling my problem with one big DOF
vector (1e3 to 1e4 nodes) versus a nested vector (a vector of 100
vectors, each of 10 to 100 entries).
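The two layouts in the note could be compared with a sketch along these lines. Everything here is an assumption for illustration: the function names `sqrt_nested!` and `compare_layouts`, the use of sqrt as a stand-in for the real per-node work, and the sizes in the usage line. It assumes current Julia 1.x syntax.

```julia
# In-place sqrt over the nested layout: one broadcast! call per inner vector.
function sqrt_nested!(outs, xs)
    for i in eachindex(xs)
        broadcast!(sqrt, outs[i], xs[i])
    end
    return outs
end

# Hypothetical comparison: time `reps` passes over a flat vector and over a
# nested vector holding the same total number of entries.
function compare_layouts(nelem::Int, nnodes::Int, reps::Int)
    flat       = rand(nelem * nnodes)              # one big DOF vector
    nested     = [rand(nnodes) for _ in 1:nelem]   # vector of small vectors
    flat_out   = similar(flat)
    nested_out = [similar(v) for v in nested]

    # Warm-up calls so compilation is not included in the timings.
    broadcast!(sqrt, flat_out, flat)
    sqrt_nested!(nested_out, nested)

    t_flat = @elapsed for _ in 1:reps
        broadcast!(sqrt, flat_out, flat)           # a single call per pass
    end
    t_nested = @elapsed for _ in 1:reps
        sqrt_nested!(nested_out, nested)           # `nelem` calls per pass
    end
    return t_flat / reps, t_nested / reps          # seconds per pass
end

t_flat, t_nested = compare_layouts(100, 50, 1_000)
println("flat:   ", t_flat, " s per pass")
println("nested: ", t_nested, " s per pass")
```

The nested version pays any fixed per-call cost 100 times per pass, so this comparison should show directly whether the small-n penalty matters at the sizes of the solver.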
<https://lh3.googleusercontent.com/-SY5BcG0XvaQ/Vjzj5-tT8cI/AAAAAAAAEgs/dl1Rr34VcbA/s1600/sqrt_yichao.png>
<https://lh3.googleusercontent.com/-HJDgvDskYWo/Vjzjs3D7cHI/AAAAAAAAEgk/Ha-eqUsA3r8/s1600/sqrt_bench.png>