On Fri, Nov 6, 2015 at 12:47 PM, Yichao Yu <[email protected]> wrote:

>
>
> On Fri, Nov 6, 2015 at 12:32 PM, Lionel du Peloux <
> [email protected]> wrote:
>
>>
>> Yichao, thank you for this meaningful answer.
>>
>> I understand points 1-4 to improve my coding.
>>
>> - I’ve redrawn a single graph with your 2 benchmarking methods.
>> - I’ve added a broadcast! version of sqrt, and two versions from the
>> MKL/VML library (with VML.jl).
>> - Finally, I’ve added my first custom benchmark function to the plot (in
>> black).
>>
>> => My custom benchmark function is clearly off for small n, and
>> that seems to come from point 5.
>> => With your method you’re also measuring the inner for loops: is the
>> cost of this loop negligible compared to the cost of sqrt?
>>
>
> Well, I would imagine that the cost of a loop is much smaller than the
> cost of measuring the time. On the machine where I ran the benchmark, the
> overhead of calling an empty non-inlined function in a loop is ~1.17ns, and
> it is certainly negligible compared to the GC allocation cost (which is
> ~2-3ns per 64 bits).
>
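
To get a rough number for that call overhead on another machine, something like the following could be used. This is only a sketch: the names `bump!` and `call_in_loop!` are made up here, and the call is given a tiny side effect (incrementing a counter) so that the compiler cannot delete it, which means the figure includes one memory increment on top of the call itself.

```julia
# Hypothetical helper: a non-inlined call with a small side effect,
# so the optimizer cannot remove it from the loop.
@noinline bump!(counter) = (counter[] += 1; nothing)

function call_in_loop!(counter, n)
    for _ in 1:n
        bump!(counter)
    end
    return counter
end

counter = Ref(0)
call_in_loop!(counter, 1)            # warm-up, so compilation is not timed

n = 10^7
t = @elapsed call_in_loop!(counter, n)
println("approx. cost per call: ", 1e9 * t / n, " ns")
```

The result will vary with the CPU and the Julia version, so it should only be read as an order of magnitude.
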
>
>> => On my machine, I get quite different results: allocation is 2.5x
>> faster than sqrt, and there is still a huge loss of performance for n<1e2.
>>
>
> Which LLVM version are you using? IIRC we are not using the sqrt intrinsic
> on LLVM 3.3 (the default one). I'm using LLVM 3.7, and the sqrt function on
> my machine uses the `vsqrtsd` instruction rather than calling the libm
> function, which makes a big difference. (It doesn't seem to be vectorized
> (SIMD), and I'm not sure why.)
>

Actually, I think we are using the intrinsic on LLVM 3.3 and it generates
a single instruction for this operation, so please ignore this
point.
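
One way to check which instruction is generated on a given setup is to dump the native code for `sqrt`. A minimal sketch, assuming a recent Julia where `code_native` lives in the InteractiveUtils standard library (in the REPL, `@code_native sqrt(2.0)` does the same); the helper name `native_sqrt` is made up for illustration:

```julia
using InteractiveUtils  # provides code_native outside the REPL

# Hypothetical helper: return the native assembly of sqrt(::Float64) as a string.
function native_sqrt()
    io = IOBuffer()
    code_native(io, sqrt, (Float64,))
    return String(take!(io))
end

asm = native_sqrt()
println(asm)

# On x86-64 a hardware square root appears as `sqrtsd` or `vsqrtsd`;
# other architectures use a different mnemonic.
if occursin("sqrtsd", asm)
    println("inline sqrt instruction found")
else
    println("no sqrtsd found (different architecture, or a libm call?)")
end
```
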


>
>
>> => Using broadcast!, allocation should not be part of the measurement,
>> right? But there’s still a gap in performance...
>>
>
> There's also the cost of the anonymous function. I'm not sure how you wrote
> the broadcast! version, and this could be a problem.
>
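
For reference, here is a sketch of what an in-place version might look like, with and without an anonymous function. The wrapper names are made up, and this is not necessarily how the benchmark in question was written. As far as I know, anonymous functions were slow to call in Julia 0.4 and are compiled like named functions from 0.5 on, so the difference between the two variants depends heavily on the version.

```julia
x = rand(1000)
out = similar(x)            # preallocated output buffer

# Hypothetical wrappers, so the timing and allocation macros measure compiled code.
sqrt_inplace!(out, x) = broadcast!(sqrt, out, x)            # named function
sqrt_inplace_anon!(out, x) = broadcast!(v -> sqrt(v), out, x)  # anonymous function
sqrt_alloc(x) = sqrt.(x)                                    # allocates the result

# Warm-up calls so that compilation is not measured
sqrt_inplace!(out, x); sqrt_inplace_anon!(out, x); sqrt_alloc(x)

bytes_inplace = @allocated sqrt_inplace!(out, x)
bytes_alloc   = @allocated sqrt_alloc(x)
println("in-place: ", bytes_inplace, " bytes, allocating: ", bytes_alloc, " bytes")

t_named = @elapsed sqrt_inplace!(out, x)
t_anon  = @elapsed sqrt_inplace_anon!(out, x)
println("named: ", t_named, " s, anonymous: ", t_anon, " s")
```

A single `@elapsed` call at this size is very noisy, so the timing should be repeated many times before drawing any conclusion.
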
>
>>
>> So, do you think your explanation for point 6 is valid?
>>
>
> I think it is certainly valid on my machine. Not sure about other setups.
>
>
>> Is it just a matter of measurement, or is the performance of vectorized
>> operations penalized (by what?) for small n?
>>
>> Thanks,
>> Lionel
>>
>> Note: I’m going to implement a nonlinear solver which deals with about
>> 100 beam elements, where each element has about 10 to 100 nodes.
>> I want to evaluate the impact of modeling my problem with
>> one big DOF vector (1e3 to 1e4 nodes) versus a nested vector (a vector of
>> 100 vectors, each of 10 to 100 elements).
>>
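
The two layouts described in the note could be compared with something like the sketch below. The sizes (100 elements of 50 nodes) and all function names are assumptions picked for illustration, and `sqrt` stands in for whatever per-node operation the solver actually performs.

```julia
nelem, nnode = 100, 50       # assumed sizes, for illustration only

flat = rand(nelem * nnode)   # one big vector
# Nested layout: one small vector per element (copies of the flat data)
nested = [flat[(i - 1) * nnode + 1 : i * nnode] for i in 1:nelem]

flat_out = similar(flat)
nested_out = [similar(v) for v in nested]

# Hypothetical kernels: in-place sqrt on each layout
function sqrt_flat!(out, x)
    broadcast!(sqrt, out, x)
    return out
end

function sqrt_nested!(outs, xs)
    for i in eachindex(xs)
        broadcast!(sqrt, outs[i], xs[i])
    end
    return outs
end

# Warm-up, then time each layout once
sqrt_flat!(flat_out, flat); sqrt_nested!(nested_out, nested)
t_flat   = @elapsed sqrt_flat!(flat_out, flat)
t_nested = @elapsed sqrt_nested!(nested_out, nested)
println("flat: ", t_flat, " s, nested: ", t_nested, " s")
```

Both versions compute the same values, so any difference comes from the per-call overhead of the 100 small broadcasts versus one large one.
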
>>
>>
>>
>> <https://lh3.googleusercontent.com/-SY5BcG0XvaQ/Vjzj5-tT8cI/AAAAAAAAEgs/dl1Rr34VcbA/s1600/sqrt_yichao.png>
>>
>>
>> <https://lh3.googleusercontent.com/-HJDgvDskYWo/Vjzjs3D7cHI/AAAAAAAAEgk/Ha-eqUsA3r8/s1600/sqrt_bench.png>
>>
>>
>>
>>
>>
>
