On Wed, Oct 14, 2015 at 10:57 AM, Damien <[email protected]> wrote:
> Hi all,
>
> I'm noticing a strange performance issue with expressions such as this one:
>
> n = 100000
> a = zeros(Float32, n)
> b = rand(Float32, n)
> c = rand(Float32, n)
>
> function test(a, b, c)
>     @simd for i in 1:length(a)
>         @inbounds a[i] += b[i] * c[i] * (c[i] < b[i]) * (c[i] > b[i]) * (c[i] <= b[i]) * (c[i] >= b[i])
>     end
> end
>
> The problem depends on the number of statements in the expression and
> whether the comparisons are explicitely cast to Float32.
>
> In Julia 0.4-rc4, I get the following:
>
> @inbounds a[i] += b[i] * c[i] * (c[i] < b[i]) * (c[i] > b[i]) * (c[i] <= b[i]) * (c[i] >= b[i])
>
>> test(a, b, c)
>> @time test(a, b, c)
>
> 0.000143 seconds (4 allocations: 160 bytes)
>
> @inbounds a[i] += b[i] * (c[i] < b[i]) * (c[i] < b[i]) * (c[i] < b[i])
>
>> test(a, b, c)
>> @time test(a, b, c)
>
> 0.000004 seconds (4 allocations: 160 bytes)
>
> Four or more, loop is NOT vectorised:
>
> @inbounds a[i] += b[i] * (c[i] < b[i]) * (c[i] < b[i]) * (c[i] < b[i]) * (c[i] < b[i])
>
>> test(a, b, c)
>> @time test(a, b, c)
>
> 0.000021 seconds (204 allocations: 3.281 KB)
>
> Explicit casts, loop is vectorised again:
>
> @inbounds a[i] += b[i] * Float32(c[i] < b[i]) * Float32(c[i] < b[i]) * Float32(c[i] < b[i]) * Float32(c[i] < b[i])
>
>> test(a, b, c)
>> @time test(a, b, c)
>
> 0.000003 seconds (4 allocations: 160 bytes)
>
> Julia Version 0.5.0-dev+769
> Commit d9f7c21* (2015-10-14 12:03 UTC)
> Platform Info:
>   System: Darwin (x86_64-apple-darwin13.4.0)
>   CPU: Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
>   WORD_SIZE: 64
>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)
>   LAPACK: libopenblas
>   LIBM: libopenlibm
>   LLVM: libLLVM-3.3
Inlining is a little too fragile here, so you should check with @code_llvm whether all the functions are inlined. I've also noticed that the SHA you give doesn't seem to be a valid commit on JuliaLang/julia, so I couldn't check whether the inlining fix is included.
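A minimal sketch of the @code_llvm check, assuming a current Julia 1.x. There, `@code_llvm`/`code_llvm` live in the InteractiveUtils standard library (in 0.4 they were in Base). The helper names `llvm_ir`, `calls_in` and `has_vector_ops` are hypothetical. Looking for `<N x float>` vector types in the IR is only a heuristic for "the loop was vectorised", and the result depends on the CPU and Julia/LLVM version.

```julia
using InteractiveUtils  # code_llvm / @code_llvm outside the REPL (Julia >= 0.7)

# Bool-valued comparisons multiplied straight into the expression
function test_bool!(a, b, c)
    @simd for i in 1:length(a)
        @inbounds a[i] += b[i] * (c[i] < b[i]) * (c[i] < b[i]) * (c[i] < b[i]) * (c[i] < b[i])
    end
    return a
end

# Same expression with each comparison explicitly cast to Float32
function test_cast!(a, b, c)
    @simd for i in 1:length(a)
        @inbounds a[i] += b[i] * Float32(c[i] < b[i]) * Float32(c[i] < b[i]) *
                          Float32(c[i] < b[i]) * Float32(c[i] < b[i])
    end
    return a
end

# Hypothetical helper: optimized LLVM IR of f for the given argument types, as a String
llvm_ir(f, types) = sprint(code_llvm, f, types)

# Hypothetical helper: IR lines containing "call"; inspect these by eye, since
# LLVM intrinsics also show up as calls and are not a sign of failed inlining
calls_in(ir) = filter(l -> occursin("call", l), split(ir, '\n'))

# Heuristic: SIMD vector types such as <4 x float> suggest the loop was vectorised
has_vector_ops(ir) = occursin(r"<\d+ x float>", ir)

# Usage: both variants compute the same values; compare their IR
n = 100_000
b = rand(Float32, n); c = rand(Float32, n)
a_bool = test_bool!(zeros(Float32, n), b, c)
a_cast = test_cast!(zeros(Float32, n), b, c)
argtypes = (Vector{Float32}, Vector{Float32}, Vector{Float32})
for f in (test_bool!, test_cast!)
    ir = llvm_ir(f, argtypes)
    println(f, ": vector ops = ", has_vector_ops(ir), ", call lines = ", length(calls_in(ir)))
end
```

Comparing the two printed lines (or just running `@code_llvm test_bool!(a, b, c)` interactively) shows whether the Bool-multiply version still contains calls that were not inlined and whether it lost its vector instructions relative to the explicit-cast version.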
