Hi all,
I'm noticing a strange performance issue with expressions such as this one:
n = 100000
a = zeros(Float32, n)
b = rand(Float32, n)
c = rand(Float32, n)
function test(a, b, c)
    @simd for i in 1:length(a)
        @inbounds a[i] += b[i] * c[i] * (c[i] < b[i]) * (c[i] > b[i]) *
                          (c[i] <= b[i]) * (c[i] >= b[i])
    end
end
The problem depends on the number of comparison terms in the expression and
whether the comparisons are explicitly cast to Float32.
In Julia 0.4-rc4, I get the following:
Original expression (as above):
@inbounds a[i] += b[i] * c[i] * (c[i] < b[i]) * (c[i] > b[i]) *
                  (c[i] <= b[i]) * (c[i] >= b[i])
> test(a, b, c)
> @time test(a, b, c)
0.000143 seconds (4 allocations: 160 bytes)
Three comparisons:
@inbounds a[i] += b[i] * (c[i] < b[i]) * (c[i] < b[i]) * (c[i] < b[i])
> test(a, b, c)
> @time test(a, b, c)
0.000004 seconds (4 allocations: 160 bytes)
Four or more, loop is NOT vectorised:
@inbounds a[i] += b[i] * (c[i] < b[i]) * (c[i] < b[i]) * (c[i] < b[i]) * (c[i] < b[i])
> test(a, b, c)
> @time test(a, b, c)
0.000021 seconds (204 allocations: 3.281 KB)
Explicit casts, loop is vectorised again:
@inbounds a[i] += b[i] * Float32(c[i] < b[i]) * Float32(c[i] < b[i]) *
                  Float32(c[i] < b[i]) * Float32(c[i] < b[i])
> test(a, b, c)
> @time test(a, b, c)
0.000003 seconds (4 allocations: 160 bytes)
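For anyone who wants to reproduce this, here is a rough sketch of how the vectorisation could be checked directly instead of being inferred from the timings. It assumes a recent Julia 1.x rather than the 0.4/0.5 builds used above, so the results may well differ from mine. The names test_implicit!, test_explicit! and uses_float_vectors are made up for illustration, and searching the IR for a float vector type is only a heuristic.

```julia
using InteractiveUtils  # provides code_llvm in Julia 1.x

# Four comparisons multiplied in as Bool (the slow case in this post).
function test_implicit!(a, b, c)
    @simd for i in 1:length(a)
        @inbounds a[i] += b[i] * (c[i] < b[i]) * (c[i] < b[i]) *
                          (c[i] < b[i]) * (c[i] < b[i])
    end
    return a
end

# Same expression with explicit Float32 casts (the fast case in this post).
function test_explicit!(a, b, c)
    @simd for i in 1:length(a)
        @inbounds a[i] += b[i] * Float32(c[i] < b[i]) * Float32(c[i] < b[i]) *
                          Float32(c[i] < b[i]) * Float32(c[i] < b[i])
    end
    return a
end

# Hypothetical helper: returns true if the optimised LLVM IR for `f`
# mentions a float vector type such as <4 x float> or <8 x float>.
function uses_float_vectors(f, argtypes)
    io = IOBuffer()
    code_llvm(io, f, argtypes)
    return occursin(r"<\d+ x float>", String(take!(io)))
end

n = 100000
b = rand(Float32, n)
c = rand(Float32, n)
argtypes = Tuple{Vector{Float32}, Vector{Float32}, Vector{Float32}}

println("implicit uses float vectors: ", uses_float_vectors(test_implicit!, argtypes))
println("explicit uses float vectors: ", uses_float_vectors(test_explicit!, argtypes))
```

Both variants should compute the same values, so any difference in the printed flags would point at code generation rather than at the arithmetic.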
Julia Version 0.5.0-dev+769
Commit d9f7c21* (2015-10-14 12:03 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3