On Thu, Sep 8, 2016 at 7:04 PM, DNF <[email protected]> wrote:

> But if branch prediction doesn't factor in, what is the explanation of
> this:
>

I didn't say it has no effect. It's irrelevant here because:

1. Having a branch turns off SIMD, which is the main factor.
2. It changes the problem: different input will of course have different
performance characteristics.
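A minimal sketch of the two inner-loop variants under discussion (the function names here are illustrative, not from the original code):

```julia
# Branchy version: the `if` introduces control flow in the loop body,
# which prevents LLVM's loop vectorizer from emitting SIMD instructions.
function count_gt_branch(s1, s2)
    ak = 0
    @inbounds for ss1 in s1, ss2 in s2
        if ss1 > ss2
            ak += 1
        end
    end
    return ak
end

# Branchless version: `ss1 > ss2` is a Bool (0 or 1), so the loop body
# is straight-line code that the vectorizer can handle.
function count_gt_nobranch(s1, s2)
    ak = 0
    @inbounds for ss1 in s1, ss2 in s2
        ak += ss1 > ss2
    end
    return ak
end
```

Both compute the same count; only the second is eligible for SIMD.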


>
> julia> a = rand(5000);
>
> julia> b = rand(5000);
>
> julia> c = rand(5000) + 0.5;
>
> julia> d = rand(5000) + 1;
>
> julia> @time essai(200,a,b);
>
>  14.607105 seconds (5 allocations: 1.922 KB)
>
> julia> @time essai(200,a,c);
>
>   8.357925 seconds (5 allocations: 1.922 KB)
>
> julia> @time essai(200,a,d);
>
>   3.159876 seconds (5 allocations: 1.922 KB)
>
>
> On Friday, September 9, 2016 at 12:53:46 AM UTC+2, Yichao Yu wrote:
>>
>> Shape is irrelevant since it doesn't affect the order in the loop at all.
>>
>> Branch prediction is not the issue here.
>>
>> The issue is optimizing memory access and simd.
>>
>> It is illegal to optimize the original code into `a[k] += ss1 > ss2`. It
>> is legal to optimize the `if ss1 > ss2 ak += 1 end` version to `ak += ss1 >
>> ss2` and this is the optimization LLVM should do but doesn't in this case.
>>
>> Also, the thing to look for when checking whether a loop is vectorized in
>> the LLVM IR is a vector type (e.g. `<4 x i64>`) in the loop body, like
>>
>> ```
>>   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
>>   %offset.idx = or i64 %index, 1
>>   %20 = add i64 %offset.idx, -1
>>   %21 = getelementptr i64, i64* %19, i64 %20
>>   %22 = bitcast i64* %21 to <4 x i64>*
>>   store <4 x i64> zeroinitializer, <4 x i64>* %22, align 8
>>   %23 = getelementptr i64, i64* %21, i64 4
>>   %24 = bitcast i64* %23 to <4 x i64>*
>>   store <4 x i64> zeroinitializer, <4 x i64>* %24, align 8
>>   %25 = getelementptr i64, i64* %21, i64 8
>>   %26 = bitcast i64* %25 to <4 x i64>*
>>   store <4 x i64> zeroinitializer, <4 x i64>* %26, align 8
>>   %27 = getelementptr i64, i64* %21, i64 12
>>   %28 = bitcast i64* %27 to <4 x i64>*
>>   store <4 x i64> zeroinitializer, <4 x i64>* %28, align 8
>>   %index.next = add i64 %index, 16
>>   %29 = icmp eq i64 %index.next, %n.vec
>> ```
>>
>> Having a basic block named `vector.body` alone doesn't mean the loop is
>> vectorized.
>>
>>
>>
>> On Thu, Sep 8, 2016 at 6:40 PM, 'Greg Plowman' via julia-users <
>> [email protected]> wrote:
>>
>>> The difference is probably SIMD.
>>>
>>> The branched code will not use SIMD.
>>>
>>> Either of these should eliminate the branch and allow SIMD:
>>> ak += ss1 > ss2
>>> ak += ifelse(ss1 > ss2, 1, 0)
>>>
>>> Check with @code_llvm, looking for the section vector.body.
>>>
>>>
>>>  at 5:45:30 AM UTC+10, Dupont wrote:
>>>
>>>> What is strange to me is that this is much slower
>>>>
>>>>
>>>> function essai(n, s1, s2)
>>>>     a = Vector{Int64}(n)
>>>>
>>>>     @inbounds for k = 1:n
>>>>         ak = 0
>>>>         for ss1 in s1, ss2 in s2
>>>>             if ss1 > ss2
>>>>                 ak += 1
>>>>             end
>>>>         end
>>>>         a[k] = ak
>>>>     end
>>>>     return a
>>>> end
>>>>
>>>>
>>>>
>>

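For reference, a complete branchless variant of the `essai` function from this thread, using the `ifelse` suggestion above (the name `essai_noif` is made up for illustration; `Vector{Int64}(n)` is the Julia 0.5-era constructor used in the original post):

```julia
function essai_noif(n, s1, s2)
    a = Vector{Int64}(n)   # uninitialized result vector (Julia 0.5-era syntax)
    @inbounds for k = 1:n
        ak = 0
        for ss1 in s1, ss2 in s2
            # ifelse evaluates both arms eagerly, so there is no branch
            # in the loop body and LLVM can vectorize it.
            ak += ifelse(ss1 > ss2, 1, 0)
        end
        a[k] = ak
    end
    return a
end
```

To check whether vectorization actually happened, inspect `@code_llvm essai_noif(1, rand(10), rand(10))` and look for vector types such as `<4 x i64>` inside the loop body, as discussed earlier in the thread.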