On Sun, May 17, 2015 at 12:52 PM, Milan Bouchet-Valat <[email protected]> wrote:
> Le dimanche 17 mai 2015 à 09:25 -0700, Mohammed El-Beltagy a écrit :
>
> You are quite right about the type assertions and that @inbounds would
> certainly speed things up.
> However, I am concerned here with how memory was being allocated. I wish
> that somebody who is familiar with DataArray would explain this behavior.
>
> That's a known design issue with DataArrays, and the reason why John Myles
> White has started working on Nullable and NullableArrays to replace them. As

Didn't know about this part of the story.

P.S. your example leads me to hit
https://github.com/JuliaLang/julia/issues/11313 . Thank you for
exposing it....

> Yichao noted, []/getindex is type-unstable for DataArrays as it can return
> NA, and this kills performance in Julia.
>
> To improve performance, you can access the internals of the DataArray, doing
> something like:
>
> function countGT(x::DataArray{Float64,1})
>     count = 0.0
>     for i = 1:length(x)
>         if !isna(x, i)
>             count += (x.data[i] > 5.0) ? 1.0 : 0.0
>         end
>     end
>     count
> end
>
> Always write isna(x, i) instead of isna(x[i]), since the latter suffers from
> type instability.
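
For readers without DataArrays installed, the pattern above can be sketched
with a stand-in type. This is only an illustration under assumptions:
`MaskedVector`, `isna_at` and `count_gt` are hypothetical names, not
DataArrays API, and the syntax is for current Julia (`struct`), not the
release used in this thread. It mirrors the `data`/`na` layout and the
`isna(x, i)` check, so the loop only ever touches a `Float64` or a `Bool`.

```julia
# Hypothetical stand-in for a DataArray: values plus a missingness mask.
struct MaskedVector
    data::Vector{Float64}   # stored values (arbitrary where na[i] is true)
    na::Vector{Bool}        # true where the value is missing
end

# Analogue of isna(x, i): always returns a Bool, so it is type-stable.
isna_at(x::MaskedVector, i::Integer) = x.na[i]

# Count the non-missing entries greater than `threshold`.
function count_gt(x::MaskedVector, threshold::Float64)
    count = 0
    for i in eachindex(x.data)
        if !isna_at(x, i) && x.data[i] > threshold
            count += 1
        end
    end
    return count
end

x = MaskedVector([1.0, 6.0, 7.5, 9.0], [false, false, true, false])
n = count_gt(x, 5.0)   # 6.0 and 9.0 qualify; 7.5 is masked out
```

The design choice is the same as in the advice above: the mask is queried
separately from the value, so no call ever has to return "a number or NA".
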
>
> Regards
>
>
>
> On Sunday, May 17, 2015 at 6:12:11 PM UTC+2, Yichao Yu wrote:
>
> On Sun, May 17, 2015 at 11:28 AM, Mohammed El-Beltagy
> <[email protected]> wrote:
>> Today while trying to optimize a piece of code I came across a rather curious
>> memory allocation behavior when accessing a DataArray.
>>
>> x=rand(1:10,1000000);
>> function countGT(x::Array{Int,1})
>
> Since the algorithm is the same for both types, I think you don't need
> the type assert here. Julia will automatically specialize on the type
> you pass in.
>
>>     count=0
>>     for i=1:length(x)
>>       count += (x[i] > 5) ? 1 : 0
>
> Adding `@inbounds` here will improve the performance for `Array`. Not
> sure if it can help with `DataArray` yet though.
>
>>     end
>>     count
>> end
>>
>> Here is what you get after running @time (compilation excluded)
>>
>> @time countGT(x);
>> elapsed time: 0.00847156 seconds (96 bytes allocated)
>>
>> That is not too bad. @time itself allocates at least 80 bytes and the extra 16
>> bytes are for creating the variable "count", so far so good.
>> Now let's see what happens if we do the same with a floating point array.
>> x=rand(1000000);
>> function countGT(x::Array{Float64,1})
>>     count=0.0
>>     for i=1:length(x)
>>       count += (x[i] > 5.0) ? 1.0 : 0.0
>>     end
>>     count
>> end
>>
>> countGT(x)
>> @time countGT(x)
>>
>> You get
>> elapsed time: 0.00177126 seconds (96 bytes allocated)
>> Which is still pretty good. Now, the problem starts to show up when I have a
>> DataArray
>> x=@data rand(1000000);
>> function countGT(x::DataArray{Float64,1})
>>     count=0.0
>>     for i=1:length(x)
>>       count += (x[i] > 5.0) ? 1.0 : 0.0
>>     end
>>     count
>> end
>
> `getindex` of DataArray appears to be not type stable. It returns
> either `NAType` or the data type. I think this is probably the reason
> for the allocation.
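
One way to check a suspicion like this is to ask the compiler what it
infers. A minimal sketch in current Julia, using hypothetical stand-ins
(`FakeNA`, `unstable_get`, `stable_get`) rather than DataArrays itself;
`Base.return_types` is a reflection helper and what it prints may vary
between Julia versions:

```julia
# Hypothetical sentinel type playing the role of NAType.
struct FakeNA end

# Returns either a FakeNA or a Float64, depending on a runtime value.
unstable_get(data::Vector{Float64}, na::Vector{Bool}, i::Int) =
    na[i] ? FakeNA() : data[i]

# Always returns a Float64; the mask has to be checked separately.
stable_get(data::Vector{Float64}, na::Vector{Bool}, i::Int) = data[i]

argtypes = (Vector{Float64}, Vector{Bool}, Int)
unstable_rt = Base.return_types(unstable_get, argtypes)[1]
stable_rt = Base.return_types(stable_get, argtypes)[1]

println(unstable_rt, " concrete? ", isconcretetype(unstable_rt))
println(stable_rt, " concrete? ", isconcretetype(stable_rt))
```

A non-concrete return type like the first one is what "not type stable"
means here. If each returned value ends up boxed, that would be consistent
with the 16 bytes per element reported in this thread (16000096 bytes for
1000000 elements).
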
>
>>
>> countGT(x)
>> @time countGT(x)
>>
>> You get
>> elapsed time: 0.23610454 seconds (16000096 bytes allocated)
>>
>> The bytes allocated seem to scale with the size of the DataArray. So it
>> seems that the mere act of accessing an element in a DataArray allocates
>> memory.
>>
>> I am wondering whether there could be a better way.
>>
>>
>
> I'm not familiar with DataArrays and its API but I would guess it can
> use Nullable or something similar.
>
>>
>>
>>
>>
>>
>
>
