On Sunday, May 17, 2015 at 09:25 -0700, Mohammed El-Beltagy wrote:
> You are quite right about the type assertions and that @inbounds would
> certainly speed things up.
> However, I am concerned here with how memory was being allocated. I
> wish that somebody who is familiar with DataArray would explain this
> behavior.
That's a known design issue with DataArrays, and the reason why John
Myles White has started working on Nullable and NullableArrays to
replace them. As Yichao noted, []/getindex is type-unstable for
DataArrays: it can return either an element or NA, so the compiler
cannot infer a single concrete return type, and this kills performance
in Julia. It is most likely also where the per-element allocations you
measured come from.
To improve performance, you can access the internals of the DataArray,
doing something like:
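function countGT(x::DataArray{Float64,1})
    count = 0.0
    for i = 1:length(x)
        # isna(x, i) checks for NA without going through getindex
        if !isna(x, i)
            # read the raw value directly from the underlying data array
            count += (x.data[i] > 5.0) ? 1.0 : 0.0
        end
    end
    count
end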
Always write isna(x, i) instead of isna(x[i]), since the latter suffers
from type instability.
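
To make the pattern explicit without depending on DataArrays itself,
here is a minimal sketch in current Julia syntax. MaskedVector,
ismissing_at and count_gt are hypothetical names invented for this
illustration and are not part of any package; the type only assumes that
values and a "missing" mask are stored separately, the way the function
above reads values from x.data. The point is that both the mask check
and the value read return a single concrete type, so the loop has no
reason to allocate per element.

```julia
# Hypothetical stand-in for a DataArray: values and a "missing" mask
# are stored in two separate, concretely typed vectors.
struct MaskedVector
    data::Vector{Float64}   # raw values, always Float64
    isna::Vector{Bool}      # true where the value is missing
end

# Type-stable check, analogous in spirit to isna(x, i): always a Bool.
ismissing_at(x::MaskedVector, i::Int) = x.isna[i]

# Count the non-missing entries strictly greater than `threshold`.
function count_gt(x::MaskedVector, threshold::Float64)
    n = 0
    for i in 1:length(x.data)
        # Check the mask first, then read the raw value directly.
        if !ismissing_at(x, i) && x.data[i] > threshold
            n += 1
        end
    end
    return n
end

x = MaskedVector([1.0, 7.5, 9.0, 6.0], [false, false, true, false])
println(count_gt(x, 5.0))
```

Here the third entry is flagged as missing, so only 7.5 and 6.0 are
counted.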
Regards
> On Sunday, May 17, 2015 at 6:12:11 PM UTC+2, Yichao Yu wrote:
>
> On Sun, May 17, 2015 at 11:28 AM, Mohammed El-Beltagy
> <[email protected]> wrote:
> > Today, while trying to optimize a piece of code, I came across a
> > rather curious memory allocation behavior when accessing a
> > DataArray.
> >
> > x=rand(1:10,1000000);
> > function countGT(x::Array{Int,1})
>
> Since the algorithm is the same for both types, I think you don't
> need the type assertion here. Julia will automatically specialize on
> the type you pass in.
>
> > count=0
> > for i=1:length(x)
> > count+= (x[i]>5)? 1: 0
>
> Adding `@inbounds` here will improve the performance for `Array`.
> I'm not sure whether it can help with `DataArray` yet, though.
>
> > end
> > count
> > end
> >
> > Here is what you get after running @time (compilation excluded)
> >
> > @time countGT(x);
> > elapsed time: 0.00847156 seconds (96 bytes allocated)
> >
> > That is not too bad. @time itself allocates at least 80 bytes, and the
> > extra 16 bytes are for creating the variable "count", so far so good.
> > Now let's see what happens if we do the same with a floating-point array.
> > x=rand(1000000);
> > function countGT(x::Array{Float64,1})
> > count=0.0
> > for i=1:length(x)
> > count+= (x[i]>5.0)? 1.0: 0.0
> > end
> > count
> > end
> >
> > countGT(x)
> > @time countGT(x)
> >
> > You get
> > elapsed time: 0.00177126 seconds (96 bytes allocated)
> > Which is still pretty good. Now, the problem starts to show up when I
> > have a DataArray
> > x=@data rand(1000000);
> > function countGT(x::DataArray{Float64,1})
> > count=0.0
> > for i=1:length(x)
> > count+= (x[i]>5.0)? 1.0: 0.0
> > end
> > count
> > end
>
> `getindex` of DataArray appears not to be type stable. It returns
> either `NAType` or the data type. I think this is probably the reason
> for the allocation.
>
> >
> > countGT(x)
> > @time countGT(x)
> >
> > We get
> > elapsed time: 0.23610454 seconds (16000096 bytes allocated)
> >
> > The bytes allocated seem to scale with the size of the DataArray. So it
> > seems that the mere act of accessing an element in a DataArray allocates
> > memory.
> >
> > I am wondering whether there could be a better way.
> >
> >
>
> I'm not familiar with DataArrays and its API, but I would guess it
> can use Nullable or something similar.
>