On Sun, May 17, 2015 at 12:52 PM, Milan Bouchet-Valat <[email protected]> wrote:
> On Sunday, May 17, 2015 at 09:25 -0700, Mohammed El-Beltagy wrote:
>
> You are quite right about the type assertions and that @inbounds would
> certainly speed things up.
> However, I am concerned here with how memory was being allocated. I wish
> that somebody who is familiar with DataArray would explain this behavior.
>
> That's a known design issue with DataArrays, and the reason why John Myles
> White has started working on Nullable and NullableArrays to replace them. As

I didn't know about this part of the story.
P.S. Your example led me to hit
https://github.com/JuliaLang/julia/issues/11313 . Thank you for exposing it.

> Yichao noted, []/getindex is type-unstable for DataArrays as it can return
> NA, and this kills performance in Julia.
>
> To improve performance, you can access the internals of the DataArray, doing
> something like:
>
> function countGT(x::DataArray{Float64,1})
>     count = 0.0
>     for i = 1:length(x)
>         if !isna(x, i)
>             count += (x.data[i] > 5.0) ? 1.0 : 0.0
>         end
>     end
>     count
> end
>
> Always write isna(x, i) instead of isna(x[i]), since the latter suffers from
> type instability.
>
> Regards
>
> On Sunday, May 17, 2015 at 6:12:11 PM UTC+2, Yichao Yu wrote:
>
> On Sun, May 17, 2015 at 11:28 AM, Mohammed El-Beltagy
> <[email protected]> wrote:
>> Today, while trying to optimize a piece of code, I came across a rather
>> curious memory allocation behavior when accessing a DataArray.
>>
>> x = rand(1:10, 1000000);
>> function countGT(x::Array{Int,1})
>
> Since the algorithm is the same for both types, I think you don't need
> the type assertion here. Julia will automatically specialize on the type
> you pass in.
>
>>     count = 0
>>     for i = 1:length(x)
>>         count += (x[i] > 5) ? 1 : 0
>
> Adding `@inbounds` here will improve the performance for `Array`. I'm not
> sure whether it helps with `DataArray` yet, though.
>
>>     end
>>     count
>> end
>>
>> Here is what you get after running @time (compilation excluded):
>>
>> @time countGT(x);
>> elapsed time: 0.00847156 seconds (96 bytes allocated)
>>
>> That is not too bad. @time itself allocates at least 80 bytes, and the
>> extra 16 bytes are for creating the variable "count", so far so good.
>> Now let's see what happens if we do the same with a floating-point array.
>>
>> x = rand(1000000);
>> function countGT(x::Array{Float64,1})
>>     count = 0.0
>>     for i = 1:length(x)
>>         count += (x[i] > 5.0) ? 1.0 : 0.0
>>     end
>>     count
>> end
>>
>> countGT(x)
>> @time countGT(x)
>>
>> You get
>> elapsed time: 0.00177126 seconds (96 bytes allocated)
>> which is still pretty good. Now, the problem starts to show up when I
>> have a DataArray:
>>
>> x = @data rand(1000000);
>> function countGT(x::DataArray{Float64,1})
>>     count = 0.0
>>     for i = 1:length(x)
>>         count += (x[i] > 5.0) ? 1.0 : 0.0
>>     end
>>     count
>> end
>
> `getindex` of DataArray appears not to be type-stable. It returns
> either `NAType` or the data type. I think this is probably the reason
> for the allocation.
>
>>
>> countGT(x)
>> @time countGT(x)
>>
>> Now we get
>> elapsed time: 0.23610454 seconds (16000096 bytes allocated)
>>
>> The number of bytes allocated seems to scale with the size of the
>> DataArray. So it seems that the mere act of accessing an element in a
>> DataArray allocates memory.
>>
>> I am wondering whether there could be a better way.
>
> I'm not familiar with DataArrays and its API, but I would guess it can
> use Nullable or something similar.
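For anyone following along without DataArrays installed, the pattern discussed above (check the mask first, then read the raw storage) can be sketched roughly as follows. This is only an illustration written in current Julia syntax: `ToyDataArray`, `ToyNA`, `unstable_get`, `toy_isna` and `count_gt` are hypothetical stand-ins made up for this example and are not part of the DataArrays API.

```julia
# Hypothetical stand-in for a DataArray: raw values plus a parallel mask
# marking which entries are missing. Not the real DataArrays type.
struct ToyDataArray
    data::Vector{Float64}
    na::Vector{Bool}
end

# Hypothetical stand-in for the NA value.
struct ToyNA end

# Type-unstable access: the return type is Union{ToyNA, Float64},
# because it depends on a runtime value in the mask.
unstable_get(x::ToyDataArray, i::Int) = x.na[i] ? ToyNA() : x.data[i]

# Type-stable check, analogous in spirit to isna(x, i): always returns a Bool.
toy_isna(x::ToyDataArray, i::Int) = x.na[i]

# Count non-missing entries above a threshold by consulting the mask and
# then reading the raw Float64 storage directly.
function count_gt(x::ToyDataArray, threshold::Float64)
    count = 0
    for i in 1:length(x.data)
        if !toy_isna(x, i) && x.data[i] > threshold
            count += 1
        end
    end
    return count
end

x = ToyDataArray([1.0, 6.0, 7.5, 9.0], [false, false, true, false])
println(count_gt(x, 5.0))  # prints 2: the entry 7.5 is masked out
```

The loop body only ever touches a `Bool` and a `Float64`, so every value in it has a single concrete type, which is the property the `isna(x, i)` plus `x.data[i]` version relies on.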
