Many thanks, Milan and Yichao, this was very informative. I am also delighted that I helped, in a very small way, to expose what appears to be a memory leak. I love this community!
On Sunday, May 17, 2015 at 7:51:59 PM UTC+2, Yichao Yu wrote:
>
> On Sun, May 17, 2015 at 12:52 PM, Milan Bouchet-Valat <[email protected]> wrote:
> > On Sunday, May 17, 2015 at 09:25 -0700, Mohammed El-Beltagy wrote:
> >
> > You are quite right about the type assertions and that @inbounds would
> > certainly speed things up.
> > However, I am concerned here with how memory was being allocated. I wish
> > that somebody who is familiar with DataArray would explain this behavior.
> >
> > That's a known design issue with DataArrays, and the reason why John Myles
> > White has started working on Nullable and NullableArrays to replace them. As
>
> Didn't know about this part of the story.
>
> P.S. your example leads me to hit
> https://github.com/JuliaLang/julia/issues/11313 . Thank you for
> exposing it....
>
> > Yichao noted, []/getindex is type-unstable for DataArrays as it can return
> > NA, and this kills performance in Julia.
> >
> > To improve performance, you can access the internals of the DataArray, doing
> > something like:
> >
> > function countGT(x::DataArray{Float64,1})
> >     count = 0.0
> >     for i = 1:length(x)
> >         if !isna(x, i)
> >             count += (x.data[i] > 5.0) ? 1.0 : 0.0
> >         end
> >     end
> >     count
> > end
> >
> > Always write isna(x, i) instead of isna(x[i]), since the latter suffers from
> > type instability.
> >
> > Regards
> >
> > On Sunday, May 17, 2015 at 6:12:11 PM UTC+2, Yichao Yu wrote:
> >
> > On Sun, May 17, 2015 at 11:28 AM, Mohammed El-Beltagy
> > <[email protected]> wrote:
> >> Today, while trying to optimize a piece of code, I came across some rather
> >> curious memory allocation behavior when accessing a DataArray.
> >>
> >> x = rand(1:10, 1000000);
> >> function countGT(x::Array{Int,1})
> >
> > Since the algorithm is the same for both types, I think you don't need
> > the type assert here. Julia will automatically specialize on the type
> > you pass in.
> >
> >>     count = 0
> >>     for i = 1:length(x)
> >>         count += (x[i] > 5) ? 1 : 0
> >
> > Adding `@inbounds` here will improve the performance for `Array`. Not
> > sure if it can help with `DataArray` yet though.
> >
> >>     end
> >>     count
> >> end
> >>
> >> Here is what you get after running @time (compilation excluded):
> >>
> >> @time countGT(x);
> >> elapsed time: 0.00847156 seconds (96 bytes allocated)
> >>
> >> That is not too bad. @time at least allocated 80 bytes and the extra 16
> >> bytes is for creating the variable "count", so far so good.
> >> Now let's see what happens if we do the same with a floating-point array.
> >>
> >> x = rand(1000000);
> >> function countGT(x::Array{Float64,1})
> >>     count = 0.0
> >>     for i = 1:length(x)
> >>         count += (x[i] > 5.0) ? 1.0 : 0.0
> >>     end
> >>     count
> >> end
> >>
> >> countGT(x)
> >> @time countGT(x)
> >>
> >> You get
> >> elapsed time: 0.00177126 seconds (96 bytes allocated)
> >> Which is still pretty good. Now, the problem starts to show up when I have a
> >> DataArray:
> >>
> >> x = @data rand(1000000);
> >> function countGT(x::DataArray{Float64,1})
> >>     count = 0.0
> >>     for i = 1:length(x)
> >>         count += (x[i] > 5.0) ? 1.0 : 0.0
> >>     end
> >>     count
> >> end
> >
> > `getindex` of DataArray appears to be not type stable. It returns
> > either `NAType` or the data type. I think this is probably the reason
> > for the allocation.
> >
> >>
> >> countGT(x)
> >> @time countGT(x)
> >>
> >> We get
> >> elapsed time: 0.23610454 seconds (16000096 bytes allocated)
> >>
> >> The bytes allocated seem to scale with the size of the DataArray. So it
> >> seems that the mere act of accessing an element in a DataArray allocates
> >> memory.
> >>
> >> I am wondering whether there could be a better way.
> >
> > I'm not familiar with DataArrays and its API but I would guess it can
> > use Nullable or something similar.

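For anyone finding this thread later, here is a minimal sketch of the idea behind Milan's suggestion: keep the raw values and the missingness mask separate, and query the mask by index, so that the loop never has to handle a "value or NA" result. It is only an illustration under stated assumptions. `MaskedVector`, `isna_at` and `count_gt` are hypothetical names made up for this example, not part of the DataArrays API, and the syntax is that of current Julia (1.x) rather than the 0.3 release used in the thread.

```julia
# Hypothetical stand-in for a DataArray: raw values plus a mask of missing
# entries. The real DataArrays type has more machinery; only the idea is shown.
struct MaskedVector{T}
    data::Vector{T}     # raw values, always of type T
    na::Vector{Bool}    # true where the entry is missing
end

# Type-stable missingness check, analogous in spirit to isna(x, i):
# it always returns a Bool and never touches the value itself.
isna_at(x::MaskedVector, i::Integer) = x.na[i]

# Count the non-missing entries greater than `threshold`, reading values
# only from the raw storage so every access has the concrete type T.
function count_gt(x::MaskedVector{T}, threshold::T) where {T}
    n = 0
    for i in eachindex(x.data)
        if !isna_at(x, i) && x.data[i] > threshold
            n += 1
        end
    end
    return n
end

# Third entry (6.0) is marked missing, so only 7.5 and 9.0 are counted.
x = MaskedVector([1.0, 7.5, 6.0, 9.0], [false, false, true, false])
result = count_gt(x, 5.0)
```

The counter is kept as an Int here, which is a small deviation from the Float64 counter in the functions quoted above; the point about avoiding the type-unstable element access is the same either way.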