To clarify, there were actually two issues. One thing that may not be clear is that

> elapsed time: 0.23610454 seconds (16000096 bytes allocated)

tells you how many bytes were allocated, but it does not mention that most/all of those were (or will be) freed. In other words, this was _not_ symptomatic of a leak; it was just a message whose meaning could have been clearer. That behavior just changed in https://github.com/JuliaLang/julia/pull/11186. Hopefully it will be clearer going forward.

The issue that Yichao mentioned was more subtle and a much bigger problem, but I don't think this is what you were noticing, Mohammed. That more serious issue seems to be fixed by https://github.com/JuliaLang/julia/pull/11314

--Tim

On Sunday, May 17, 2015 05:53:14 PM Yichao Yu wrote:
> On Sun, May 17, 2015 at 5:05 PM, Mohammed El-Beltagy
> <[email protected]> wrote:
> > Many thanks Milan and Yichao, this was very informative. I am also
> > delighted that I helped in a very small way expose what appears to be a
> > problem with memory leakage.
>
> It was actually much worse than a memory leakage. It was actually
> freeing memory that is in use. (AFAICT, given how a GC works, it
> usually won't leak anything when it fires, but it can free something
> by mistake if the code that uses it is badly written.)
> See the explanation in the comment of this issue[1] for why GC roots (and
> friends) are important.
>
> [1] https://github.com/JuliaLang/julia/pull/11190#issuecomment-100066267
>
> > I love this community!
> >
> > On Sunday, May 17, 2015 at 7:51:59 PM UTC+2, Yichao Yu wrote:
> >> On Sun, May 17, 2015 at 12:52 PM, Milan Bouchet-Valat <[email protected]>
> >> wrote:
> >> > On Sunday, May 17, 2015 at 09:25 -0700, Mohammed El-Beltagy wrote:
> >> >
> >> > You are quite right about the type assertions and that @inbounds would
> >> > certainly speed things up.
> >> > However, I am concerned here with how memory was being allocated. I
> >> > wish that somebody who is familiar with DataArray would explain this
> >> > behavior.
> >> >
> >> > That's a known design issue with DataArrays, and the reason why John
> >> > Myles White has started working on Nullable and NullableArrays to
> >> > replace them. As
> >>
> >> Didn't know about this part of the story.
> >>
> >> P.S. your example leads me to hit
> >> https://github.com/JuliaLang/julia/issues/11313 . Thank you for
> >> exposing it....
> >>
> >> > Yichao noted, []/getindex is type-unstable for DataArrays as it can
> >> > return NA, and this kills performance in Julia.
> >> >
> >> > To improve performance, you can access the internals of the DataArray,
> >> > doing something like:
> >> >
> >> > function countGT(x::DataArray{Float64,1})
> >> >     count=0.0
> >> >     for i=1:length(x)
> >> >         if !isna(x, i)
> >> >             count+= (x.data[i]>5.0) ? 1.0 : 0.0
> >> >         end
> >> >     end
> >> >     count
> >> > end
> >> >
> >> > Always write isna(x, i) instead of isna(x[i]), since the latter suffers
> >> > from type instability.
> >> >
> >> > Regards
> >> >
> >> > On Sunday, May 17, 2015 at 6:12:11 PM UTC+2, Yichao Yu wrote:
> >> >
> >> > On Sun, May 17, 2015 at 11:28 AM, Mohammed El-Beltagy
> >> > <[email protected]> wrote:
> >> >> Today, while trying to optimize a piece of code, I came across a
> >> >> rather curious memory allocation behavior when accessing a DataArray.
> >> >>
> >> >> x=rand(1:10,1000000);
> >> >> function countGT(x::Array{Int,1})
> >> >
> >> > Since the algorithm is the same for both types, I think you don't need
> >> > the type assert here. Julia will automatically specialize on the type
> >> > you pass in.
> >> >
> >> >>     count=0
> >> >>     for i=1:length(x)
> >> >>         count+= (x[i]>5) ? 1 : 0
> >> >
> >> > Adding `@inbounds` here will improve the performance for `Array`. Not
> >> > sure if it can help with `DataArray` yet though.
> >> >
> >> >>     end
> >> >>     count
> >> >> end
> >> >>
> >> >> Here is what you get after running @time (compilation excluded)
> >> >>
> >> >> @time countGT(x);
> >> >> elapsed time: 0.00847156 seconds (96 bytes allocated)
> >> >>
> >> >> That is not too bad. @time at least allocated 80 bytes and the extra
> >> >> 16 bytes is for creating the variable "count", so far so good.
> >> >> Now let's see what happens if we do the same with a floating point
> >> >> array.
> >> >> x=rand(1000000);
> >> >> function countGT(x::Array{Float64,1})
> >> >>     count=0.0
> >> >>     for i=1:length(x)
> >> >>         count+= (x[i]>5.0) ? 1.0 : 0.0
> >> >>     end
> >> >>     count
> >> >> end
> >> >>
> >> >> countGT(x)
> >> >> @time countGT(x)
> >> >>
> >> >> You get
> >> >> elapsed time: 0.00177126 seconds (96 bytes allocated)
> >> >> Which is still pretty good. Now, the problem starts to show up when I
> >> >> have a DataArray
> >> >> x=@data rand(1000000);
> >> >> function countGT(x::DataArray{Float64,1})
> >> >>     count=0.0
> >> >>     for i=1:length(x)
> >> >>         count+= (x[i]>5.0) ? 1.0 : 0.0
> >> >>     end
> >> >>     count
> >> >> end
> >> >
> >> > `getindex` of DataArray appears to be not type stable. It returns
> >> > either `NAType` or the data type. I think this is probably the reason
> >> > for the allocation.
> >> >
> >> >> countGT(x)
> >> >> @time countGT(x)
> >> >>
> >> >> We get
> >> >> elapsed time: 0.23610454 seconds (16000096 bytes allocated)
> >> >>
> >> >> The bytes allocated seem to scale with the size of the DataArray. So
> >> >> it seems that the mere act of accessing an element in a DataArray
> >> >> allocates memory.
> >> >>
> >> >> I am wondering if there could be a better way.
> >> >
> >> > I'm not familiar with DataArrays and its API but I would guess it can
> >> > use Nullable or something similar.
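
For reference, the workaround Milan suggests in the quoted thread (check the missingness mask with isna(x, i), then read x.data[i] directly) can be sketched without DataArrays itself. This is only an illustrative sketch written for Julia 1.x syntax, not the DataArrays implementation: MaskedVector, isna_at and count_gt are hypothetical names made up for the example, and the two-field layout is an assumption modeled on the x.data access shown in the thread.

```julia
# Hypothetical stand-in for a DataArray: raw values plus a parallel
# missingness mask. This is NOT the DataArrays.jl type.
struct MaskedVector
    data::Vector{Float64}   # raw values, always Float64
    na::Vector{Bool}        # true where the corresponding value is missing
end

# Type-stable query on the mask, playing the role of isna(x, i) in the thread.
# It always returns a Bool, never a value-or-NA union.
isna_at(x::MaskedVector, i::Int) = x.na[i]

# Count the non-missing entries greater than `threshold`.
function count_gt(x::MaskedVector, threshold::Float64)
    count = 0
    for i in eachindex(x.data)
        # Consult the mask first, then read the raw Float64 directly.
        if !isna_at(x, i) && x.data[i] > threshold
            count += 1
        end
    end
    return count
end

# The third entry is flagged as missing, so only 6.0 and 9.0 are counted.
x = MaskedVector([1.0, 6.0, 7.5, 9.0], [false, false, true, false])
println(count_gt(x, 5.0))   # prints 2
```

Because both fields have concrete element types, every value read inside the loop is either a Float64 or a Bool. Indexing that can return either a value or NA is what the thread identifies as the likely source of the per-element allocation.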
