The code ran to completion at least 17x faster than the equivalent Python implementation. Out of curiosity, I collected the elapsed time and % time spent in gc reported by @time and plotted them across the whole run:
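(For reference, a rough sketch of how the same per-iteration numbers could be collected programmatically rather than scraped from the @time printout; this is not necessarily how the numbers above were gathered. It reuses the outer loop from the code quoted below and assumes the 0.4-era @timed macro, which returns a (value, time, bytes, gctime, counters) tuple.)

    # hypothetical stat collection inside main(); elapsed[i] and gcfrac[i]
    # correspond to one iteration of the outer loop
    elapsed = Float64[]
    gcfrac  = Float64[]
    for i = 1:N-1
        stats = @timed for j = (i+1):N
            tot_i += 1
            compute!(S, ratios, tot_i, i, j, h_cols, c_cols, M)
        end
        push!(elapsed, stats[2])             # seconds for this outer iteration
        push!(gcfrac,  stats[4] / stats[2])  # fraction of that time spent in gc
    end

gcfrac can then be plotted directly against the outer-loop index.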
This was run on a shared cluster, so the spikes in time are probably due to other things going on with the node. The linear decrease in elapsed time confirms that the immutable type change fixed this issue. The trend in % time spent in gc is linear until the end, when the cost of gc became greater than the cost of the inner-loop computations, which makes sense. Not sure if anyone else cares about this plot besides me, but figured I'd toss it out there anyway.

On Tuesday, April 7, 2015 at 2:46:48 PM UTC-4, Mauro wrote:

> Glad that helped! (Also, not all immutables are inlined that way, only
> when isbits==true.)
>
> Maybe the gc time increased because it had to look at more and more
> objects as more and more Result instances were created?
>
> On Tue, 2015-04-07 at 18:24, Adam Labadorf <alab...@gmail.com> wrote:
>
> > Yes, that gives a massive improvement and I see the same gc behavior you
> > got. I could notice the gc increasing slowly after just watching for a
> > minute or two, and that seems to have stopped now. I'll know for sure
> > when I run the whole thing and will report back.
> >
> > Thanks so much to all of you for your swift replies. This was my first
> > Julia experience and it's great to have such a supportive community.
> >
> > On Tuesday, April 7, 2015 at 11:40:49 AM UTC-4, Mauro wrote:
> >
> >> Make Result an immutable:
> >>
> >> immutable Result
> >>     pvalue::Float64
> >>     i::Int64
> >>     j::Int64
> >> end
> >>
> >> That way the numbers are stored in the S array directly and all its
> >> memory is preallocated. Otherwise S just holds pointers to the Result
> >> objects. gc time is only 2% for me then (but I didn't run it for more
> >> than 10 min).
> >>
> >> However, why the gc time increases in the original code, I don't know.
> >> I don't think it should.
> >>
> >> On Tue, 2015-04-07 at 16:26, Adam Labadorf <alab...@gmail.com> wrote:
> >>
> >> > I moved compute! out of the main function and pass the ratios as you
> >> > suggested, and the performance is a bit better, but I still notice the
> >> > gc time increasing, albeit more slowly; new code below. The csv I'm
> >> > using is available at
> >> > https://www.dropbox.com/s/uqeisg1vx027gjc/all_mRNA_nonzero_norm_counts_trim.csv?dl=0
> >> >
> >> > using HypothesisTests, ArrayViews
> >> >
> >> > type Result
> >> >     pvalue::Float64
> >> >     i::Int64
> >> >     j::Int64
> >> > end
> >> >
> >> > function readtable(fn)
> >> >     fp = readcsv(fn)
> >> >     columns = fp[1,2:end]
> >> >     rows = fp[2:end,1]
> >> >     data = float(fp[2:end,2:end])
> >> >     return (columns,rows,data)
> >> > end
> >> > @time (cols, genes, counts) = readtable("../all_mRNA_nonzero_norm_counts_trim.csv")
> >> >
> >> > h_cols = find([c[1] == 'H' for c in cols])
> >> > c_cols = find([c[1] == 'C' for c in cols])
> >> >
> >> > # restrict to HD and control
> >> > # add a pseudocount to avoid division by zero errors
> >> > counts = counts[:,[h_cols;c_cols]] + 0.01
> >> >
> >> > h_cols = 1:length(h_cols)
> >> > c_cols = (length(h_cols)+1):size(counts,2)
> >> >
> >> > # arrays are stored in column order, transpose counts to make
> >> > # accessing more efficient
> >> > counts = transpose(counts)
> >> >
> >> > function compute!(S::Array{Result,1},ratios::Array{Float64,1},tot_i::Int64,i::Int64,j::Int64,h_cols::UnitRange{Int64},c_cols::UnitRange{Int64},M::Int64)
> >> >     t = UnequalVarianceTTest(view(ratios,h_cols),view(ratios,c_cols))
> >> >     S[tot_i] = Result(pvalue(t),i,j)
> >> > end
> >> >
> >> > function main(counts,genes,h_cols,c_cols)
> >> >
> >> >     N = size(genes,1)
> >> >     M = size(counts,1)
> >> >
> >> >     ratios = Array(Float64,M)
> >> >
> >> >     tot_i = 0
> >> >     tot = (N^2-N)/2
> >> >
> >> >     S = Array(Result,round(Int,tot))
> >> >
> >> >     for i=1:N-1
> >> >         @time for j=(i+1):N
> >> >             tot_i += 1
> >> >
> >> >             # use a sigmoid function to compress log ratios to [-10,10]
> >> >             b = 10
> >> >             for k=1:M
> >> >                 ratios[k] = b*(2/(1+exp(-log2(counts[k,i]/counts[k,j])/(b/2)))-1)
> >> >             end
> >> >
> >> >             compute!(S,ratios,tot_i,i,j,h_cols,c_cols,M)
> >> >         end
> >> >         @show (tot_i,tot,tot_i/tot)
> >> >     end
> >> >
> >> > end
> >> >
> >> > S = main(counts,genes,h_cols,c_cols)
> >> >
> >> > On Tuesday, April 7, 2015 at 8:47:06 AM UTC-4, Mauro wrote:
> >> >
> >> >> On Tue, 2015-04-07 at 06:14, Adam Labadorf <alab...@gmail.com> wrote:
> >> >>
> >> >> > Thanks for the replies. I took your suggestions (and reread the scope
> >> >> > section of the docs) and am still experiencing the gc creep. Below is
> >> >> > the complete program, with the notable changes that I wrapped the
> >> >> > main computation in a function and eliminated all references to
> >> >> > global variables inside. I'm also using the most recent nightly build
> >> >> > of 0.4. Overall this version of the code is much faster, but there is
> >> >> > still significant slowdown as the computation progresses. Is this
> >> >> > expected? Do you see anything I'm doing wrong?
> >> >> >
> >> >> > # julia v0.4.0-dev+4159
> >> >> > using HypothesisTests, ArrayViews
> >> >> >
> >> >> > type Result
> >> >> >     pvalue::Float64
> >> >> >     i::Int64
> >> >> >     j::Int64
> >> >> > end
> >> >> >
> >> >> > function readtable(fn)
> >> >> >     fp = readcsv(fn)
> >> >> >     columns = fp[1,2:end]
> >> >> >     rows = fp[:,2:end]
> >> >> >     data = float(fp[2:end,2:end])
> >> >> >     return (columns,rows,data)
> >> >> > end
> >> >> > @time (cols, genes, counts) = readtable("../all_mRNA_nonzero_norm_counts_trim.csv")
> >> >> >
> >> >> > h_cols = find([c[1] == 'H' for c in cols])
> >> >> > c_cols = find([c[1] == 'C' for c in cols])
> >> >> >
> >> >> > # filter out genes with zeros since it messes with the ratios
> >> >> > nonzero = mapslices(x -> !any(x.==0),counts,2)
> >> >> > counts = counts[find(nonzero),[h_cols;c_cols]]
> >> >> >
> >> >> > # slices seem to be faster
> >> >> > h_cols = 1:length(h_cols)
> >> >> > c_cols = (length(h_cols)+1):size(counts,2)
> >> >> >
> >> >> > # arrays are stored in column order, transpose counts to make
> >> >> > # accessing more efficient
> >> >> > counts = transpose(counts)
> >> >> >
> >> >> > genes = genes[find(nonzero)]
> >> >> >
> >> >> > function main(counts,genes,h_cols,c_cols)
> >> >> >
> >> >> >     N = size(genes,1)
> >> >> >     M = size(counts,1)
> >> >> >
> >> >> >     ratios = Array(Float64,M)
> >> >> >     function compute!(S::Array{Result,1},counts::Array{Float64,2},tot_i::Int64,i::Int64,j::Int64,h_cols::UnitRange{Int64},c_cols::UnitRange{Int64},M::Int64)
> >> >> >         for k=1:M
> >> >> >             ratios[k] = counts[k,i]/counts[k,j]
> >> >> >         end
> >> >> >         t = UnequalVarianceTTest(view(ratios,h_cols),view(ratios,c_cols))
> >> >> >         S[tot_i] = Result(pvalue(t),i,j)
> >> >> >     end
> >> >>
> >> >> Sadly, nested functions are often bad, as type-inference does not work
> >> >> properly on them, as Tim suggested. Consider this example:
> >> >>
> >> >> function a(n)
> >> >>     aa(x,y) = x*y
> >> >>     out = 0
> >> >>     for i=1:n
> >> >>         out += aa(i,i)
> >> >>     end
> >> >>     out
> >> >> end
> >> >>
> >> >> bb(x,y) = x*y
> >> >> function b(n)
> >> >>     out = 0
> >> >>     for i=1:n
> >> >>         out += bb(i,i)
> >> >>     end
> >> >>     out
> >> >> end
> >> >>
> >> >> n = 10^7
> >> >> @time a(n)
> >> >> @time a(n) # elapsed time: 0.680312065 seconds (1220 MB allocated, 2.71% gc time in 56 pauses with 0 full sweep)
> >> >> b(n)
> >> >> @time b(n) # elapsed time: 3.086e-6 seconds (192 bytes allocated)
> >> >>
> >> >> @code_warntype a(n) # see how the return type of the function is not inferred!
> >> >> @code_warntype b(n)
> >> >>
> >> >> Move compute! out of main and it should be better.
> >> >> >
> >> >> >     tot_i = 0
> >> >> >     tot = (N^2-N)/2
> >> >> >
> >> >> >     S = Array(Result,round(Int,tot))
> >> >> >
> >> >> >     for i=1:N-1
> >> >> >         @time for j=(i+1):N
> >> >> >             tot_i += 1
> >> >> >             compute!(S,counts,tot_i,i,j,h_cols,c_cols,M)
> >> >> >         end
> >> >> >     end
> >> >> >
> >> >> > end
> >> >> >
> >> >> > S = main(counts,genes,h_cols,c_cols)
> >> >> >
> >> >> > And the output:
> >> >> >
> >> >> > elapsed time: 0.427719149 seconds (23 MB allocated, 39.90% gc time in 2 pauses with 0 full sweep)
> >> >> > elapsed time: 0.031006382 seconds (14 MB allocated)
> >> >> > elapsed time: 0.131579099 seconds (14 MB allocated, 73.64% gc time in 1 pauses with 1 full sweep)
> >> >> > elapsed time: 0.140120717 seconds (14 MB allocated, 73.58% gc time in 1 pauses with 0 full sweep)
> >> >> > elapsed time: 0.030248237 seconds (14 MB allocated)
> >> >> > ...
> >> >> > elapsed time: 0.507894781 seconds (5 MB allocated, 97.65% gc time in 1 pauses with 0 full sweep)
> >> >> > elapsed time: 0.011821657 seconds (5 MB allocated)
> >> >> > elapsed time: 0.011610651 seconds (5 MB allocated)
> >> >> > elapsed time: 0.011816277 seconds (5 MB allocated)
> >> >> > elapsed time: 0.50779098 seconds (5 MB allocated, 97.65% gc time in 1 pauses with 0 full sweep)
> >> >> > elapsed time: 0.011997168 seconds (5 MB allocated)
> >> >> > elapsed time: 0.011721667 seconds (5 MB allocated)
> >> >> > elapsed time: 0.011561071 seconds (5 MB allocated)
> >> >>
> >> >> This looks ok-ish to me. The program runs, allocates memory, and every
> >> >> so often the memory is garbage collected. What is not ok is that the gc
> >> >> runs after only 20MB is allocated and that it takes so long. But at
> >> >> that point all the memory is used up, right? Maybe that is why it takes
> >> >> so long then?
> >> >>
> >> >> > On Saturday, April 4, 2015 at 12:38:46 PM UTC-4, Patrick O'Leary wrote:
> >> >> >
> >> >> >> Silly me, ignoring all the commented-out lines assuming they were
> >> >> >> comments... yes, this is almost certainly it.
> >> >> >>
> >> >> >> On Saturday, April 4, 2015 at 3:24:50 AM UTC-5, Tim Holy wrote:
> >> >> >>
> >> >> >>> Devectorization should never slow anything down. If it does, then
> >> >> >>> you have some other problem. Here, M is a global variable, and
> >> >> >>> that's probably what's killing you. Performance tip #1:
> >> >> >>> http://docs.julialang.org/en/latest/manual/performance-tips/
> >> >> >>>
> >> >> >>> --Tim
> >> >> >>>
> >> >> >>> On Friday, April 03, 2015 09:43:51 AM Adam Labadorf wrote:
> >> >> >>>
> >> >> >>> > Hi,
> >> >> >>> >
> >> >> >>> > I am struggling with an issue related to garbage collection taking
> >> >> >>> > up the vast majority (>99%) of compute time on a simple nested
> >> >> >>> > for loop.
> >> >> >>> > Code excerpt below:
> >> >> >>> >
> >> >> >>> > # julia version 0.3.7
> >> >> >>> > # counts is an MxN matrix of Float64
> >> >> >>> > # N=15000
> >> >> >>> > # M=108
> >> >> >>> > # h_cols and c_cols are indices \in {1:M}
> >> >> >>> > using HypothesisTests, ArrayViews
> >> >> >>> > ratios = Array(Float64,M)
> >> >> >>> > function compute!(S,tot_i::Int64,i::Int64,j::Int64)
> >> >> >>> >     ratios = view(counts,:,i)./view(counts,:,j)
> >> >> >>> >     #for k=1:M
> >> >> >>> >     #    ratios[k] = counts[k,i]/counts[k,j]
> >> >> >>> >     #end
> >> >> >>> >     #ratios = counts[:,i]./counts[:,j]
> >> >> >>> >     t = UnequalVarianceTTest(ratios[h_cols],ratios[c_cols])
> >> >> >>> >     S[tot_i] = (pvalue(t),i,j)
> >> >> >>> > end
> >> >> >>> >
> >> >> >>> > for i=1:N-1
> >> >> >>> >     @time for j=(i+1):N
> >> >> >>> >         tot_i += 1
> >> >> >>> >         compute!(S,tot_i,i,j)
> >> >> >>> >     end
> >> >> >>> > end
> >> >> >>> >
> >> >> >>> > The loop begins fast, output from @time:
> >> >> >>> >
> >> >> >>> > elapsed time: 1.023850054 seconds (62027220 bytes allocated)
> >> >> >>> > elapsed time: 0.170916977 seconds (45785624 bytes allocated)
> >> >> >>> > elapsed time: 0.171598156 seconds (45782760 bytes allocated)
> >> >> >>> > elapsed time: 0.173866309 seconds (45779896 bytes allocated)
> >> >> >>> > elapsed time: 0.170267172 seconds (45777032 bytes allocated)
> >> >> >>> > elapsed time: 0.171754713 seconds (45774168 bytes allocated)
> >> >> >>> > elapsed time: 0.170110142 seconds (45771304 bytes allocated)
> >> >> >>> > elapsed time: 0.175199053 seconds (45768440 bytes allocated)
> >> >> >>> > elapsed time: 0.179893161 seconds (45765576 bytes allocated)
> >> >> >>> > elapsed time: 0.212172824 seconds (45762712 bytes allocated)
> >> >> >>> > elapsed time: 0.252750549 seconds (45759848 bytes allocated)
> >> >> >>> > elapsed time: 0.254874855 seconds (45756984 bytes allocated)
> >> >> >>> > elapsed time: 0.231003319 seconds (45754120 bytes allocated)
> >> >> >>> > elapsed time: 0.235060195 seconds (45751256 bytes allocated)
> >> >> >>> > elapsed time: 0.235379355 seconds (45748392 bytes allocated)
> >> >> >>> > elapsed time: 0.927622743 seconds (45746168 bytes allocated, 77.65% gc time)
> >> >> >>> > elapsed time: 0.9132931 seconds (45742664 bytes allocated, 78.35% gc time)
> >> >> >>> >
> >> >> >>> > But as soon as it starts doing gc, the % time spent in gc increases
> >> >> >>> > almost indefinitely; output from @time much later:
> >> >> >>> >
> >> >> >>> > elapsed time: 0.174122929 seconds (36239160 bytes allocated)
> >> >> >>> > elapsed time: 18.535572658 seconds (36236168 bytes allocated, 99.22% gc time)
> >> >> >>> > elapsed time: 19.189478819 seconds (36233176 bytes allocated, 99.26% gc time)
> >> >> >>> > elapsed time: 21.812889439 seconds (36230184 bytes allocated, 99.35% gc time)
> >> >> >>> > elapsed time: 22.182467723 seconds (36227192 bytes allocated, 99.30% gc time)
> >> >> >>> > elapsed time: 0.169849999 seconds (36224200 bytes allocated)
> >> >> >>> >
> >> >> >>> > The inner loop, despite iterating over fewer and fewer indices, has
> >> >> >>> > massively increased the gc time, and therefore the overall
> >> >> >>> > execution time.
> >> >> >>> > I have tried many things, including creating the compute function,
> >> >> >>> > devectorizing the ratios calculation (which really slowed things
> >> >> >>> > down), using view and sub in various places, and profiling with
> >> >> >>> > --trace-allocation=all, but I can't figure out why this happens or
> >> >> >>> > how to fix it. Doing gc for 22 seconds obviously kills the
> >> >> >>> > performance, and since there are about 22M iterations this is
> >> >> >>> > prohibitive. Can anyone suggest something I can try to improve the
> >> >> >>> > performance here?
> >> >> >>> >
> >> >> >>> > Thanks,
> >> >> >>> > Adam
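To make the fix that resolved this thread concrete, here is a minimal sketch (0.4-era syntax, matching the code above) of the difference between the original mutable type and the immutable Mauro suggested; the mutable variant is renamed MutableResult here purely for contrast, and the isbits check is the condition Mauro mentions for the fields being stored inline.

    # original declaration: a mutable type. Each instance lives on the heap and
    # an Array of them stores pointers, all of which the gc has to trace.
    type MutableResult
        pvalue::Float64
        i::Int64
        j::Int64
    end

    # suggested fix: an immutable whose fields are all plain bits
    immutable Result
        pvalue::Float64
        i::Int64
        j::Int64
    end

    isbits(Result)            # true: instances can be stored inline
    S = Array(Result, 10^6)   # one flat, preallocated block of 24-byte slots
    S[1] = Result(0.5, 1, 2)  # writes the values in place, no heap allocation

With the mutable version, the same S would instead hold 10^6 pointers, and every Result constructed in the loop is another heap object for the collector to scan on each pass, which would be consistent with the creeping gc times reported above.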