Hi Kristoffer, I tried using your fast.jl code and found a very interesting result. On my machine, the first time I run the code, the time is

4.362875 seconds (4.67 M allocations: 229.209 MB, 1.59% gc time)

and the second time is

0.340492 seconds (825.38 k allocations: 52.617 MB, 4.49% gc time)

It is exactly the same line of code. Can you tell me why this is happening?

On Sunday, August 23, 2015 at 12:46:04 PM UTC-7, Kristoffer Carlsson wrote:
>
> What's good with Julia is that if a library or something is slow, it is usually
> easy to tweak it yourself to get your performance back.
>
> For example, if we only focus on the beta regressions, here is your current
> code (a little modified to remove the benchmarking of reading the csv):
> https://gist.github.com/KristofferC/f424e1d7daf4c6496ad2
>
> On my computer it takes this amount of time:
>
> elapsed time: 4.925883456 seconds (1697125616 bytes allocated, 31.27% gc time)
>
> However, we are not using all the information we have. For example, from your
> file we can see that the entries are sorted. We can therefore quickly find the
> candidate rows that have the correct "fm" and then, from these candidate rows,
> find the ones that are within the window. Something like this:
> https://gist.github.com/KristofferC/207dcc6f4f22eed73562. I haven't tested
> this carefully, so there might be some off-by-one errors, but it should do
> the same work.
>
> This one takes:
>
> elapsed time: 0.220236023 seconds (65040756 bytes allocated, 22.00% gc time)
>
> I don't know if you can tell DataFramesMeta that columns are sorted; if so,
> it can probably be sped up.
>
> On Sunday, August 23, 2015 at 1:01:28 AM UTC+2, Michael Wang wrote:
>>
>> I am new to Julia. I heard that Julia has the performance of C++ even though
>> it is a high-level language. I tested an example on my machine; however, the
>> result was that Julia was in the same ballpark as R, not C++. Here is my code.
>> R:
>>
>> ptm <- proc.time()
>>
>> DPY <- 252     ## days per year
>> NWINDOW <- 126 ## can be smaller or larger than 252
>>
>> ds <- read.csv("xri.csv") ## a sample data set
>>
>> ## PS: this is much faster than assigning to a data frame in a loop
>> b.ols <- sd.ols <- rep(NA, nrow(ds))
>>
>> for (i in 1:nrow(ds)) {
>>   thisday <- ds$day[i]
>>   if (thisday %% DPY != 0) next ## calculate only on year end
>>   if (thisday < DPY) next       ## start only without NA
>>   thisfm <- ds$fm[i]
>>   datasubset <- subset(ds, (ds$fm==thisfm) &
>>     (ds$day>=(thisday-NWINDOW)) & (ds$day<=(thisday-1)))
>>   olsreg <- lm(xr ~ xm, data = datasubset)
>>   b.ols[i] <- coef(olsreg)[2]
>>   sd.ols[i] <- sqrt(vcov(olsreg)[2, 2])
>>   cat(i, " ") ## ping me to see we are not dead for large data sets
>> }
>>
>> ds$b.ols <- b.ols
>> ds$sd.ols <- sd.ols
>>
>> cat("\nOLS Beta Regressions are Done\n")
>>
>> ds$xsect.sd <- ave(ds$b.ols, ds$day, FUN=function(x) sd(x, na.rm=T))
>> ds$xsect.mean <- ave(ds$b.ols, ds$day, FUN=function(x) mean(x, na.rm=T))
>>
>> cat("Cross-Sectional OLS Statistics are Done\n")
>>
>> ds <- within(ds, {
>>   w.ols <- xsect.sd^2/(sd.ols^2+xsect.sd^2)
>>   b.vck <- round(w.ols*b.ols + (1-w.ols)*xsect.mean, 4)
>>   b.ols <- round(b.ols, 4)
>> })
>>
>> cat("OLS and VCK are Done. Now Writing Output.\n")
>>
>> proc.time() - ptm
>>
>> The running time is around 30 seconds for R.
>> Julia:
>>
>> using DataFrames
>> using DataFramesMeta
>> using GLM
>>
>> tic()
>> DPY = 252     ## days per year
>> NWINDOW = 126 ## can be smaller or larger than 252
>>
>> ds = readtable("xri.csv") ## a sample data set
>>
>> # create two empty arrays to store the b_ols and sd_ols values
>> b_ols = DataArray(Float64, size(ds)[1])
>> sd_ols = DataArray(Float64, size(ds)[1])
>>
>> for i = 1:size(ds)[1]
>>     thisDay = ds[i, :day] ## Julia DataFrame way of accessing data; in R: ds$day[i]
>>     if mod(thisDay, DPY) != 0
>>         continue
>>     end
>>     if thisDay < DPY
>>         continue
>>     end
>>     thisFm = ds[i, :fm]
>>     dataSubset = @where(ds, (:fm .== thisFm) & (:day .>= (thisDay - NWINDOW)) &
>>                             (:day .<= (thisDay - 1)))
>>     olsReg = fit(LinearModel, xr ~ xm, dataSubset) ## OLS from package GLM
>>     b_ols[i] = coef(olsReg)[2]      ## the OLS coefficient
>>     sd_ols[i] = stderr(olsReg)[2]   ## the OLS coefficient's standard error
>>     print(i, " ")
>> end
>>
>> ds[:b_ols] = b_ols
>> ds[:sd_ols] = sd_ols
>>
>> print("\nOLS Beta Regressions are Done\n")
>>
>> ds = join(ds, by(ds, :day) do ds
>>     DataFrame(xsect_mean = mean(dropna(ds[:b_ols])),
>>               xsect_sd = std(dropna(ds[:b_ols])))
>> end, on = [:day], kind = :inner)
>> ds = sort!(ds)
>>
>> print("Cross-Sectional OLS Statistics are Done\n")
>>
>> ds[:w_ols] = @with(ds, :xsect_sd.^2 ./ (:sd_ols.^2 + :xsect_sd.^2))
>> ds[:b_vck] = @with(ds, round(:w_ols .* :b_ols + (1 - :w_ols) .* :xsect_mean, 4))
>> ds[:b_ols] = @with(ds, round(:b_ols, 4))
>>
>> print("OLS and VCK are Done. Now Writing Output.\n")
>>
>> toc()
>>
>> The running time is around 15 seconds for Julia.
>>
>> I tried C++, too. Producing the same output as R and Julia, C++ only used
>> 0.23 seconds. Can someone tell me why this is happening?
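The first-run/second-run difference at the top of the thread has a standard explanation: the first execution of any Julia code includes JIT-compiling every method it touches, and later calls in the same session reuse the compiled code. This is also why Julia benchmarks are conventionally wrapped in a function (global-scope code like the `tic()`/`toc()` script above defeats type inference) and timed after a warm-up call. A minimal sketch of the pattern, with an illustrative toy workload that is not from the thread:

```julia
# Hypothetical toy workload; the point is the timing pattern, not the math.
function bench(n)
    s = 0.0
    for i in 1:n
        s += sqrt(i)
    end
    return s
end

@time bench(10^6)   # first call: includes JIT compilation of `bench`
@time bench(10^6)   # second call: reuses the cached machine code, much faster
```

Only the second timing reflects steady-state run time; the compiled method stays cached for the rest of the session.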

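Kristoffer's "use the sort order" idea can be sketched as follows, assuming `fm` and `day` are plain sorted vectors pulled out of the data frame. The helper name `window_rows` and the toy data are illustrative; this is not the code from his gist.

```julia
# Rows are sorted by fm, then by day, so binary search can replace the
# full-table scan that subset()/@where performs for every regression.
function window_rows(fm, day, thisfm, lo, hi)
    block = searchsorted(fm, thisfm)       # contiguous rows for this fm
    days  = view(day, block)               # day is sorted within the block
    i = searchsortedfirst(days, lo)        # first day >= lo
    j = searchsortedlast(days, hi)         # last day <= hi
    return (first(block) - 1) .+ (i:j)     # row indices in the full table
end

fm  = [1, 1, 1, 2, 2, 2, 2, 3, 3]
day = [1, 2, 3, 1, 2, 3, 4, 1, 2]
window_rows(fm, day, 2, 2, 3)   # rows 5:6 -- fm == 2 with day in [2, 3]
```

Each window is found with two binary searches, O(log n) per regression instead of a scan over every row, which is consistent with the roughly 20x speedup reported above.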