Hi Kristoffer, I tried using your fast.jl code, and found a very 
interesting result. On my machine, the first time I test the code, the time 
is

4.362875 seconds (4.67 M allocations: 229.209 MB, 1.59% gc time)

And the second time is 

0.340492 seconds (825.38 k allocations: 52.617 MB, 4.49% gc time)

It is exactly the same line. Can you tell me why this is happening?


On Sunday, August 23, 2015 at 12:46:04 PM UTC-7, Kristoffer Carlsson wrote:
>
> What's good with Julia is that if something is if a library or something 
> is slow it is usually easy to tweak it to get your performance.
>
> For example, if we only focus on the beta regressions, here is your 
> current code (a little modified to remove the benchmarking from reading the 
> csv): https://gist.github.com/KristofferC/f424e1d7daf4c6496ad2
>
> On my computer it takes this amount of time:
>
> elapsed time: 4.925883456 seconds (1697125616 bytes allocated, 31.27% gc 
> time)
>
> However, we are not using all the information we have. For example, from 
> your file we can see that the entries are sorted. We can therefore quickly 
> find the candidate rows that have the correct "fm" and then from these 
> candidate rows find the ones that are within the window. Something like 
> this: https://gist.github.com/KristofferC/207dcc6f4f22eed73562. I haven't 
> tested this carefully so it might be some off by one errors but it should 
> do the same amount of work.
>
> This one takes:
> elapsed time: 0.220236023 seconds (65040756 bytes allocated, 22.00% gc 
> time)
>
> I don't know if you can tell DataFramesMeta that columns are sorted, if so 
> it can probably be sped up.
>
>
> On Sunday, August 23, 2015 at 1:01:28 AM UTC+2, Michael Wang wrote:
>>
>> I am new to Julia. I heard that Julia has the performance with Cpp even 
>> though it is a high level language. I tested an example on my machine, 
>> however, the result was that Julia was in the same ballpark with R not with 
>> Cpp. Here is my codes.
>> R:
>> ptm <- proc.time()
>>
>> DPY <- 252  ## days per year
>> NWINDOW <- 126  ## can be smaller or larger than 252
>>
>> ds <- read.csv("xri.csv")  ## a sample data set
>>
>> ## PS: this is much faster than assigning to a data frame in a loop
>> b.ols <- sd.ols <- rep(NA, nrow(ds))
>>
>> for (i in 1:nrow(ds)) {
>>     thisday <- ds$day[i]
>>     if (thisday %% DPY != 0) next  ## calculate only on year end
>>     if (thisday < DPY) next  ## start only without NA
>>     thisfm <- ds$fm[i]
>>     datasubset <- subset( ds, (ds$fm==thisfm) & 
>> (ds$day>=(thisday-NWINDOW)) & (ds$day<=(thisday-1)) )
>>     olsreg <- lm(xr ~ xm, data = datasubset)
>>     b.ols[i] <- coef(olsreg)[2]
>>     sd.ols[i] <- sqrt(vcov(olsreg)[2, 2])
>>     cat(i, " ")  ## ping me to see we are not dead for large data sets
>> }
>>
>> ds$b.ols <- b.ols
>> ds$sd.ols <- sd.ols
>>
>> cat("\nOLS Beta Regressions are Done\n")
>>
>> ds$xsect.sd <- ave(ds$b.ols, ds$day, FUN=function(x) sd(x, na.rm=T))
>> ds$xsect.mean <- ave(ds$b.ols, ds$day, FUN=function(x) mean(x, na.rm=T))
>>
>> cat("Cross-Sectional OLS Statistics are Done\n")
>>
>> ds <- within(ds, {
>>                  w.ols <- xsect.sd^2/(sd.ols^2+xsect.sd^2)
>>                  b.vck <- round(w.ols*b.ols + (1-w.ols)*xsect.mean,4)
>>                  b.ols <- round(b.ols,4)
>>              })
>>
>> cat("OLS and VCK are Done.  Now Writing Output.\n")
>>
>> proc.time() - ptm
>>
>>
>> The running time is around 30 seconds for R.
>>
>> Julia:
>> using DataFrames
>> using DataFramesMeta
>> using GLM
>>
>> tic()
>> DPY = 252  ## days per year
>> NWINDOW = 126  ## can be smaller or larger than 252
>>
>> ds = readtable("xri.csv")  ## a sample data set
>>
>> # create two empty arrays to store b_ols and sd_ols value
>> b_ols = DataArray(Float64, size(ds)[1])
>> sd_ols = DataArray(Float64, size(ds)[1])
>>
>> for i = 1:size(ds)[1]
>> thisDay = ds[i, :day] ## Julia DataFrame way of accessing data, in R: 
>> ds$day[i]
>> if mod(thisDay, DPY) != 0
>> continue
>> end
>> if thisDay < DPY
>> continue
>> end
>> thisFm = ds[i, :fm]
>> dataSubset = @where(ds, (:fm .== thisFm) & (:day .>= (thisDay - NWINDOW)) 
>> & (:day .<= (thisDay - 1)))
>> olsReg = fit(LinearModel, xr ~ xm, dataSubset) ## OLS from package GLM
>> b_ols[i] = coef(olsReg)[2] ## returns the OLS coefficients
>> sd_ols[i] = stderr(olsReg)[2] ## returns the OLS coefficients' standard 
>> error
>> print(i, " ")
>> end
>>
>> ds[:b_ols] = b_ols
>> ds[:sd_ols] = sd_ols
>>
>> print("\nOLS Beta Regressions are Done\n")
>>
>> ds = join(ds, by(ds, :day) do ds
>>     DataFrame(xsect_mean = mean(dropna(ds[:b_ols])), xsect_sd = 
>> std(dropna(ds[:b_ols])))
>> end, on = [:day], kind = :inner)
>> ds = sort!(ds)
>>
>> print("Cross-Sectional OLS Statistics are Done\n")
>>
>> ds[:w_ols] = @with(ds, :xsect_sd.^2 ./ (:sd_ols.^2 + :xsect_sd.^2))
>> ds[:b_vck] = @with(ds, round(:w_ols .* :b_ols + (1 - :w_ols) .* 
>> :xsect_mean, 4))
>> ds[:b_ols] = @with(ds, round(:b_ols, 4))
>>
>> print("OLS and VCK are Done.  Now Writing Output.\n")
>>
>> toc()
>>
>> The running time is around 15 seconds for Julia.
>>
>> I tried C++, too. Having the same output with R and Julia, C++ only used 
>> 0.23 seconds. Can someone tell me why this is happening?
>>
>>
>>
>>
>>
>>

Reply via email to