On Sunday, 23 August 2015 at 09:22 -0700, Andrew wrote:
> I tried wrapping your code in a function and compiling it by running 
> it once as a warmup. That does help: I went from 12s to 7s after 
> warmup. But you're still really far from C++. I don't really know 
> anything about DataFrame, but I suspect your problem is with 
> generating the subsets (the @where line). Your loop allocates 1.6 GB 
> of memory, so I think it's copying part of your data structure on 
> every iteration. 
> 
> I'm not really sure how you'd fix this, given that I don't know 
> anything about DataFrame, but I know in 0.4 and 0.5 they're moving 
> towards using array views instead of array copies. In that case, 
> taking a subset of an array creates a view (negligible memory) onto 
> the original array, rather than copying all the data. If that's the 
> problem, I suspect this will get faster in future versions.
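
To make the copy/view distinction concrete, here is a minimal sketch on plain arrays. It is written in current Julia 1.x syntax, not the 0.3/0.4 syntax used in this thread, and it assumes nothing about how DataFrames handles subsets internally. In Julia 1.x, indexing with a range still copies by default, and a view has to be requested explicitly with `@view`:

```julia
a = collect(1.0:10.0)

s = a[2:4]        # range indexing copies: s is a new 3-element Vector
v = @view a[2:4]  # a view: a SubArray that shares memory with a

a[2] = 100.0      # mutate the parent array

# The copy s is unaffected by the change; the view v sees it.
println(s[1])
println(v[1])
```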
Michael: What's the equivalent of DataFrame in C++? Or did you only use
low-level functions? If so, you should be able to get similar
performance in Julia. But we'd need to see the code to find out.
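
To show what I mean by low-level functions, here is a rough sketch. It is not your code and not what GLM does internally. It assumes the columns fm, day, xm and xr have been pulled out of the data frame as plain vectors, and `window_ols` is a hypothetical helper name I made up for illustration. It is written in current Julia 1.x syntax. It uses the textbook closed-form formulas for a one-regressor OLS with intercept, accumulating running sums in a single pass so that no subset is allocated:

```julia
# Hypothetical helper (not from the thread): regress xr on xm over the rows
# where fm == thisfm and thisday - nwindow <= day <= thisday - 1.
# Returns (slope, standard error of the slope), or (NaN, NaN) when fewer
# than 3 rows match.
function window_ols(fm::AbstractVector, day::AbstractVector,
                    xm::AbstractVector, xr::AbstractVector,
                    thisfm, thisday, nwindow)
    n = 0
    sx = sy = sxx = sxy = syy = 0.0
    for k in eachindex(fm, day, xm, xr)
        if fm[k] == thisfm && thisday - nwindow <= day[k] <= thisday - 1
            x = xm[k]
            y = xr[k]
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
            syy += y * y
        end
    end
    n > 2 || return (NaN, NaN)

    cxx = sxx - sx^2 / n      # centered sum of squares of x
    cxy = sxy - sx * sy / n   # centered cross product
    cyy = syy - sy^2 / n      # centered sum of squares of y

    slope = cxy / cxx
    rss = cyy - slope * cxy   # residual sum of squares
    se = sqrt(max(rss, 0.0) / (n - 2) / cxx)
    return (slope, se)
end
```

Inside the loop this would be called as `window_ols(fm, day, xm, xr, thisFm, thisDay, NWINDOW)`. The running-sum formulas are less numerically robust than a QR-based fit, so the last digits may differ from what `fit` returns.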


Regards

> On Saturday, August 22, 2015 at 7:01:28 PM UTC-4, Michael Wang wrote:
> > I am new to Julia. I heard that Julia's performance is comparable 
> > to C++ even though it is a high-level language. I tested an example 
> > on my machine; however, the result was that Julia was in the same 
> > ballpark as R, not C++. Here is my code.
> > R:
> > ptm <- proc.time()
> > 
> > DPY <- 252  ## days per year
> > NWINDOW <- 126  ## can be smaller or larger than 252
> > 
> > ds <- read.csv("xri.csv")  ## a sample data set
> > 
> > ## PS: this is much faster than assigning to a data frame in a loop
> > b.ols <- sd.ols <- rep(NA, nrow(ds))
> > 
> > for (i in 1:nrow(ds)) {
> >     thisday <- ds$day[i]
> >     if (thisday %% DPY != 0) next  ## calculate only on year end
> >     if (thisday < DPY) next  ## start only without NA
> >     thisfm <- ds$fm[i]
> >     datasubset <- subset(ds, (ds$fm == thisfm) &
> >                              (ds$day >= (thisday - NWINDOW)) &
> >                              (ds$day <= (thisday - 1)))
> >     olsreg <- lm(xr ~ xm, data = datasubset)
> >     b.ols[i] <- coef(olsreg)[2]
> >     sd.ols[i] <- sqrt(vcov(olsreg)[2, 2])
> >     cat(i, " ")  ## ping me to see we are not dead for large data sets
> > }
> > 
> > ds$b.ols <- b.ols
> > ds$sd.ols <- sd.ols
> > 
> > cat("\nOLS Beta Regressions are Done\n")
> > 
> > ds$xsect.sd <- ave(ds$b.ols, ds$day, FUN=function(x) sd(x, na.rm=T))
> > ds$xsect.mean <- ave(ds$b.ols, ds$day, FUN=function(x) mean(x, na.rm=T))
> > 
> > cat("Cross-Sectional OLS Statistics are Done\n")
> > 
> > ds <- within(ds, {
> >                  w.ols <- xsect.sd^2/(sd.ols^2+xsect.sd^2)
> >                  b.vck <- round(w.ols*b.ols + (1-w.ols)*xsect.mean,4)
> >                  b.ols <- round(b.ols,4)
> >              })
> > 
> > cat("OLS and VCK are Done.  Now Writing Output.\n")
> > 
> > proc.time() - ptm
> > 
> > 
> > The running time is around 30 seconds for R.
> > 
> > Julia:
> > using DataFrames
> > using DataFramesMeta
> > using GLM
> > 
> > tic()
> > DPY = 252  ## days per year
> > NWINDOW = 126  ## can be smaller or larger than 252
> > 
> > ds = readtable("xri.csv")  ## a sample data set
> > 
> > # create two empty arrays to store b_ols and sd_ols value
> > b_ols = DataArray(Float64, size(ds)[1])
> > sd_ols = DataArray(Float64, size(ds)[1])
> > 
> > for i = 1:size(ds)[1]
> >     thisDay = ds[i, :day]  ## Julia DataFrame way of accessing data, in R: ds$day[i]
> >     if mod(thisDay, DPY) != 0
> >             continue
> >     end
> >     if thisDay < DPY
> >             continue
> >     end
> >     thisFm = ds[i, :fm]
> >     dataSubset = @where(ds, (:fm .== thisFm) &
> >                         (:day .>= (thisDay - NWINDOW)) &
> >                         (:day .<= (thisDay - 1)))
> >     olsReg = fit(LinearModel, xr ~ xm, dataSubset)  ## OLS from package GLM
> >     b_ols[i] = coef(olsReg)[2]  ## the OLS slope coefficient
> >     sd_ols[i] = stderr(olsReg)[2]  ## the slope coefficient's standard error
> >     print(i, " ")
> > end
> > 
> > ds[:b_ols] = b_ols
> > ds[:sd_ols] = sd_ols
> > 
> > print("\nOLS Beta Regressions are Done\n")
> > 
> > ds = join(ds, by(ds, :day) do ds
> >     DataFrame(xsect_mean = mean(dropna(ds[:b_ols])),
> >               xsect_sd = std(dropna(ds[:b_ols])))
> > end, on = [:day], kind = :inner)
> > ds = sort!(ds)
> > 
> > print("Cross-Sectional OLS Statistics are Done\n")
> > 
> > ds[:w_ols] = @with(ds, :xsect_sd.^2 ./ (:sd_ols.^2 + :xsect_sd.^2))
> > ds[:b_vck] = @with(ds, round(:w_ols .* :b_ols + (1 - :w_ols) .* :xsect_mean, 4))
> > ds[:b_ols] = @with(ds, round(:b_ols, 4))
> > 
> > print("OLS and VCK are Done.  Now Writing Output.\n")
> > 
> > toc()
> > 
> > The running time is around 15 seconds for Julia.
> > 
> > I tried C++, too. Producing the same output as R and Julia, the C++ 
> > version took only 0.23 seconds. Can someone tell me why this is happening?
> > 
