On Sunday, 23 August 2015 at 09:22 -0700, Andrew wrote:
> I tried wrapping your code in a function and compiling it by running
> it once as a warmup. That does help, I went from 12s to 7s after
> warmup. But you're still really far from C++. I don't really know
> anything about DataFrame, but I suspect your problem is with
> generating the subsets (the @where line). Your loop allocates 1.6 GB
> of memory so I think it's copying part of your data structure with
> every iteration.
>
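For anyone following along, the warmup pattern described above can be sketched roughly as below. This is only an illustration: sum_of_squares is a made-up stand-in for the real loop, and the sketch assumes a current Julia 1.x release rather than the versions discussed in this thread. The first call pays for compilation, the second one runs the already compiled code.

```julia
# Hypothetical stand-in for the real workload; not the code from this thread.
function sum_of_squares(n)
    s = 0.0
    for i in 1:n
        s += i * i
    end
    return s
end

# First call: the measured time includes compiling sum_of_squares.
t_first = @elapsed sum_of_squares(10^6)

# Second call: the function is already compiled, so only the loop is timed.
t_second = @elapsed sum_of_squares(10^6)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

The actual timings depend on the machine, so none are shown here.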
> I'm not really sure how you'd fix this, given that I don't know
> anything about DataFrame, but I know in 0.4 and 0.5 they're moving
> towards using array views instead of array copies. In that case,
> taking a subset of an array creates a view (negligible memory) onto
> the original array, rather than copying all the data. If that's the
> problem, I suspect this will get faster in future versions.
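The difference between a copy and a view can be shown on a plain array. This is a minimal sketch assuming a current Julia 1.x release (not the 0.3/0.4 versions discussed here), and it says nothing about how DataFrames or @where behave internally.

```julia
# Assumes Julia 1.x. The array contents are made up for illustration.
x = collect(1.0:10.0)

c = x[3:6]        # indexing with a range allocates a new array and copies 4 elements
v = view(x, 3:6)  # a view refers to the memory of x and copies nothing

x[3] = 100.0      # modify the parent array

println(c[1])     # the copy still holds the old value
println(v[1])     # the view sees the change made to x
```

Because the view shares memory with x, creating it inside a loop does not copy the underlying data on each iteration.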
Michael: What's the equivalent of DataFrame in C++? Or did you only use
low-level functions? If so, you should be able to get similar
performance in Julia. But we'd need to see the code to find out.
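To illustrate what a low-level version could look like, here is a rough sketch of the per-window regression written against plain arrays. It is not the C++ code (which has not been posted) and not the GLM implementation: simple_ols is a hypothetical helper, the data is made up, and the sketch assumes a current Julia 1.x release and rows already sorted so that each window is a contiguous range.

```julia
# Slope and its standard error for y = a + b*x by ordinary least squares.
# Everything is accumulated in scalars, so no temporary arrays are allocated.
function simple_ols(x::AbstractVector{Float64}, y::AbstractVector{Float64})
    length(x) == length(y) || throw(ArgumentError("x and y must have the same length"))
    n = length(x)
    xbar = sum(x) / n
    ybar = sum(y) / n

    sxx = 0.0  # sum of squared deviations of x
    sxy = 0.0  # sum of cross products of deviations
    for k in eachindex(x, y)
        dx = x[k] - xbar
        sxx += dx * dx
        sxy += dx * (y[k] - ybar)
    end
    b = sxy / sxx          # slope
    a = ybar - b * xbar    # intercept

    ssr = 0.0  # sum of squared residuals
    for k in eachindex(x, y)
        r = y[k] - a - b * x[k]
        ssr += r * r
    end
    se = sqrt(ssr / (n - 2) / sxx)  # standard error of the slope
    return b, se
end

# Made-up data for a single firm: xr = 1 + 2*xm exactly, days 1..10.
xm = collect(1.0:10.0)
xr = 1.0 .+ 2.0 .* xm

# Regress over the window of days 4..9 using views, so nothing is copied.
b, se = simple_ols(view(xm, 4:9), view(xr, 4:9))
println("slope = ", b, ", standard error = ", se)
```

For a model with an intercept and a single regressor, the two returned values should correspond to coef(olsReg)[2] and stderr(olsReg)[2] in the posted code. With the exact linear data above, the slope comes out as 2 and the standard error as 0.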
Regards
> On Saturday, August 22, 2015 at 7:01:28 PM UTC-4, Michael Wang wrote:
> > I am new to Julia. I heard that Julia has performance comparable
> > to C++ even though it is a high-level language. I tested an example
> > on my machine; however, the result was that Julia was in the same
> > ballpark as R, not C++. Here is my code.
> > R:
> > ptm <- proc.time()
> >
> > DPY <- 252 ## days per year
> > NWINDOW <- 126 ## can be smaller or larger than 252
> >
> > ds <- read.csv("xri.csv") ## a sample data set
> >
> > ## PS: this is much faster than assigning to a data frame in a loop
> > b.ols <- sd.ols <- rep(NA, nrow(ds))
> >
> > for (i in 1:nrow(ds)) {
> >   thisday <- ds$day[i]
> >   if (thisday %% DPY != 0) next ## calculate only on year end
> >   if (thisday < DPY) next ## start only without NA
> >   thisfm <- ds$fm[i]
> >   datasubset <- subset( ds, (ds$fm==thisfm) & (ds$day>=(thisday-NWINDOW)) & (ds$day<=(thisday-1)) )
> >   olsreg <- lm(xr ~ xm, data = datasubset)
> >   b.ols[i] <- coef(olsreg)[2]
> >   sd.ols[i] <- sqrt(vcov(olsreg)[2, 2])
> >   cat(i, " ") ## ping me to see we are not dead for large data sets
> > }
> >
> > ds$b.ols <- b.ols
> > ds$sd.ols <- sd.ols
> >
> > cat("\nOLS Beta Regressions are Done\n")
> >
> > ds$xsect.sd <- ave(ds$b.ols, ds$day, FUN=function(x) sd(x, na.rm=T))
> > ds$xsect.mean <- ave(ds$b.ols, ds$day, FUN=function(x) mean(x, na.rm=T))
> >
> > cat("Cross-Sectional OLS Statistics are Done\n")
> >
> > ds <- within(ds, {
> >   w.ols <- xsect.sd^2/(sd.ols^2+xsect.sd^2)
> >   b.vck <- round(w.ols*b.ols + (1-w.ols)*xsect.mean,4)
> >   b.ols <- round(b.ols,4)
> > })
> >
> > cat("OLS and VCK are Done. Now Writing Output.\n")
> >
> > proc.time() - ptm
> >
> >
> > The running time is around 30 seconds for R.
> >
> > Julia:
> > using DataFrames
> > using DataFramesMeta
> > using GLM
> >
> > tic()
> > DPY = 252 ## days per year
> > NWINDOW = 126 ## can be smaller or larger than 252
> >
> > ds = readtable("xri.csv") ## a sample data set
> >
> > # create two empty arrays to store b_ols and sd_ols value
> > b_ols = DataArray(Float64, size(ds)[1])
> > sd_ols = DataArray(Float64, size(ds)[1])
> >
> > for i = 1:size(ds)[1]
> >     thisDay = ds[i, :day] ## Julia DataFrame way of accessing data, in R: ds$day[i]
> >     if mod(thisDay, DPY) != 0
> >         continue
> >     end
> >     if thisDay < DPY
> >         continue
> >     end
> >     thisFm = ds[i, :fm]
> >     dataSubset = @where(ds, (:fm .== thisFm) & (:day .>= (thisDay - NWINDOW)) & (:day .<= (thisDay - 1)))
> >     olsReg = fit(LinearModel, xr ~ xm, dataSubset) ## OLS from package GLM
> >     b_ols[i] = coef(olsReg)[2] ## returns the OLS coefficients
> >     sd_ols[i] = stderr(olsReg)[2] ## returns the OLS coefficients' standard error
> >     print(i, " ")
> > end
> >
> > ds[:b_ols] = b_ols
> > ds[:sd_ols] = sd_ols
> >
> > print("\nOLS Beta Regressions are Done\n")
> >
> > ds = join(ds, by(ds, :day) do ds
> >     DataFrame(xsect_mean = mean(dropna(ds[:b_ols])), xsect_sd = std(dropna(ds[:b_ols])))
> > end, on = [:day], kind = :inner)
> > ds = sort!(ds)
> >
> > print("Cross-Sectional OLS Statistics are Done\n")
> >
> > ds[:w_ols] = @with(ds, :xsect_sd.^2 ./ (:sd_ols.^2 + :xsect_sd.^2))
> > ds[:b_vck] = @with(ds, round(:w_ols .* :b_ols + (1 - :w_ols) .* :xsect_mean, 4))
> > ds[:b_ols] = @with(ds, round(:b_ols, 4))
> >
> > print("OLS and VCK are Done. Now Writing Output.\n")
> >
> > toc()
> >
> > The running time is around 15 seconds for Julia.
> >
> > I tried C++, too. Producing the same output as R and Julia, C++
> > took only 0.23 seconds. Can someone tell me why this is happening?
> >