On Monday, 24 August 2015 at 07:40 -0700, Daniel Carrera wrote:
> Ugh... I tried to make this faster, but I can't.
> 
> It looks like one of the culprits is the "@where" line. On my
> computer your program takes 13s, of which 5s are spent on the
> "@where" line. As a test, I re-wrote the program in plain Julia up to
> the "@where" line (which I replaced by native Julia), and the time
> dropped from 5s to 0.018s. So we have evidence that, in principle,
> Julia should be able to perform reasonably. But unfortunately the GLM
> module only seems to accept DataFrames (it's hard to tell, the
> documentation is very poor).
> 
> All in all, it seems to me that the features provided by DataFrames
> come at a significant speed penalty compared to a simple Julia
> implementation.
> 
> Caveat: I do not use DataFrames.
The current performance issues with DataFrames are well known. Basically,
what is needed is a design that lets the compiler specialize on the types
of the columns you work on. For more details, see in particular these two
issues:
https://github.com/JuliaStats/DataFrames.jl/issues/744
https://github.com/JuliaStats/DataFrames.jl/issues/523
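
To illustrate what "specializing on the column types" means, here is a
minimal sketch in current Julia syntax. It is not how DataFrames is
implemented: the `cols` dictionary and both function names are hypothetical
stand-ins for a table whose column types the compiler cannot see.

```julia
# Hypothetical stand-in for a table: the column types are hidden behind `Any`.
cols = Dict{Symbol,Any}(:day => collect(1:1000), :xr => rand(1000))

# The loop runs over a value whose type is unknown at compile time,
# so every iteration and addition is dispatched at run time.
function sum_untyped(cols)
    s = 0.0
    for x in cols[:xr]
        s += x
    end
    return s
end

# The column is passed as an argument, so Julia compiles a version of this
# function specialized for the concrete type (here Vector{Float64}).
function sum_typed(x::AbstractVector{<:Real})
    s = 0.0
    for v in x
        s += v
    end
    return s
end

total = sum_typed(cols[:xr])  # one dynamic dispatch at the call, then a typed loop
```

Both functions return the same value; only the second gives the compiler
enough type information to generate a tight loop.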


But you should be able to get good performance by writing a function
that takes the columns you need as arrays, instead of taking the whole
data frame, so that the function is specialized on the column types. You
can also work with the columns as separate arrays throughout, and create
the DataFrame only just before calling GLM. Only the convenience functions
are not as fast as they could be.


Regards

> Cheers,
> Daniel.
> 
> 
> On Sunday, 23 August 2015 01:01:30 UTC+2, Michael Wang wrote:
> > I am new to Julia. I heard that even though Julia is a high-level
> > language, it has the speed of C or C++. I have tested an
> > example on my machine. Using the same input and producing the same
> > output, R takes around 30 seconds, Julia takes around 15 seconds,
> > while C++ takes only 0.23 seconds. Why is this happening? I have
> > attached my code and a sample dataset.
> > 
> > R code:
> > 
> > DPY <- 252  ## days per year
> > NWINDOW <- 126  ## can be smaller or larger than 252
> > 
> > ds <- read.csv("xri.csv")  ## a sample data set
> > 
> > b.ols <- sd.ols <- rep(NA, nrow(ds))
> > 
> > for (i in 1:nrow(ds)) {
> >     thisday <- ds$day[i]
> >     if (thisday %% DPY != 0) next  ## calculate only on year end
> >     if (thisday < DPY) next  ## start only without NA
> >     thisfm <- ds$fm[i]
> >     datasubset <- subset(ds, (ds$fm == thisfm) & (ds$day >= (thisday - NWINDOW)) & (ds$day <= (thisday - 1)))
> >     olsreg <- lm(xr ~ xm, data = datasubset)
> >     b.ols[i] <- coef(olsreg)[2]
> >     sd.ols[i] <- sqrt(vcov(olsreg)[2, 2])
> >     cat(i, " ")  ## ping me to see we are not dead for large data sets
> > }
> > 
> > ds$b.ols <- b.ols
> > ds$sd.ols <- sd.ols
> > 
> > cat("\nOLS Beta Regressions are Done\n")
> > 
> > ds$xsect.sd <- ave(ds$b.ols, ds$day, FUN=function(x) sd(x, na.rm=T))
> > ds$xsect.mean <- ave(ds$b.ols, ds$day, FUN=function(x) mean(x, na.rm=T))
> > 
> > cat("Cross-Sectional OLS Statistics are Done\n")
> > 
> > ds <- within(ds, {
> >                  w.ols <- xsect.sd^2/(sd.ols^2+xsect.sd^2)
> >                  b.vck <- round(w.ols*b.ols + (1 - w.ols)*xsect.mean, 4)
> >                  b.ols <- round(b.ols,4)
> >              })
> > 
> > cat("OLS and VCK are Done.  Now Writing Output.\n")
> > 
> > 
> > 
> > 
> > Julia code:
> > # load in the required package
> > using DataFrames
> > using DataFramesMeta
> > using GLM
> > 
> > tic()
> > DPY = 252  ## days per year
> > NWINDOW = 126  ## can be smaller or larger than 252
> > 
> > ds = readtable("xri.csv")  ## a sample data set
> > 
> > # create two empty arrays to store b_ols and sd_ols value
> > b_ols = DataArray(Float64, size(ds)[1])
> > sd_ols = DataArray(Float64, size(ds)[1])
> > 
> > for i = 1:size(ds)[1]
> >     thisDay = ds[i, :day] ## Julia DataFrame way of accessing data, in R: ds$day[i]
> >     if mod(thisDay, DPY) != 0
> >             continue
> >     end
> >     if thisDay < DPY
> >             continue
> >     end
> >     thisFm = ds[i, :fm]
> >     dataSubset = @where(ds, (:fm .== thisFm) & (:day .>= (thisDay - NWINDOW)) & (:day .<= (thisDay - 1)))
> >     ## DataFramesMeta usage: fast subsetting of a data frame. The dot operator,
> >     ## as in Matlab, denotes an element-wise operation.
> >     olsReg = fit(LinearModel, xr ~ xm, dataSubset) ## OLS from package GLM
> >     b_ols[i] = coef(olsReg)[2] ## returns the OLS coefficients
> >     sd_ols[i] = stderr(olsReg)[2] ## returns the OLS coefficients' standard errors
> >     print(i, " ")
> > end
> > 
> > # adding new columns to the ds dataframe
> > ds[:b_ols] = b_ols
> > ds[:sd_ols] = sd_ols
> > 
> > print("\nOLS Beta Regressions are Done\n")
> > 
> > ds = join(ds, by(ds, :day) do ds
> >     DataFrame(xsect_mean = mean(dropna(ds[:b_ols])), xsect_sd = std(dropna(ds[:b_ols])))
> > end, on = [:day], kind = :inner)
> > ds = sort!(ds)
> > 
> > print("Cross-Sectional OLS Statistics are Done\n")
> > 
> > # adding new columns and editing columns using DataFramesMeta
> > ds[:w_ols] = @with(ds, :xsect_sd.^2 ./ (:sd_ols.^2 + :xsect_sd.^2))
> > ds[:b_vck] = @with(ds, round(:w_ols .* :b_ols + (1 - :w_ols) .* :xsect_mean, 4))
> > ds[:b_ols] = @with(ds, round(:b_ols, 4))
> > 
> > print("OLS and VCK are Done.  Now Writing Output.\n")
> > 
> > toc()
> > 