Not including the readtable in the original code was a goof on my part; it would shave off ~0.4 seconds from the baseline.

In the faster code I posted, ~60-70% of the time is spent in the GLM fitting, so there is not that much performance to gain from tweaking the data-extraction code; the indexing into the array is 5% at most. It is very possible that the macro itself is not slow, but rather the way it is used. As I said, I don't know much about DataFramesMeta. Rough sketches of the function barrier, the in-place updates, and the sorted lookup follow; the quoted thread below has the details.
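First, the function-barrier idea from David's point 2. This is only a minimal sketch with a made-up helper name (colsum) and a deliberately trivial computation, using the ds from the code below and assuming the column has no NAs; the point is the shape of the code, not the computation:

using DataFrames

# Indexing ds[:xr] inside a loop at top level is not type-certain,
# but passing the column into a function lets the compiler
# specialize the loop body on the column's element type.
function colsum(xr)
    s = 0.0
    for i in 1:length(xr)
        s += xr[i]   # xr's element type is known here, so this is fast
    end
    return s
end

s = colsum(ds[:xr])   # extract the column once, at the call site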
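Second, David's point 3 about the broadcasted operators. An expression like :xsect_sd.^2 ./ (:sd_ols.^2 + :xsect_sd.^2) allocates a new array for every intermediate result. A sketch of the same computation with a single up-front allocation (again assuming the columns contain no NAs):

sd_ols   = ds[:sd_ols]
xsect_sd = ds[:xsect_sd]
w_ols    = Array(Float64, length(sd_ols))   # the only allocation
for i in 1:length(w_ols)
    s2 = sd_ols[i]^2
    x2 = xsect_sd[i]^2
    w_ols[i] = x2 / (s2 + x2)   # no per-element temporary arrays
end
ds[:w_ols] = w_ols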
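And something in the spirit of the sorted-column trick from my gist, in miniature. Because :fm and :day are sorted, binary search can replace the full scan that @where does. Same caveat as before: this is untested and may have off-by-one errors:

fm  = ds[:fm]
day = ds[:day]
cand = searchsorted(fm, thisFm)   # range of rows with the matching fm
lo = first(cand) - 1 + searchsortedfirst(day[cand], thisDay - NWINDOW)
hi = first(cand) - 1 + searchsortedlast(day[cand], thisDay - 1)
dataSubset = ds[lo:hi, :]         # the rows the @where call would select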
On Friday, August 28, 2015 at 2:36:22 PM UTC+2, David Gold wrote:
>
> I'm traveling and cannot take the time to go through this code more
> thoroughly. However, I see at least a few issues worth mentioning in
> addition to the other comments:
>
> 1) Note that Kristoffer's function 'f' in
> https://gist.github.com/KristofferC/207dcc6f4f22eed73562 does not include
> the readtable operation, so that is not being included in the "fast" time.
> Work is being done on a performant CSV reader, but I believe the present
> implementations in both DataFrames and Base are either on a par with or
> slower than R's reader.
>
> 2) Unfortunately, naive indexing into DataFrames is not type-certain. That
> is to say, Julia's JIT compilation mechanism will not register the type
> information of a value returned from indexing into a DataFrame. However, it
> will detect this information if you pass the relevant columns of said
> DataFrame into a function and perform the relevant operations on the
> columns within the function. Thus you may see some further performance
> improvements if you wrap the loop in Kristoffer's "fast" 'f' in another
> function that takes the relevant columns of the DataFrame as arguments.
>
> 3) I haven't tested this hypothesis, but I doubt that the macros per se
> are incurring the performance penalty. In fact, the @with macro employs
> precisely the strategy noted above and for this reason is generally faster
> than the naive alternative. I suspect what is really contributing to the
> performance costs in those lines is the extensive use of broadcasted
> operators, since those allocate new arrays. If you can find a way to do
> those operations in place, you may save some time there.
>
> 4) Naive use of anonymous functions can currently incur a performance
> penalty, which may be affecting the 'join' operation in your original code.
>
> On Wednesday, August 26, 2015 at 4:43:13 PM UTC-6, Michael Wang wrote:
>>
>> Thank you, Kristoffer. This helps a lot.
>>
>> On Sunday, August 23, 2015 at 12:46:04 PM UTC-7, Kristoffer Carlsson wrote:
>>>
>>> What's good with Julia is that if a library or something is slow, it is
>>> usually easy to tweak it to get your performance.
>>>
>>> For example, if we only focus on the beta regressions, here is your
>>> current code (slightly modified to remove the csv reading from the
>>> benchmarking): https://gist.github.com/KristofferC/f424e1d7daf4c6496ad2
>>>
>>> On my computer it takes this amount of time:
>>>
>>> elapsed time: 4.925883456 seconds (1697125616 bytes allocated, 31.27% gc time)
>>>
>>> However, we are not using all the information we have. For example, from
>>> your file we can see that the entries are sorted. We can therefore quickly
>>> find the candidate rows that have the correct "fm" and then, from these
>>> candidate rows, find the ones that are within the window. Something like
>>> this: https://gist.github.com/KristofferC/207dcc6f4f22eed73562. I haven't
>>> tested this carefully, so there might be some off-by-one errors, but it
>>> should do the same amount of work.
>>>
>>> This one takes:
>>>
>>> elapsed time: 0.220236023 seconds (65040756 bytes allocated, 22.00% gc time)
>>>
>>> I don't know if you can tell DataFramesMeta that columns are sorted; if
>>> so, it can probably be sped up.
>>>
>>> On Sunday, August 23, 2015 at 1:01:28 AM UTC+2, Michael Wang wrote:
>>>>
>>>> I am new to Julia. I heard that Julia has C++-like performance even
>>>> though it is a high-level language. I tested an example on my machine;
>>>> however, the result was that Julia was in the same ballpark as R, not
>>>> C++. Here is my code.
>>>>
>>>> R:
>>>> ptm <- proc.time()
>>>>
>>>> DPY <- 252 ## days per year
>>>> NWINDOW <- 126 ## can be smaller or larger than 252
>>>>
>>>> ds <- read.csv("xri.csv") ## a sample data set
>>>>
>>>> ## PS: this is much faster than assigning to a data frame in a loop
>>>> b.ols <- sd.ols <- rep(NA, nrow(ds))
>>>>
>>>> for (i in 1:nrow(ds)) {
>>>>   thisday <- ds$day[i]
>>>>   if (thisday %% DPY != 0) next ## calculate only on year end
>>>>   if (thisday < DPY) next ## start only without NA
>>>>   thisfm <- ds$fm[i]
>>>>   datasubset <- subset(ds, (ds$fm==thisfm) &
>>>>     (ds$day>=(thisday-NWINDOW)) & (ds$day<=(thisday-1)))
>>>>   olsreg <- lm(xr ~ xm, data = datasubset)
>>>>   b.ols[i] <- coef(olsreg)[2]
>>>>   sd.ols[i] <- sqrt(vcov(olsreg)[2, 2])
>>>>   cat(i, " ") ## ping me to see we are not dead for large data sets
>>>> }
>>>>
>>>> ds$b.ols <- b.ols
>>>> ds$sd.ols <- sd.ols
>>>>
>>>> cat("\nOLS Beta Regressions are Done\n")
>>>>
>>>> ds$xsect.sd <- ave(ds$b.ols, ds$day, FUN=function(x) sd(x, na.rm=T))
>>>> ds$xsect.mean <- ave(ds$b.ols, ds$day, FUN=function(x) mean(x, na.rm=T))
>>>>
>>>> cat("Cross-Sectional OLS Statistics are Done\n")
>>>>
>>>> ds <- within(ds, {
>>>>   w.ols <- xsect.sd^2/(sd.ols^2+xsect.sd^2)
>>>>   b.vck <- round(w.ols*b.ols + (1-w.ols)*xsect.mean, 4)
>>>>   b.ols <- round(b.ols, 4)
>>>> })
>>>>
>>>> cat("OLS and VCK are Done. Now Writing Output.\n")
>>>>
>>>> proc.time() - ptm
>>>>
>>>> The running time is around 30 seconds for R.
>>>>
>>>> Julia:
>>>> using DataFrames
>>>> using DataFramesMeta
>>>> using GLM
>>>>
>>>> tic()
>>>> DPY = 252 ## days per year
>>>> NWINDOW = 126 ## can be smaller or larger than 252
>>>>
>>>> ds = readtable("xri.csv") ## a sample data set
>>>>
>>>> # create two empty arrays to store the b_ols and sd_ols values
>>>> b_ols = DataArray(Float64, size(ds, 1))
>>>> sd_ols = DataArray(Float64, size(ds, 1))
>>>>
>>>> for i = 1:size(ds, 1)
>>>>   thisDay = ds[i, :day] ## Julia DataFrame way of accessing data; in R: ds$day[i]
>>>>   if mod(thisDay, DPY) != 0
>>>>     continue
>>>>   end
>>>>   if thisDay < DPY
>>>>     continue
>>>>   end
>>>>   thisFm = ds[i, :fm]
>>>>   dataSubset = @where(ds, (:fm .== thisFm) &
>>>>     (:day .>= (thisDay - NWINDOW)) & (:day .<= (thisDay - 1)))
>>>>   olsReg = fit(LinearModel, xr ~ xm, dataSubset) ## OLS from package GLM
>>>>   b_ols[i] = coef(olsReg)[2] ## the OLS coefficients
>>>>   sd_ols[i] = stderr(olsReg)[2] ## the coefficients' standard errors
>>>>   print(i, " ")
>>>> end
>>>>
>>>> ds[:b_ols] = b_ols
>>>> ds[:sd_ols] = sd_ols
>>>>
>>>> print("\nOLS Beta Regressions are Done\n")
>>>>
>>>> ds = join(ds, by(ds, :day) do ds
>>>>   DataFrame(xsect_mean = mean(dropna(ds[:b_ols])),
>>>>             xsect_sd = std(dropna(ds[:b_ols])))
>>>> end, on = [:day], kind = :inner)
>>>> sort!(ds)
>>>>
>>>> print("Cross-Sectional OLS Statistics are Done\n")
>>>>
>>>> ds[:w_ols] = @with(ds, :xsect_sd.^2 ./ (:sd_ols.^2 + :xsect_sd.^2))
>>>> ds[:b_vck] = @with(ds, round(:w_ols .* :b_ols + (1 - :w_ols) .* :xsect_mean, 4))
>>>> ds[:b_ols] = @with(ds, round(:b_ols, 4))
>>>>
>>>> print("OLS and VCK are Done. Now Writing Output.\n")
>>>>
>>>> toc()
>>>>
>>>> The running time is around 15 seconds for Julia.
>>>>
>>>> I tried C++, too. Producing the same output as R and Julia, C++ took
>>>> only 0.23 seconds. Can someone tell me why this is happening?
