Not including the readtable in the original code was a goof on my part;
excluding it would shave ~0.4 seconds off the baseline.

In the faster code I posted, ~60-70% of the time is spent in the GLM
fitting, so there is not much performance left to gain by tweaking the
data extraction code. The indexing into the array is 5% at most.
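
For what it's worth, that split can be checked with Base's sampling
profiler; a minimal sketch, where f is the fast function from my gist:

@profile f(ds)   ## run the fast version under the sampling profiler
Profile.print()  ## the call tree shows the share of samples inside GLM's fit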

It is very possible that the macro itself is not slow but rather the way
it is used. As I said, I don't know much about DataFramesMeta.
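
One way to check is to look at the code the macro generates, since any
cost would live in the expansion rather than in the macro call itself;
e.g. (with thisFm from Michael's loop):

macroexpand(:( @where(ds, :fm .== thisFm) ))  ## returns the expanded code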

On Friday, August 28, 2015 at 2:36:22 PM UTC+2, David Gold wrote:
>
> I'm traveling and cannot take the time to go through this code more 
> thoroughly. However, I see at least a few issues worth mentioning in 
> addition to the other comments:
>
> 1) Note that Kristoffer's function 'f' in
> https://gist.github.com/KristofferC/207dcc6f4f22eed73562 does not include
> the readtable operation, so that cost is not included in the "fast" time.
> Work is being done on a performant CSV reader, but I believe the present
> implementations in both DataFrames and Base are on par with or slower
> than R's reader.
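>
> A quick way to see the reader's share of the total is to time it on its
> own, e.g.:
>
> @time ds = readtable("xri.csv")  ## time the CSV read separately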
>
> 2) Unfortunately, naive indexing into DataFrames is not type-stable. That
> is to say, Julia's JIT compilation mechanism will not know the type of a
> value returned from indexing into a DataFrame. However, it will know the
> type if you pass the relevant columns of said DataFrame into a function
> and perform the relevant operations on the columns within that function.
> Thus you may see further performance improvements if you wrap the loop in
> Kristoffer's "fast" 'f' in another function that takes the relevant
> columns of the DataFrame as arguments.
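>
> A minimal sketch of that pattern (the function name is made up; the
> column is from Michael's code). Indexing ds element-by-element inside a
> hot loop hides the element type, whereas a function receiving the column
> as an argument is compiled for its concrete type:
>
> function count_year_ends(day, DPY)
>     n = 0
>     for d in day  ## day has a concrete type here
>         if mod(d, DPY) == 0 && d >= DPY
>             n += 1
>         end
>     end
>     return n
> end
>
> count_year_ends(ds[:day], DPY)  ## the loop runs on the typed column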
>
> 3) I haven't tested this hypothesis, but I doubt that the macros per se
> are incurring the performance penalty. In fact, the @with macro employs
> precisely the strategy noted above and for this reason is generally faster
> than the naive alternative. I suspect what is really contributing to the
> performance cost in those lines is the extensive use of broadcasted
> operators, since those allocate new arrays. If you can find a way to do
> those operations in-place, you may save some time there.
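>
> A hedged sketch of the in-place idea for the w_ols line (column names
> are from Michael's code; the loop lives in a function, per point 2, and
> writes into preallocated storage instead of allocating temporaries):
>
> function shrink_weights!(w, sd_ols, xsect_sd)
>     for i in 1:length(w)
>         x2 = xsect_sd[i]^2
>         w[i] = x2 / (sd_ols[i]^2 + x2)  ## same formula, no temporary arrays
>     end
>     return w
> end
>
> ds[:w_ols] = shrink_weights!(Array(Float64, size(ds, 1)),
>                              ds[:sd_ols], ds[:xsect_sd])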
>
> 4) Naive use of anonymous functions currently can incur a performance 
> penalty, which may be affecting the 'join' operation in your original code.
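>
> A sketch of what I mean (the helper name is made up): pass a named
> function to 'by' instead of the do-block closure:
>
> xsect_stats(d) = DataFrame(xsect_mean = mean(dropna(d[:b_ols])),
>                            xsect_sd = std(dropna(d[:b_ols])))
>
> ds = join(ds, by(xsect_stats, ds, :day), on = [:day], kind = :inner)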
>
> On Wednesday, August 26, 2015 at 4:43:13 PM UTC-6, Michael Wang wrote:
>>
>> Thank you Kristoffer. This helps a lot.
>>
>> On Sunday, August 23, 2015 at 12:46:04 PM UTC-7, Kristoffer Carlsson wrote:
>>>
>>> What's nice about Julia is that if a library or something else is slow,
>>> it is usually easy to tweak it to get your performance back.
>>>
>>> For example, if we focus only on the beta regressions, here is your
>>> current code (slightly modified to exclude the CSV reading from the
>>> benchmark): https://gist.github.com/KristofferC/f424e1d7daf4c6496ad2
>>>
>>> On my computer it takes this amount of time:
>>>
>>> elapsed time: 4.925883456 seconds (1697125616 bytes allocated, 31.27% gc time)
>>>
>>> However, we are not using all the information we have. For example, from
>>> your file we can see that the entries are sorted. We can therefore quickly
>>> find the candidate rows that have the correct "fm" and then, from these
>>> candidate rows, find the ones that are within the window. Something like
>>> this: https://gist.github.com/KristofferC/207dcc6f4f22eed73562. I haven't
>>> tested this carefully, so there might be some off-by-one errors, but it
>>> should do the same amount of work.
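>>>
>>> The core of the idea, as a hedged sketch (assuming :fm and :day can be
>>> treated as plain sorted vectors with no NAs, with thisFm and thisDay
>>> from the loop): binary search with Base's searchsortedfirst and
>>> searchsortedlast locates the candidate block directly instead of
>>> scanning the whole frame.
>>>
>>> fm, day = ds[:fm], ds[:day]
>>> lo = searchsortedfirst(fm, thisFm)  ## first row with this fm
>>> hi = searchsortedlast(fm, thisFm)   ## last row with this fm
>>> a = searchsortedfirst(day, thisDay - NWINDOW, lo, hi, Base.Order.Forward)
>>> b = searchsortedlast(day, thisDay - 1, lo, hi, Base.Order.Forward)
>>> dataSubset = ds[a:b, :]  ## rows within the window for this fm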
>>>
>>> This one takes:
>>> elapsed time: 0.220236023 seconds (65040756 bytes allocated, 22.00% gc time)
>>>
>>> I don't know if you can tell DataFramesMeta that columns are sorted; if
>>> so, it could probably be sped up.
>>>
>>>
>>> On Sunday, August 23, 2015 at 1:01:28 AM UTC+2, Michael Wang wrote:
>>>>
>>>> I am new to Julia. I heard that Julia has performance comparable to C++
>>>> even though it is a high-level language. I tested an example on my
>>>> machine; however, the result was that Julia was in the same ballpark as
>>>> R, not as C++. Here is my code.
>>>> R:
>>>> ptm <- proc.time()
>>>>
>>>> DPY <- 252  ## days per year
>>>> NWINDOW <- 126  ## can be smaller or larger than 252
>>>>
>>>> ds <- read.csv("xri.csv")  ## a sample data set
>>>>
>>>> ## PS: this is much faster than assigning to a data frame in a loop
>>>> b.ols <- sd.ols <- rep(NA, nrow(ds))
>>>>
>>>> for (i in 1:nrow(ds)) {
>>>>     thisday <- ds$day[i]
>>>>     if (thisday %% DPY != 0) next  ## calculate only on year end
>>>>     if (thisday < DPY) next  ## start only without NA
>>>>     thisfm <- ds$fm[i]
>>>>     datasubset <- subset(ds, (ds$fm==thisfm) &
>>>>                              (ds$day>=(thisday-NWINDOW)) &
>>>>                              (ds$day<=(thisday-1)))
>>>>     olsreg <- lm(xr ~ xm, data = datasubset)
>>>>     b.ols[i] <- coef(olsreg)[2]
>>>>     sd.ols[i] <- sqrt(vcov(olsreg)[2, 2])
>>>>     cat(i, " ")  ## ping me to see we are not dead for large data sets
>>>> }
>>>>
>>>> ds$b.ols <- b.ols
>>>> ds$sd.ols <- sd.ols
>>>>
>>>> cat("\nOLS Beta Regressions are Done\n")
>>>>
>>>> ds$xsect.sd <- ave(ds$b.ols, ds$day, FUN=function(x) sd(x, na.rm=T))
>>>> ds$xsect.mean <- ave(ds$b.ols, ds$day, FUN=function(x) mean(x, na.rm=T))
>>>>
>>>> cat("Cross-Sectional OLS Statistics are Done\n")
>>>>
>>>> ds <- within(ds, {
>>>>                  w.ols <- xsect.sd^2/(sd.ols^2+xsect.sd^2)
>>>>                  b.vck <- round(w.ols*b.ols + (1-w.ols)*xsect.mean,4)
>>>>                  b.ols <- round(b.ols,4)
>>>>              })
>>>>
>>>> cat("OLS and VCK are Done.  Now Writing Output.\n")
>>>>
>>>> proc.time() - ptm
>>>>
>>>>
>>>> The running time is around 30 seconds for R.
>>>>
>>>> Julia:
>>>> using DataFrames
>>>> using DataFramesMeta
>>>> using GLM
>>>>
>>>> tic()
>>>> DPY = 252  ## days per year
>>>> NWINDOW = 126  ## can be smaller or larger than 252
>>>>
>>>> ds = readtable("xri.csv")  ## a sample data set
>>>>
>>>> # create two empty arrays to store the b_ols and sd_ols values
>>>> b_ols = DataArray(Float64, size(ds, 1))
>>>> sd_ols = DataArray(Float64, size(ds, 1))
>>>>
>>>> for i = 1:size(ds, 1)
>>>>     thisDay = ds[i, :day]  ## Julia DataFrame way of accessing data; in R: ds$day[i]
>>>>     if mod(thisDay, DPY) != 0
>>>>         continue
>>>>     end
>>>>     if thisDay < DPY
>>>>         continue
>>>>     end
>>>>     thisFm = ds[i, :fm]
>>>>     dataSubset = @where(ds, (:fm .== thisFm) & (:day .>= (thisDay - NWINDOW)) & (:day .<= (thisDay - 1)))
>>>>     olsReg = fit(LinearModel, xr ~ xm, dataSubset)  ## OLS from package GLM
>>>>     b_ols[i] = coef(olsReg)[2]  ## the OLS coefficient
>>>>     sd_ols[i] = stderr(olsReg)[2]  ## the OLS coefficient's standard error
>>>>     print(i, " ")
>>>> end
>>>>
>>>> ds[:b_ols] = b_ols
>>>> ds[:sd_ols] = sd_ols
>>>>
>>>> print("\nOLS Beta Regressions are Done\n")
>>>>
>>>> ds = join(ds, by(ds, :day) do d
>>>>     DataFrame(xsect_mean = mean(dropna(d[:b_ols])),
>>>>               xsect_sd = std(dropna(d[:b_ols])))
>>>> end, on = [:day], kind = :inner)
>>>> ds = sort!(ds)
>>>>
>>>> print("Cross-Sectional OLS Statistics are Done\n")
>>>>
>>>> ds[:w_ols] = @with(ds, :xsect_sd.^2 ./ (:sd_ols.^2 + :xsect_sd.^2))
>>>> ds[:b_vck] = @with(ds, round(:w_ols .* :b_ols + (1 - :w_ols) .* :xsect_mean, 4))
>>>> ds[:b_ols] = @with(ds, round(:b_ols, 4))
>>>>
>>>> print("OLS and VCK are Done.  Now Writing Output.\n")
>>>>
>>>> toc()
>>>>
>>>> The running time is around 15 seconds for Julia.
>>>>
>>>> I tried C++, too. Producing the same output as R and Julia, C++ used
>>>> only 0.23 seconds. Can someone tell me why this is happening?