Maybe you should provide minimal, working code with data, so that we can all give it a try. In the meantime: take a look at the Rprof function to see where your code can be improved.
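For reference, a minimal Rprof session might look like the sketch below. The workload here is a toy stand-in (repeated sorts of random vectors); you would substitute your own function call for that line.

```r
## Sketch: profile a chunk of code with Rprof, then summarize.
## The replicate() call is a placeholder for the slow code.
prof_file <- tempfile()
Rprof(prof_file)                        # start the sampling profiler
x <- replicate(50, sort(runif(1e5)))    # stand-in for your slow function
Rprof(NULL)                             # stop profiling
head(summaryRprof(prof_file)$by.self)   # functions taking the most time
```

summaryRprof's "by.self" table is usually the quickest way to see which single function dominates the run time.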
Good luck,
Bart

Daniel Folkinshteyn-2 wrote:
> Hi everyone!
>
> I have a question about data processing efficiency.
>
> My data are as follows: I have a data set on quarterly institutional
> ownership of equities; some of them have had recent IPOs, some have not
> (I have a binary flag set). The total dataset size is 700k+ rows.
>
> My goal is this: for every quarter since issue for each IPO, I need to
> find a "matched" firm in the same industry, and close in market cap. So,
> e.g., for firm X, which had an IPO, I need to find a matched non-issuing
> firm in quarter 1 since IPO, then a (possibly different) non-issuing
> firm in quarter 2 since IPO, etc. Repeat for each issuing firm (there
> are about 8300 of these).
>
> Thus it seems to me that I need to be doing a lot of data selection and
> subsetting, and looping (yikes!), but the result appears to be highly
> inefficient and takes ages (well, many hours). What I am doing, in
> pseudocode, is this:
>
> 1. For each quarter of data, get all the IPOs and all the eligible
>    non-issuing firms.
> 2. For each IPO in a quarter, grab all the non-issuers in the same
>    industry, sort them by size, and finally grab the matching firm
>    closest in size (the exact procedure is to grab the closest bigger
>    firm if one exists, and just the biggest available if all are
>    smaller).
> 3. Assign the matched firm-observation the same "quarters since issue"
>    as the IPO being matched.
> 4. rbind them all into the "matching" dataset.
>
> The function I currently have is pasted below, for your reference. Is
> there any way to make it produce the same result but much faster?
> Specifically, I am guessing eliminating some loops would be very good,
> but I don't see how, since I need to do some fancy footwork for each
> IPO in each quarter to find the matching firm. I'll be doing a few
> things similar to this, so it's somewhat important to up the efficiency
> of this. Maybe some of you R-fu masters can clue me in?
> :)
>
> I would appreciate any help, tips, tricks, tweaks, you name it! :)
>
> ========== my function below ===========
>
> fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata,
>         quarters_since_issue = 40) {
>
>     # rbind for matrix is cheaper, so typecast the result to matrix
>     result = matrix(nrow = 0, ncol = ncol(tfdata))
>     colnames = names(tfdata)
>
>     quarterends = sort(unique(tfdata$DATE))
>
>     for (aquarter in quarterends) {
>         tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
>
>         tfdata_quarter_fitting_nonissuers = tfdata_quarter[
>             (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
>             (tfdata_quarter$IPO.Flag == 0), ]
>         tfdata_quarter_ipoissuers = tfdata_quarter[
>             tfdata_quarter$IPO.Flag == 1, ]
>
>         for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
>             arow = tfdata_quarter_ipoissuers[i, ]
>             industrypeers = tfdata_quarter_fitting_nonissuers[
>                 tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>             industrypeers = industrypeers[order(industrypeers$Market.Cap.13f), ]
>             if (nrow(industrypeers) > 0) {
>                 if (nrow(industrypeers[industrypeers$Market.Cap.13f >=
>                         arow$Market.Cap.13f, ]) > 0) {
>                     bestpeer = industrypeers[industrypeers$Market.Cap.13f >=
>                         arow$Market.Cap.13f, ][1, ]
>                 } else {
>                     bestpeer = industrypeers[nrow(industrypeers), ]
>                 }
>                 bestpeer$Quarters.Since.IPO.Issue = arow$Quarters.Since.IPO.Issue
>                 #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO ==
>                 #    bestpeer$PERMNO] = 1
>                 result = rbind(result, as.matrix(bestpeer))
>             }
>         }
>         #result = rbind(result, tfdata_quarter)
>         print(aquarter)
>     }
>
>     result = as.data.frame(result)
>     names(result) = colnames
>     return(result)
> }
>
> ========= end of my function =============
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
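As a side note, the inner "closest bigger firm, else biggest" selection doesn't need a full sort of the peer set on every iteration. A sketch of that step in isolation (this is my own helper, not part of the posted function; `caps` stands for the peers' Market.Cap.13f values and `target` for the IPO's):

```r
## Sketch: return the index of the smallest peer with cap >= target,
## or of the largest peer if none is at least as big.
pick_peer <- function(caps, target) {
  bigger <- which(caps >= target)
  if (length(bigger) > 0) {
    bigger[which.min(caps[bigger])]  # closest peer at least as big
  } else {
    which.max(caps)                  # all smaller: take the biggest
  }
}

pick_peer(c(5, 12, 9, 20), 10)  # -> 2 (the peer with cap 12)
pick_peer(c(5, 9), 10)          # -> 2 (all smaller, biggest wins)
```

This avoids `order()` over the whole industry subset and replaces it with two O(n) scans.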
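One concrete bottleneck worth flagging even before profiling: growing `result` with `rbind` inside the loop copies everything accumulated so far on every iteration, which makes the whole run roughly quadratic in the number of matches. The usual remedy is to collect the pieces in a list and bind once at the end. A toy sketch of the pattern (the small data frames below are stand-ins for the per-IPO `bestpeer` rows):

```r
## Toy pieces standing in for the matched 'bestpeer' rows.
pieces <- lapply(1:100, function(i) data.frame(id = i, x = i^2))

## Slow pattern: copies 'result' on every iteration.
result <- pieces[[1]][0, ]
for (p in pieces) result <- rbind(result, p)

## Faster pattern: accumulate in a list, one rbind at the end.
result2 <- do.call(rbind, pieces)

stopifnot(identical(result$x, result2$x))
```

With 700k+ rows and ~8300 issuers, switching to the list-then-`do.call(rbind, ...)` pattern alone often turns hours into minutes, and it doesn't change the result.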