I think the posting guide may not be clear enough and have suggested that it be clarified. Hopefully this better communicates what is required and why in a shorter amount of space:
https://stat.ethz.ch/pipermail/r-devel/2008-June/049891.html

On Fri, Jun 6, 2008 at 1:25 PM, Daniel Folkinshteyn <[EMAIL PROTECTED]> wrote:
> i thought since the function code (which i provided in full) was pretty
> short, it would be reasonably easy to just read the code and see what it's
> doing.
>
> but ok, so... i am attaching a zip file, with a small sample of the data set
> (tab delimited), and the function code, in a zip file (posting guidelines
> claim that "some archive formats" are allowed, i assume zip is one of
> them...
>
> would appreciate your comments! :)
>
> on 06/06/2008 12:05 PM Gabor Grothendieck said the following:
>>
>> Its summarized in the last line to r-help. Note reproducible and
>> minimal.
>>
>> On Fri, Jun 6, 2008 at 12:03 PM, Daniel Folkinshteyn <[EMAIL PROTECTED]> wrote:
>>>
>>> i did! what did i miss?
>>>
>>> on 06/06/2008 11:45 AM Gabor Grothendieck said the following:
>>>>
>>>> Try reading the posting guide before posting.
>>>>
>>>> On Fri, Jun 6, 2008 at 11:12 AM, Daniel Folkinshteyn <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>> Anybody have any thoughts on this? Please? :)
>>>>>
>>>>> on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:
>>>>>>
>>>>>> Hi everyone!
>>>>>>
>>>>>> I have a question about data processing efficiency.
>>>>>>
>>>>>> My data are as follows: I have a data set on quarterly institutional
>>>>>> ownership of equities; some of them have had recent IPOs, some have
>>>>>> not (I have a binary flag set). The total dataset size is 700k+ rows.
>>>>>>
>>>>>> My goal is this: For every quarter since issue for each IPO, I need to
>>>>>> find a "matched" firm in the same industry, and close in market cap.
>>>>>> So, e.g., for firm X, which had an IPO, i need to find a matched
>>>>>> non-issuing firm in quarter 1 since IPO, then a (possibly different)
>>>>>> non-issuing firm in quarter 2 since IPO, etc. Repeat for each issuing
>>>>>> firm (there are about 8300 of these).
>>>>>>
>>>>>> Thus it seems to me that I need to be doing a lot of data selection
>>>>>> and subsetting, and looping (yikes!), but the result appears to be
>>>>>> highly inefficient and takes ages (well, many hours). What I am doing,
>>>>>> in pseudocode, is this:
>>>>>>
>>>>>> 1. for each quarter of data, getting out all the IPOs and all the
>>>>>>    eligible non-issuing firms.
>>>>>> 2. for each IPO in a quarter, grab all the non-issuers in the same
>>>>>>    industry, sort them by size, and finally grab a matching firm
>>>>>>    closest in size (the exact procedure is to grab the closest bigger
>>>>>>    firm if one exists, and just the biggest available if all are
>>>>>>    smaller)
>>>>>> 3. assign the matched firm-observation the same "quarters since issue"
>>>>>>    as the IPO being matched
>>>>>> 4. rbind them all into the "matching" dataset.
>>>>>>
>>>>>> The function I currently have is pasted below, for your reference. Is
>>>>>> there any way to make it produce the same result but much faster?
>>>>>> Specifically, I am guessing eliminating some loops would be very good,
>>>>>> but I don't see how, since I need to do some fancy footwork for each
>>>>>> IPO in each quarter to find the matching firm. I'll be doing a few
>>>>>> things similar to this, so it's somewhat important to up the
>>>>>> efficiency of this. Maybe some of you R-fu masters can clue me in? :)
>>>>>>
>>>>>> I would appreciate any help, tips, tricks, tweaks, you name it! :)
>>>>>>
>>>>>> ========== my function below ===========
>>>>>>
>>>>>> fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata,
>>>>>>         quarters_since_issue=40) {
>>>>>>
>>>>>>     # rbind for matrix is cheaper, so typecast the result to matrix
>>>>>>     result = matrix(nrow=0, ncol=ncol(tfdata))
>>>>>>
>>>>>>     colnames = names(tfdata)
>>>>>>
>>>>>>     quarterends = sort(unique(tfdata$DATE))
>>>>>>
>>>>>>     for (aquarter in quarterends) {
>>>>>>         tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
>>>>>>
>>>>>>         tfdata_quarter_fitting_nonissuers = tfdata_quarter[
>>>>>>             (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
>>>>>>             (tfdata_quarter$IPO.Flag == 0), ]
>>>>>>         tfdata_quarter_ipoissuers = tfdata_quarter[
>>>>>>             tfdata_quarter$IPO.Flag == 1, ]
>>>>>>
>>>>>>         for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
>>>>>>             arow = tfdata_quarter_ipoissuers[i,]
>>>>>>             industrypeers = tfdata_quarter_fitting_nonissuers[
>>>>>>                 tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>>>>>>             industrypeers = industrypeers[
>>>>>>                 order(industrypeers$Market.Cap.13f), ]
>>>>>>             if ( nrow(industrypeers) > 0 ) {
>>>>>>                 if ( nrow(industrypeers[industrypeers$Market.Cap.13f >=
>>>>>>                         arow$Market.Cap.13f, ]) > 0 ) {
>>>>>>                     bestpeer = industrypeers[industrypeers$Market.Cap.13f >=
>>>>>>                         arow$Market.Cap.13f, ][1,]
>>>>>>                 }
>>>>>>                 else {
>>>>>>                     bestpeer = industrypeers[nrow(industrypeers),]
>>>>>>                 }
>>>>>>                 bestpeer$Quarters.Since.IPO.Issue =
>>>>>>                     arow$Quarters.Since.IPO.Issue
>>>>>>
>>>>>>                 #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] = 1
>>>>>>                 result = rbind(result, as.matrix(bestpeer))
>>>>>>             }
>>>>>>         }
>>>>>>         #result = rbind(result, tfdata_quarter)
>>>>>>         print (aquarter)
>>>>>>     }
>>>>>>
>>>>>>     result = as.data.frame(result)
>>>>>>     names(result) = colnames
>>>>>>     return(result)
>>>>>>
>>>>>> }
>>>>>>
>>>>>> ========= end of my function =============
>>>>>>
>>>>> ______________________________________________
>>>>> R-help@r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
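
For illustration only, here is a rough sketch of what a minimal, self-contained, reproducible version of the quoted question might look like, together with one possible way to do the four quoted steps without growing the result via rbind inside a loop. The column names are taken from the quoted function; the ten rows of data and the name match_nonissuers are invented for this sketch and are not from the original post. It assumes there are no missing values in the columns used for matching, and it has not been timed against the 700k+ row data set.

```r
# Toy data, invented for illustration; column names follow the quoted function.
tfdata <- data.frame(
  DATE   = rep(c("2007-03-31", "2007-06-30"), each = 5),
  PERMNO = rep(1:5, times = 2),
  HSICIG = rep(c("A", "A", "A", "A", "B"), times = 2),
  Market.Cap.13f = c(100, 50, 120, 300, 80,
                     400, 60,  90, 310, 85),
  IPO.Flag = rep(c(1, 0, 0, 0, 1), times = 2),
  Quarters.Since.Latest.Issue = c(1, 50, 60, 70, 1,
                                  2, 51, 61, 71, 2),
  Quarters.Since.IPO.Issue = c(1, NA, NA, NA, 1,
                               2, NA, NA, NA, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper (name and structure are assumptions of this sketch).
# It collects the row index of each matched firm and subsets tfdata once at
# the end, rather than calling rbind once per IPO.
match_nonissuers <- function(tfdata, quarters_since_issue = 40) {
  ipo_rows  <- which(tfdata$IPO.Flag == 1)
  peer_rows <- which(tfdata$IPO.Flag == 0 &
                     tfdata$Quarters.Since.Latest.Issue > quarters_since_issue)

  # Group the eligible non-issuers once by quarter and industry.
  peer_groups <- split(peer_rows,
                       paste(tfdata$DATE[peer_rows],
                             tfdata$HSICIG[peer_rows], sep = "|"))
  ipo_keys <- paste(tfdata$DATE[ipo_rows], tfdata$HSICIG[ipo_rows], sep = "|")
  cap <- tfdata$Market.Cap.13f

  matched <- vapply(seq_along(ipo_rows), function(k) {
    peers <- peer_groups[[ipo_keys[k]]]
    if (is.null(peers)) return(NA_integer_)   # no eligible peer: no match
    bigger <- peers[cap[peers] >= cap[ipo_rows[k]]]
    if (length(bigger) > 0) {
      bigger[which.min(cap[bigger])]          # closest firm at least as big
    } else {
      peers[which.max(cap[peers])]            # otherwise the biggest available
    }
  }, integer(1))

  keep   <- !is.na(matched)
  result <- tfdata[matched[keep], ]
  # Step 3: the matched firm inherits the IPO's quarters-since-issue value.
  result$Quarters.Since.IPO.Issue <-
    tfdata$Quarters.Since.IPO.Issue[ipo_rows[keep]]
  result
}

res <- match_nonissuers(tfdata)
print(res)
```

On the toy data this matches PERMNO 3 to the industry "A" IPO in the first quarter (closest bigger firm) and PERMNO 4 in the second quarter (all peers are smaller, so the biggest is taken); the industry "B" IPO has no eligible peers and is skipped. Two differences from the quoted function are worth checking before relying on it: rows come back in the row order of the IPOs in tfdata rather than quarter by quarter, and when several peers tie for the largest market cap this sketch picks the first of them where the quoted code picks the last.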