Re: [R] Improving data processing efficiency
Anybody have any thoughts on this? Please? :)

on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:

Hi everyone! I have a question about data processing efficiency.

My data are as follows: I have a data set on quarterly institutional ownership of equities; some of them have had recent IPOs, some have not (I have a binary flag set). The total dataset size is 700k+ rows.

My goal is this: for every quarter since issue for each IPO, I need to find a matched firm in the same industry that is close in market cap. So, e.g., for firm X, which had an IPO, I need to find a matched non-issuing firm in quarter 1 since IPO, then a (possibly different) non-issuing firm in quarter 2 since IPO, etc. Repeat for each issuing firm (there are about 8300 of these).

Thus it seems to me that I need to be doing a lot of data selection and subsetting, and looping (yikes!), but the result is highly inefficient and takes ages (well, many hours). What I am doing, in pseudocode, is this:

1. For each quarter of data, get all the IPOs and all the eligible non-issuing firms.
2. For each IPO in a quarter, grab all the non-issuers in the same industry, sort them by size, and grab the matching firm closest in size (the exact procedure is to grab the closest bigger firm if one exists, and just the biggest available if all are smaller).
3. Assign the matched firm-observation the same quarters-since-issue as the IPO being matched.
4. rbind them all into the matching dataset.

The function I currently have is pasted below, for your reference. Is there any way to make it produce the same result but much faster? Specifically, I am guessing that eliminating some loops would be very good, but I don't see how, since I need to do some fancy footwork for each IPO in each quarter to find the matching firm. I'll be doing a few things similar to this, so it's somewhat important to up the efficiency of this. Maybe some of you R-fu masters can clue me in? :) I would appreciate any help, tips, tricks, tweaks, you name it! :)
=== my function below ===

fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, quarters_since_issue=40) {

    result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix is cheaper, so typecast the result to matrix
    colnames = names(tfdata)

    quarterends = sort(unique(tfdata$DATE))

    for (aquarter in quarterends) {
        tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]

        # eligible non-issuers: no recent issue, and not flagged as an IPO
        tfdata_quarter_fitting_nonissuers = tfdata_quarter[ (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) & (tfdata_quarter$IPO.Flag == 0), ]
        tfdata_quarter_ipoissuers = tfdata_quarter[ tfdata_quarter$IPO.Flag == 1, ]

        for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
            arow = tfdata_quarter_ipoissuers[i, ]

            # candidate peers: same industry, sorted by market cap
            industrypeers = tfdata_quarter_fitting_nonissuers[ tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
            industrypeers = industrypeers[ order(industrypeers$Market.Cap.13f), ]

            if ( nrow(industrypeers) > 0 ) {
                # closest peer at least as large as the IPO, if any; otherwise the largest available
                if ( nrow(industrypeers[ industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0 ) {
                    bestpeer = industrypeers[ industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1, ]
                } else {
                    bestpeer = industrypeers[ nrow(industrypeers), ]
                }
                bestpeer$Quarters.Since.IPO.Issue = arow$Quarters.Since.IPO.Issue
                #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] = 1
                result = rbind(result, as.matrix(bestpeer))
            }
        }
        #result = rbind(result, tfdata_quarter)
        print(aquarter)
    }

    result = as.data.frame(result)
    names(result) = colnames
    return(result)
}

= end of my function =
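A sketch of one way to drop the inner loop over IPOs, for illustration only; this is not the poster's code, and it assumes the column names used in the function above. Within one quarter and one industry group, findInterval() on the sorted market caps picks, for each IPO, the closest peer with market cap at least as large, falling back to the largest peer when none is.

    match_peers_in_group = function(ipos, nonissuers) {
        # illustrative sketch only -- column names assumed from the function above
        peers = nonissuers[ order(nonissuers$Market.Cap.13f), ]   # sort candidates once
        if (nrow(peers) == 0) return(peers)                        # no eligible peer in this group
        caps = peers$Market.Cap.13f
        # count of peers strictly smaller than each IPO's cap, plus one,
        # is the index of the closest peer with cap >= the IPO's cap
        idx = findInterval(ipos$Market.Cap.13f, caps, left.open = TRUE) + 1
        idx[idx > nrow(peers)] = nrow(peers)                       # none at least as large: take the largest
        matched = peers[idx, ]
        matched$Quarters.Since.IPO.Issue = ipos$Quarters.Since.IPO.Issue
        matched
    }

Splitting tfdata once by DATE and HSICIG, applying a function like this to each piece, and combining the pieces with a single rbind at the end would avoid both the repeated per-row subsetting and the repeated rbind calls inside the loop.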
Re: [R] Improving data processing efficiency
One thing that is likely to speed the code significantly is to create 'result' at its final size and then subscript into it. Something like:

    result[i, ] <- bestpeer

(though I'm not sure if 'i' is the proper index).

Patrick Burns
[EMAIL PROTECTED]
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and A Guide for the Unwilling S User)

Daniel Folkinshteyn wrote:
Anybody have any thoughts on this? Please? :)
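To make that concrete, here is a minimal, self-contained illustration of the preallocate-and-index idea (made-up sizes and data, not the poster's): allocate the result once, fill rows by position inside the loop, and drop the unused tail at the end.

    n_max  = 10000                                  # assumed upper bound on the number of matches
    result = matrix(NA_real_, nrow = n_max, ncol = 5)
    k = 0
    for (i in 1:2500) {                             # stands in for the loop over IPOs
        k = k + 1
        result[k, ] = rnorm(5)                      # stands in for as.matrix(bestpeer)
    }
    result = result[seq_len(k), , drop = FALSE]     # strip rows that were never filled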
Re: [R] Improving data processing efficiency
Try reading the posting guide before posting.

On Fri, Jun 6, 2008 at 11:12 AM, Daniel Folkinshteyn [EMAIL PROTECTED] wrote:
Anybody have any thoughts on this? Please? :)
Re: [R] Improving data processing efficiency
I did! What did I miss?

on 06/06/2008 11:45 AM Gabor Grothendieck said the following:
Try reading the posting guide before posting.
Re: [R] Improving data processing efficiency
It's summarized in the last line appended to every r-help message. Note "reproducible" and "minimal".

On Fri, Jun 6, 2008 at 12:03 PM, Daniel Folkinshteyn [EMAIL PROTECTED] wrote:
I did! What did I miss?
Re: [R] Improving data processing efficiency
That is the last line of every message to r-help.

On Fri, Jun 6, 2008 at 12:05 PM, Gabor Grothendieck [EMAIL PROTECTED] wrote:
It's summarized in the last line appended to every r-help message. Note "reproducible" and "minimal".
Re: [R] Improving data processing efficiency
I thought that since the function code (which I provided in full) was pretty short, it would be reasonably easy to just read the code and see what it's doing. But OK, so... I am attaching a zip file with a small sample of the data set (tab delimited) and the function code (the posting guidelines say some archive formats are allowed; I assume zip is one of them). Would appreciate your comments! :)

on 06/06/2008 12:05 PM Gabor Grothendieck said the following:
It's summarized in the last line appended to every r-help message. Note "reproducible" and "minimal".
Re: [R] Improving data processing efficiency
Thanks for the tip! I'll try that and see how big a difference it makes... If I am not sure exactly what the size will be, am I better off making it larger and later stripping off the blank rows, or making it smaller and appending the missing rows?

on 06/06/2008 11:44 AM Patrick Burns said the following:
One thing that is likely to speed the code significantly is to create 'result' at its final size and then subscript into it. Something like: result[i, ] <- bestpeer
Re: [R] Improving data processing efficiency
I think the posting guide may not be clear enough and have suggested that it be clarified. Hopefully this better communicates what is required and why in a shorter amount of space: https://stat.ethz.ch/pipermail/r-devel/2008-June/049891.html

On Fri, Jun 6, 2008 at 1:25 PM, Daniel Folkinshteyn [EMAIL PROTECTED] wrote:
I thought that since the function code (which I provided in full) was pretty short, it would be reasonably easy to just read the code and see what it's doing.
Re: [R] Improving data processing efficiency
Just in case, I uploaded it to the server; you can get the zip file I mentioned here: http://astro.temple.edu/~dfolkins/helplistfiles.zip

on 06/06/2008 01:25 PM Daniel Folkinshteyn said the following:
I am attaching a zip file with a small sample of the data set (tab delimited) and the function code.
Re: [R] Improving data processing efficiency
Ok, sorry about the zip, then. :) Thanks for taking the trouble to clue me in as to the best posting procedure! well, here's a dput-ed version of the small data subset you can use for testing. below that, an updated version of the function, with extra explanatory comments, and producing an extra column showing exactly what is matched to what. to test, just run the function, with the dataset as sole argument. Thanks again; i'd appreciate any input on this. === begin dataset dput representation structure(list(PERMNO = c(10001L, 10001L, 10298L, 10298L, 10484L, 10484L, 10515L, 10515L, 10634L, 10634L, 11048L, 11048L, 11237L, 11294L, 11294L, 11338L, 11338L, 11404L, 11404L, 11587L, 11587L, 11591L, 11591L, 11737L, 11737L, 11791L, 11809L, 11809L, 11858L, 11858L, 11955L, 11955L, 12003L, 12003L, 12016L, 12016L, 12223L, 12223L, 12758L, 12758L, 13688L, 13688L, 16117L, 16117L, 17770L, 17770L, 21514L, 21514L, 21792L, 21792L, 21821L, 21821L, 22437L, 22437L, 22947L, 22947L, 23027L, 23027L, 23182L, 23182L, 23536L, 23536L, 23712L, 23712L, 24053L, 24053L, 24117L, 24117L, 24256L, 24256L, 24299L, 24299L, 24352L, 24352L, 24379L, 24379L, 24467L, 24467L, 24679L, 24679L, 24870L, 24870L, 25056L, 25056L, 25208L, 25208L, 25232L, 25232L, 25241L, 25590L, 25590L, 26463L, 26463L, 26470L, 26470L, 26614L, 26614L, 27385L, 27385L, 29196L, 29196L, 30411L, 30411L, 32943L, 32943L, 38893L, 38893L, 40708L, 40708L, 41005L, 41005L, 42817L, 42817L, 42833L, 42833L, 43668L, 43668L, 45947L, 45947L, 46017L, 46017L, 48274L, 48274L, 49971L, 49971L, 53786L, 53786L, 53859L, 53859L, 54199L, 54199L, 56371L, 56952L, 56952L, 57277L, 57277L, 57381L, 57381L, 58202L, 58202L, 59395L, 59395L, 59935L, 60169L, 60169L, 61188L, 61188L, 61444L, 61444L, 62690L, 62690L, 62842L, 62842L, 64290L, 64290L, 64418L, 64418L, 64450L, 64450L, 64477L, 64477L, 64557L, 64557L, 64646L, 64646L, 64902L, 64902L, 67774L, 67774L, 68910L, 68910L, 70471L, 70471L, 74406L, 74406L, 75091L, 75091L, 75304L, 75304L, 75743L, 75964L, 75964L, 76026L, 76026L, 76162L, 76170L, 76173L, 78530L, 78530L, 78682L, 78682L, 81569L, 81569L, 82502L, 82502L, 83337L, 83337L, 83919L, 83919L, 88242L, 88242L, 90852L, 90852L, 91353L, 91353L ), DATE = c(19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900630, 19900331, 19900331, 19900630, 19900630, 19900331, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331, 19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900630, 19900331, 19900630, 19900630, 19900331, 19900630, 19900331, 
19900630, 19900331, 19900331, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900630, 19900331, 19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900630, 19900331, 19900630, 19900630, 19900630, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630), Shares.Owned = c(50100, 50100, 25, 293500, 3656629, 3827119, 4132439, 3566591, 2631193, 2500301, 775879, 816879, 38700, 1041600, 1070300, 533768, 558815, 61384492, 60466567, 194595, 196979, 359946, 314446, 106770, 107070, 20242, 1935098, 2099403, 1902125, 1766750, 41991, 41991, 34490, 36290, 589400, 596700, 1549395, 1759440, 854473, 762903, 156366785, 98780287, 2486389, 2635718, 122264, 122292, 25455916, 25458658, 71645490, 71855722, 30969596, 30409838, 2738576, 2814490, 20846605, 20930233, 1148299, 505415, 396388, 385714, 25239923, 24117950, 73465526, 73084616, 8096614, 7595742, 3937930, 3820215, 20884821, 19456342, 2127331, 2188276, 2334515, 2813347, 8267264, 8544084, 783277, 810742, 742048, 512956, 9659658, 9436873,
Re: [R] Improving data processing efficiency
That is going to be situation dependent, but if you have a reasonable upper bound, then that will be much easier and not far from optimal. If you pick the possibly too small route, then increasing the size in largish junks is much better than adding a row at a time.

Pat

Daniel Folkinshteyn wrote:
Thanks for the tip! I'll try that and see how big a difference it makes... If I am not sure exactly what the size will be, am I better off making it larger and later stripping off the blank rows, or making it smaller and appending the missing rows?
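As a rough sketch of the grow-in-largish-chunks idea (made-up sizes, not the poster's code): keep a fill counter and, whenever the preallocated block runs out, extend it by a whole chunk rather than a single row, then trim the unused rows at the end.

    chunk  = 5000                                    # assumed growth increment
    result = matrix(NA_real_, nrow = chunk, ncol = 5)
    k = 0
    for (i in 1:12000) {                             # stands in for the matching loop
        if (k + 1 > nrow(result))                    # out of space: add a whole chunk at once
            result = rbind(result, matrix(NA_real_, nrow = chunk, ncol = ncol(result)))
        k = k + 1
        result[k, ] = rnorm(ncol(result))            # stands in for as.matrix(bestpeer)
    }
    result = result[seq_len(k), , drop = FALSE]      # trim the unused tail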
Re: [R] Improving data processing efficiency
Cool, I do have an upper bound, so I'll try it and see how much of a speed boost it gives me. Thanks for the suggestion!

on 06/06/2008 02:03 PM Patrick Burns said the following:
That is going to be situation dependent, but if you have a reasonable upper bound, then that will be much easier and not far from optimal. If you pick the possibly too small route, then increasing the size in largish junks is much better than adding a row at a time.
Re: [R] Improving data processing efficiency
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Patrick Burns Sent: Friday, June 06, 2008 12:04 PM To: Daniel Folkinshteyn Cc: r-help@r-project.org Subject: Re: [R] Improving data processing efficiency That is going to be situation dependent, but if you have a reasonable upper bound, then that will be much easier and not far from optimal. If you pick the possibly too small route, then increasing the size in largish junks is much better than adding a row at a time.
Pat, I am unfamiliar with the use of the word "junk" as a unit of measure for data objects. I figure there are a few different possibilities:
1. You are using the term intentionally, meaning that you suggest he increases the size in terms of old cars and broken pianos rather than used-up pens and broken pencils.
2. This was a Freudian slip based on your opinion of some datasets you have seen.
3. Somewhere between your mind and the final product jumps/chunks became junks (possibly a Microsoft correction, or just typing too fast combined with number 2).
4. "junks" is an official measure of data/object size that I need to learn more about (the history of the term possibly being related to 2 and 3 above).
Please let it be #4, I would love to be able to tell some clients that I have received a junk of data from them.
-- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare [EMAIL PROTECTED] (801) 408-8111
Re: [R] Improving data processing efficiency
On Fri, Jun 6, 2008 at 2:28 PM, Greg Snow [EMAIL PROTECTED] wrote: [possibilities 1-4 snipped]
5. Chinese sailing vessel. http://en.wikipedia.org/wiki/Junk_(ship)
Re: [R] Improving data processing efficiency
My guess is that number 2 is closest to the mark. Typing too fast is unfortunately not one of my habitual attributes.
Gabor Grothendieck wrote: [quoted thread snipped]
Re: [R] Improving data processing efficiency
-Original Message- From: Gabor Grothendieck [mailto:[EMAIL PROTECTED] Sent: Friday, June 06, 2008 12:33 PM To: Greg Snow Cc: Patrick Burns; Daniel Folkinshteyn; r-help@r-project.org Subject: Re: [R] Improving data processing efficiency [quoted thread snipped] 5. Chinese sailing vessel. http://en.wikipedia.org/wiki/Junk_(ship)
Thanks for expanding my vocabulary (hmm, how am I going to use that word in context today?). So, if 5 is the case, then Pat's original statement can be reworded as: "If you pick the possibly too small route, then increasing the size in largish Chinese sailing vessels is much better than adding a row boat at a time." While that is probably true, I am not sure what that would mean in terms of the original data processing question.
-- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare [EMAIL PROTECTED] (801) 408-8111
Re: [R] Improving data processing efficiency
Hmm... ok... so i ran the code twice - once with a preallocated result, assigning rows to it, and once with a nrow=0 result, rbinding rows to it, for the first 20 quarters. There was no speedup. In fact, running with a preallocated result matrix was slower than rbinding to the matrix:
for preallocated matrix: Time difference of 1.59 mins
for rbinding: Time difference of 1.498628 mins
(the time difference only counts from the start of the loop till the end, so the time to allocate the empty matrix was /not/ included in the time count). So, it appears that rbinding a matrix is not the bottleneck. (That it was actually faster than assigning rows could have been a random anomaly (e.g. some other process eating a bit of cpu during the run?), or not - at any rate, it doesn't make an /appreciable/ difference.) Any other suggestions? :)
on 06/06/2008 02:03 PM Patrick Burns said the following: [earlier messages and original post snipped]
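For anyone who wants to reproduce the comparison, a self-contained way to time the two strategies side by side (toy data and sizes, not the poster's loop):

set.seed(1)
ncols <- 5
n <- 5000
newrow <- function() rnorm(ncols)

# strategy 1: grow the result with rbind()
t_rbind <- system.time({
    res1 <- matrix(nrow = 0, ncol = ncols)
    for (i in 1:n) res1 <- rbind(res1, newrow())
})

# strategy 2: preallocate and assign into row i
t_prealloc <- system.time({
    res2 <- matrix(NA_real_, nrow = n, ncol = ncols)
    for (i in 1:n) res2[i, ] <- newrow()
})

rbind(t_rbind, t_prealloc)   # elapsed times for the two strategies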
Re: [R] Improving data processing efficiency
On Fri, Jun 6, 2008 at 5:10 PM, Daniel Folkinshteyn [EMAIL PROTECTED] wrote: [benchmark results snipped] Any other suggestions? :)
Why not try profiling? The profr package provides an alternative display that I find more helpful than the default tools:
install.packages("profr")
library(profr)
p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(...))
plot(p)
That should at least help you see where the slow bits are. Hadley -- http://had.co.nz/
Re: [R] Improving data processing efficiency
thanks for the suggestions! I'll play with this over the weekend and see what comes out. :)
on 06/06/2008 06:48 PM Don MacQueen said the following: In a case like this, if you can possibly work with matrices instead of data frames, you might get significant speedup. (More accurately, I have had situations where I obtained speedup by working with matrices instead of data frames.) Even if you have to code character columns as numeric, it can be worth it. Data frames have overhead that matrices do not. (Here's where profiling might have given a clue.) Granted, there has been recent work in reducing the overhead associated with data frames, but I think it's worth a try. Carrying along extra columns and doing row subsetting, rbinding, etc, means a lot more things happening in memory. So, for example, if all of your matching is based just on a few columns, extract those columns, convert them to a matrix, do all the matching, and then based on some sort of row index retrieve all of the associated columns. -Don
At 2:09 PM -0400 6/5/08, Daniel Folkinshteyn wrote: [original post snipped]
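A rough sketch of Don's suggestion; the toy tfdata, the choice of match_cols and the matched_idx values are all invented for illustration, only the column names come from the posted function:

set.seed(1)
tfdata <- data.frame(DATE = rep(1:2, each = 4),
                     HSICIG = rep(1:2, 4),
                     Market.Cap.13f = runif(8),
                     IPO.Flag = rep(c(1, 0), 4),
                     Quarters.Since.Latest.Issue = sample(1:80, 8),
                     Other.Col = rnorm(8))        # stands in for the columns not used in matching

match_cols <- c("DATE", "HSICIG", "Market.Cap.13f",
                "IPO.Flag", "Quarters.Since.Latest.Issue")
m <- as.matrix(tfdata[, match_cols])              # all-numeric matrix: cheap row subsetting

# ... do the per-quarter / per-industry matching on 'm', recording row numbers ...
matched_idx <- c(2, 6)                            # pretend these rows turned out to be the matches

# only at the very end go back to the data frame for the full set of columns
matched_firms <- tfdata[matched_idx, ]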
Re: [R] Improving data processing efficiency
on 06/06/2008 06:55 PM hadley wickham said the following: Why not try profiling? The profr package provides an alternative display that I find more helpful than the default tools:
install.packages("profr")
library(profr)
p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(...))
plot(p)
That should at least help you see where the slow bits are.
i'll give it a try over the weekend! thanks!
Re: [R] Improving data processing efficiency
install.packages("profr")
library(profr)
p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(...))
plot(p)
That should at least help you see where the slow bits are. Hadley
so profiling reveals that '[.data.frame' and '[[.data.frame' and '[' are the biggest timesuckers... i suppose i'll try using matrices and see how that stacks up (since all my cols are numeric, should be a problem-free approach). but i'm really wondering if there isn't some neat vectorized approach i could use to avoid at least one of the nested loops...
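A small self-contained check of that profiling result - repeated single-row extraction is much cheaper on a matrix than on a data frame (the sizes here are arbitrary):

set.seed(1)
n <- 200000
df <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
m  <- as.matrix(df)
idx <- sample(n, 5000)

system.time(for (i in idx) x <- df[i, ])   # dispatches [.data.frame on every iteration
system.time(for (i in idx) x <- m[i, ])    # plain matrix indexing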
Re: [R] Improving data processing efficiency
Daniel, allow me to step off the party line here for a moment: in a problem like this it's better to code your function in C and then call it from R. You get a vast amount of performance improvement instantly. (From what I see, the process of recoding in C should be quite straightforward.) H.
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Daniel Folkinshteyn Sent: Friday, June 06, 2008 4:35 PM To: hadley wickham Cc: r-help@r-project.org; Patrick Burns Subject: Re: [R] Improving data processing efficiency [quoted message snipped]
Re: [R] Improving data processing efficiency
hadley wickham wrote:
Hi, I tried this suggestion as I am curious about bottlenecks in my own R code ...
Why not try profiling? The profr package provides an alternative display that I find more helpful than the default tools: install.packages("profr")
install.packages("profr")
Warning message: package ‘profr’ is not available
any ideas? Thanks, Esmail
library(profr)
p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(...))
plot(p)
That should at least help you see where the slow bits are. Hadley
Re: [R] Improving data processing efficiency
Esmail Bonakdarian wrote: [earlier message snipped] install.packages("profr") Warning message: package ‘profr’ is not available
I selected a different mirror in place of the Iowa one and it worked. Odd, I just assumed all the same packages are available on all mirrors.
Re: [R] Improving data processing efficiency
On Fri, 6 Jun 2008, Daniel Folkinshteyn wrote: [profiling discussion snipped] but i'm really wondering if there isn't some neat vectorized approach i could use to avoid at least one of the nested loops...
As far as a vectorized solution, I'll bet you could do ALL the lookups of non-issuers for all issuers with a single call to findInterval() (modulo some cleanup afterwards), but the trickery needed to do that would make your code a bit opaque. And in the end I doubt it would beat mapply() (read on...) by enough to make it worthwhile.
---
What you are doing is conditional on industry group and quarter. So using
indus.quarter <- with(tfdata, paste(as.character(DATE), as.character(HSICIG), sep = "."))
and then calls like this:
split( various, indus.quarter[ relevant.subset ] )
you can create:
- a list of all issuer market caps according to quarter and group,
- a list of all non-issuer caps (that satisfy your 'since quarter' restriction) according to quarter and group,
- a list of all non-issuer indexes (i.e. row numbers) that satisfy that restriction according to quarter and group.
Then you write a function that takes the elements of each list for a given quarter-industry group, looks up the matching non-issuers for each issuer, and returns their indexes. findInterval() will allow you to do this lookup for all issuers in one industry group in a given quarter simultaneously and greatly speed this process (but you will need to deal with the possible non-uniqueness of the non-issuer caps - perhaps by adding a tiny jitter() to the values). Then you feed the function and the lists to mapply(). The result is a list of indexes on the original data.frame. You can unsplit() this if you like, then use those indexes to build your final result data.frame.
HTH, Chuck
p.s. and if this all seems like too much work, you should at least avoid needlessly creating data.frames. Specifically, reorder things so that industrypeers = etc. is only done ONCE for each industry group by quarter combination, and change stuff like
nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0
to
any( industrypeers$Market.Cap.13f >= arow$Market.Cap.13f )
Charles C. Berry (858) 534-2098 Dept of Family/Preventive Medicine E mailto:[EMAIL PROTECTED] UC San Diego http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
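A compressed sketch of Chuck's split / findInterval / mapply recipe on toy data. The toy tfdata, the "> 40" eligibility cutoff and all object names other than the column names are invented, and ties in market cap are handled by taking whichever equal-cap peer findInterval lands on rather than by jitter():

set.seed(1)
tfdata <- data.frame(DATE = rep(1:3, each = 40),
                     HSICIG = sample(1:4, 120, replace = TRUE),
                     Market.Cap.13f = rlnorm(120),
                     IPO.Flag = rbinom(120, 1, 0.2),
                     Quarters.Since.Latest.Issue = sample(1:80, 120, replace = TRUE))

grp <- with(tfdata, paste(DATE, HSICIG, sep = "."))        # quarter x industry key

is_issuer <- tfdata$IPO.Flag == 1
is_peer   <- tfdata$IPO.Flag == 0 & tfdata$Quarters.Since.Latest.Issue > 40

issuer_caps <- split(tfdata$Market.Cap.13f[is_issuer], grp[is_issuer])
peer_caps   <- split(tfdata$Market.Cap.13f[is_peer],   grp[is_peer])
peer_rows   <- split(which(is_peer),                   grp[is_peer])

match_one_cell <- function(icaps, pcaps, prows) {
    o <- order(pcaps); pcaps <- pcaps[o]; prows <- prows[o]
    pos <- findInterval(icaps, pcaps)                        # peers with cap <= issuer cap
    hit <- pos >= 1 & pcaps[pmax(pos, 1)] == icaps           # an exactly equal cap counts as a match
    best <- ifelse(hit, pos, pmin(pos + 1, length(pcaps)))   # else next-bigger peer, else the biggest
    prows[best]
}

common <- intersect(names(issuer_caps), names(peer_caps))   # cells with both issuers and peers
matched_rows <- mapply(match_one_cell,
                       issuer_caps[common], peer_caps[common], peer_rows[common],
                       SIMPLIFY = FALSE)
matched <- tfdata[unlist(matched_rows), ]                    # matched non-issuer observations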
Re: [R] Improving data processing efficiency
install.packages("profr") Warning message: package 'profr' is not available I selected a different mirror in place of the Iowa one and it worked. Odd, I just assumed all the same packages are available on all mirrors.
The Iowa mirror is rather out of date as the guy who was looking after it passed away. Hadley -- http://had.co.nz/
[R] Improving data processing efficiency
Hi everyone! I have a question about data processing efficiency. My data are as follows: I have a data set on quarterly institutional ownership of equities; some of them have had recent IPOs, some have not (I have a binary flag set). The total dataset size is 700k+ rows. My goal is this: For every quarter since issue for each IPO, I need to find a matched firm in the same industry, and close in market cap. So, e.g., for firm X, which had an IPO, i need to find a matched non-issuing firm in quarter 1 since IPO, then a (possibly different) non-issuing firm in quarter 2 since IPO, etc. Repeat for each issuing firm (there are about 8300 of these). Thus it seems to me that I need to be doing a lot of data selection and subsetting, and looping (yikes!), but the result appears to be highly inefficient and takes ages (well, many hours). What I am doing, in pseudocode, is this: 1. for each quarter of data, getting out all the IPOs and all the eligible non-issuing firms. 2. for each IPO in a quarter, grab all the non-issuers in the same industry, sort them by size, and finally grab a matching firm closest in size (the exact procedure is to grab the closest bigger firm if one exists, and just the biggest available if all are smaller) 3. assign the matched firm-observation the same quarters since issue as the IPO being matched 4. rbind them all into the matching dataset. The function I currently have is pasted below, for your reference. Is there any way to make it produce the same result but much faster? Specifically, I am guessing eliminating some loops would be very good, but I don't see how, since I need to do some fancy footwork for each IPO in each quarter to find the matching firm. I'll be doing a few things similar to this, so it's somewhat important to up the efficiency of this. Maybe some of you R-fu masters can clue me in? :) I would appreciate any help, tips, tricks, tweaks, you name it! 
:) == my function below ===
fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, quarters_since_issue=40) {
    result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix is cheaper, so typecast the result to matrix
    colnames = names(tfdata)
    quarterends = sort(unique(tfdata$DATE))
    for (aquarter in quarterends) {
        tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
        tfdata_quarter_fitting_nonissuers = tfdata_quarter[ (tfdata_quarter$Quarters.Since.Latest.Issue >= quarters_since_issue) & (tfdata_quarter$IPO.Flag == 0), ]
        tfdata_quarter_ipoissuers = tfdata_quarter[ tfdata_quarter$IPO.Flag == 1, ]
        for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
            arow = tfdata_quarter_ipoissuers[i,]
            industrypeers = tfdata_quarter_fitting_nonissuers[ tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
            industrypeers = industrypeers[ order(industrypeers$Market.Cap.13f), ]
            if ( nrow(industrypeers) > 0 ) {
                if ( nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0 ) {
                    bestpeer = industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1,]
                } else {
                    bestpeer = industrypeers[nrow(industrypeers),]
                }
                bestpeer$Quarters.Since.IPO.Issue = arow$Quarters.Since.IPO.Issue
                #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] = 1
                result = rbind(result, as.matrix(bestpeer))
            }
        }
        #result = rbind(result, tfdata_quarter)
        print(aquarter)
    }
    result = as.data.frame(result)
    names(result) = colnames
    return(result)
}
= end of my function =
Re: [R] Improving data processing efficiency
Maybe you should provide minimal, working code with data, so that we all can give it a try. In the meantime: take a look at the Rprof function to see where your code can be improved. Good luck Bart
Daniel Folkinshteyn-2 wrote: [original post snipped]
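For what it's worth, base R profiling along the lines Bart suggests looks roughly like this; the output file name is arbitrary, and it assumes tfdata and the posted function are already loaded:

Rprof("match_profile.out")                 # start writing profiling samples to a file
result <- fcn_create_nonissuing_match_by_quarterssinceissue(tfdata)
Rprof(NULL)                                # stop profiling
summaryRprof("match_profile.out")$by.self  # functions ranked by self time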
Re: [R] Improving data processing efficiency
Thanks, I'll take a look at Rprof... but I think what i'm missing is facility with R idiom to get around the looping, and no amount of profiling will help me with that :) also, full working code is provided in my original post (see toward the bottom).
on 06/05/2008 03:43 PM bartjoosen said the following: [message and quoted original post snipped]