Re: [R] Improving data processing efficiency

2008-06-06 Thread Daniel Folkinshteyn

Anybody have any thoughts on this? Please? :)

on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:

Hi everyone!

I have a question about data processing efficiency.

My data are as follows: I have a data set on quarterly institutional 
ownership of equities; some of them have had recent IPOs, some have not 
(I have a binary flag set). The total dataset size is 700k+ rows.


My goal is this: For every quarter since issue for each IPO, I need to 
find a matched firm in the same industry, and close in market cap. So, 
e.g., for firm X, which had an IPO, I need to find a matched non-issuing 
firm in quarter 1 since IPO, then a (possibly different) non-issuing 
firm in quarter 2 since IPO, etc. Repeat for each issuing firm (there 
are about 8300 of these).


Thus it seems to me that I need to be doing a lot of data selection and 
subsetting, and looping (yikes!), but the result appears to be highly 
inefficient and takes ages (well, many hours). What I am doing, in 
pseudocode, is this:


1. for each quarter of data, getting out all the IPOs and all the 
eligible non-issuing firms.
2. for each IPO in a quarter, grab all the non-issuers in the same 
industry, sort them by size, and finally grab a matching firm closest in 
size (the exact procedure is to grab the closest bigger firm if one 
exists, and just the biggest available if all are smaller)
3. assign the matched firm-observation the same quarters since issue 
as the IPO being matched

4. rbind them all into the matching dataset.

The function I currently have is pasted below, for your reference. Is 
there any way to make it produce the same result but much faster? 
Specifically, I am guessing eliminating some loops would be very good, 
but I don't see how, since I need to do some fancy footwork for each IPO 
in each quarter to find the matching firm. I'll be doing a few things 
similar to this, so it's somewhat important to up the efficiency of 
this. Maybe some of you R-fu masters can clue me in? :)


I would appreciate any help, tips, tricks, tweaks, you name it! :)

== my function below ===

fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, quarters_since_issue=40) {

    result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix is cheaper, so typecast the result to matrix

    colnames = names(tfdata)

    quarterends = sort(unique(tfdata$DATE))

    for (aquarter in quarterends) {
        tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]

        # eligible matches: non-issuers whose latest issue is old enough
        tfdata_quarter_fitting_nonissuers = tfdata_quarter[
            (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
            (tfdata_quarter$IPO.Flag == 0), ]
        tfdata_quarter_ipoissuers = tfdata_quarter[ tfdata_quarter$IPO.Flag == 1, ]

        for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
            arow = tfdata_quarter_ipoissuers[i,]
            industrypeers = tfdata_quarter_fitting_nonissuers[
                tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
            industrypeers = industrypeers[ order(industrypeers$Market.Cap.13f), ]

            if ( nrow(industrypeers) > 0 ) {
                if ( nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0 ) {
                    # closest peer at least as large as the IPO firm
                    bestpeer = industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1,]
                }
                else {
                    # otherwise the largest peer available
                    bestpeer = industrypeers[nrow(industrypeers),]
                }
                bestpeer$Quarters.Since.IPO.Issue = arow$Quarters.Since.IPO.Issue

                #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] = 1

                result = rbind(result, as.matrix(bestpeer))
            }
        }
        #result = rbind(result, tfdata_quarter)
        print(aquarter)
    }

    result = as.data.frame(result)
    names(result) = colnames
    return(result)
}

= end of my function =
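
For illustration only (this sketch is not from the original post; it assumes the
same column names used in the function above: DATE, IPO.Flag, HSICIG,
Market.Cap.13f, Quarters.Since.Latest.Issue, Quarters.Since.IPO.Issue), one way
to avoid growing 'result' with rbind() is to subset issuers and eligible peers
once, pick each match by index, and build the output in a single subsetting step:

match_peers_sketch <- function(tfdata, quarters_since_issue = 40) {
    ipo   <- tfdata[tfdata$IPO.Flag == 1, ]
    peers <- tfdata[tfdata$IPO.Flag == 0 &
                    tfdata$Quarters.Since.Latest.Issue > quarters_since_issue, ]

    # row index (into 'peers') of the match for each IPO row; NA if none found
    pick_one <- function(i) {
        cand <- which(peers$DATE   == ipo$DATE[i] &
                      peers$HSICIG == ipo$HSICIG[i])
        if (length(cand) == 0) return(NA_integer_)
        caps   <- peers$Market.Cap.13f[cand]
        bigger <- cand[caps >= ipo$Market.Cap.13f[i]]
        if (length(bigger) > 0)
            bigger[which.min(peers$Market.Cap.13f[bigger])]  # closest peer at least as large
        else
            cand[which.max(caps)]                            # else the biggest available
    }
    idx <- vapply(seq_len(nrow(ipo)), pick_one, integer(1))

    matched <- peers[idx[!is.na(idx)], ]
    matched$Quarters.Since.IPO.Issue <- ipo$Quarters.Since.IPO.Issue[!is.na(idx)]
    matched
}

This still loops over IPO observations (via vapply), but it does no repeated
sorting and no incremental rbind(), which are the usual bottlenecks in code
like the above.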



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Improving data processing efficiency

2008-06-06 Thread Patrick Burns

One thing that is likely to speed the code significantly
is if you create 'result' to be its final size and then
subscript into it.  Something like:

  result[i, ] <- bestpeer

(though I'm not sure if 'i' is the proper index).

Patrick Burns
[EMAIL PROTECTED]
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and A Guide for the Unwilling S User)
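
To make the suggestion concrete, here is a minimal sketch (not Patrick's code)
of preallocating 'result' and filling it by index. It uses a running counter k
rather than i, since i restarts at 1 every quarter, and it assumes the number
of IPO rows is an upper bound on the number of matches:

fcn_match_prealloc_sketch <- function(tfdata, quarters_since_issue = 40) {
    n_max  <- sum(tfdata$IPO.Flag == 1)          # can never match more rows than this
    result <- tfdata[rep(NA_integer_, n_max), ]  # preallocate; keeps column types
    k <- 0

    for (aquarter in sort(unique(tfdata$DATE))) {
        # ... find 'bestpeer' for each IPO exactly as in the original loops ...
        # then, instead of result = rbind(result, as.matrix(bestpeer)):
        #     k <- k + 1
        #     result[k, ] <- bestpeer
    }

    result[seq_len(k), ]                         # drop the unused tail
}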



Re: [R] Improving data processing efficiency

2008-06-06 Thread Gabor Grothendieck
Try reading the posting guide before posting.



Re: [R] Improving data processing efficiency

2008-06-06 Thread Daniel Folkinshteyn

i did! what did i miss?

on 06/06/2008 11:45 AM Gabor Grothendieck said the following:

Try reading the posting guide before posting.



Re: [R] Improving data processing efficiency

2008-06-06 Thread Gabor Grothendieck
It's summarized in the last line to r-help.  Note "reproducible" and
"minimal".

On Fri, Jun 6, 2008 at 12:03 PM, Daniel Folkinshteyn [EMAIL PROTECTED] wrote:
 i did! what did i miss?



Re: [R] Improving data processing efficiency

2008-06-06 Thread Gabor Grothendieck
That is the last line of every message to r-help.

On Fri, Jun 6, 2008 at 12:05 PM, Gabor Grothendieck
[EMAIL PROTECTED] wrote:
 It's summarized in the last line to r-help.  Note "reproducible" and
 "minimal".



Re: [R] Improving data processing efficiency

2008-06-06 Thread Daniel Folkinshteyn
i thought since the function code (which i provided in full) was pretty 
short, it would be reasonably easy to just read the code and see what 
it's doing.


but ok, so... i am attaching a zip file with a small sample of the data 
set (tab delimited) and the function code (the posting guidelines claim 
that "some archive formats" are allowed; i assume zip is one of them...).


would appreciate your comments! :)

on 06/06/2008 12:05 PM Gabor Grothendieck said the following:

It's summarized in the last line to r-help.  Note "reproducible" and
"minimal".


Re: [R] Improving data processing efficiency

2008-06-06 Thread Daniel Folkinshteyn
thanks for the tip! i'll try that and see how big of a difference it 
makes... if i'm not sure exactly what the size will be, am i better off 
making it larger and then stripping off the blank rows later, or making 
it smaller and appending the missing rows?


on 06/06/2008 11:44 AM Patrick Burns said the following:

One thing that is likely to speed the code significantly
is if you create 'result' to be its final size and then
subscript into it.  Something like:

  result[i, ] <- bestpeer

(though I'm not sure if 'i' is the proper index).


Re: [R] Improving data processing efficiency

2008-06-06 Thread Gabor Grothendieck
I think the posting guide may not be clear enough and have suggested that
it be clarified.  Hopefully this better communicates what is required and why
in a shorter amount of space:

https://stat.ethz.ch/pipermail/r-devel/2008-June/049891.html



Re: [R] Improving data processing efficiency

2008-06-06 Thread Daniel Folkinshteyn
just in case, i uploaded it to the server; you can get the zip file i 
mentioned here:

http://astro.temple.edu/~dfolkins/helplistfiles.zip

Re: [R] Improving data processing efficiency

2008-06-06 Thread Daniel Folkinshteyn
Ok, sorry about the zip, then. :) Thanks for taking the trouble to clue 
me in as to the best posting procedure!


well, here's a dput-ed version of the small data subset you can use for 
testing. below that, an updated version of the function, with extra 
explanatory comments, which produces an extra column showing exactly what 
is matched to what.


to test, just run the function, with the dataset as sole argument.

Thanks again; i'd appreciate any input on this.

=== begin dataset dput representation 

structure(list(PERMNO = c(10001L, 10001L, 10298L, 10298L, 10484L,
10484L, 10515L, 10515L, 10634L, 10634L, 11048L, 11048L, 11237L,
11294L, 11294L, 11338L, 11338L, 11404L, 11404L, 11587L, 11587L,
11591L, 11591L, 11737L, 11737L, 11791L, 11809L, 11809L, 11858L,
11858L, 11955L, 11955L, 12003L, 12003L, 12016L, 12016L, 12223L,
12223L, 12758L, 12758L, 13688L, 13688L, 16117L, 16117L, 17770L,
17770L, 21514L, 21514L, 21792L, 21792L, 21821L, 21821L, 22437L,
22437L, 22947L, 22947L, 23027L, 23027L, 23182L, 23182L, 23536L,
23536L, 23712L, 23712L, 24053L, 24053L, 24117L, 24117L, 24256L,
24256L, 24299L, 24299L, 24352L, 24352L, 24379L, 24379L, 24467L,
24467L, 24679L, 24679L, 24870L, 24870L, 25056L, 25056L, 25208L,
25208L, 25232L, 25232L, 25241L, 25590L, 25590L, 26463L, 26463L,
26470L, 26470L, 26614L, 26614L, 27385L, 27385L, 29196L, 29196L,
30411L, 30411L, 32943L, 32943L, 38893L, 38893L, 40708L, 40708L,
41005L, 41005L, 42817L, 42817L, 42833L, 42833L, 43668L, 43668L,
45947L, 45947L, 46017L, 46017L, 48274L, 48274L, 49971L, 49971L,
53786L, 53786L, 53859L, 53859L, 54199L, 54199L, 56371L, 56952L,
56952L, 57277L, 57277L, 57381L, 57381L, 58202L, 58202L, 59395L,
59395L, 59935L, 60169L, 60169L, 61188L, 61188L, 61444L, 61444L,
62690L, 62690L, 62842L, 62842L, 64290L, 64290L, 64418L, 64418L,
64450L, 64450L, 64477L, 64477L, 64557L, 64557L, 64646L, 64646L,
64902L, 64902L, 67774L, 67774L, 68910L, 68910L, 70471L, 70471L,
74406L, 74406L, 75091L, 75091L, 75304L, 75304L, 75743L, 75964L,
75964L, 76026L, 76026L, 76162L, 76170L, 76173L, 78530L, 78530L,
78682L, 78682L, 81569L, 81569L, 82502L, 82502L, 83337L, 83337L,
83919L, 83919L, 88242L, 88242L, 90852L, 90852L, 91353L, 91353L
), DATE = c(19900331, 19900630, 19900630, 19900331, 19900331,
19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630,
19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630,
19900331, 19900630, 19900630, 19900331, 19900630, 19900331, 19900630,
19900630, 19900331, 19900630, 19900331, 19900331, 19900630, 19900630,
19900331, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630,
19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900630,
19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331,
19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900630,
19900331, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331,
19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331,
19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630,
19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331,
19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331,
19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630,
19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900331,
19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630,
19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331,
19900630, 19900331, 19900630, 19900630, 19900331, 19900331, 19900630,
19900630, 19900331, 19900630, 19900630, 19900331, 19900630, 19900331,
19900630, 19900331, 19900331, 19900630, 19900331, 19900331, 19900630,
19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331,
19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630,
19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331,
19900630, 19900630, 19900331, 19900630, 19900331, 19900331, 19900630,
19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900630,
19900331, 19900630, 19900331, 19900630, 19900630, 19900630, 19900630,
19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900630,
19900331, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630,
19900331, 19900630, 19900331, 19900630), Shares.Owned = c(50100,
50100, 25, 293500, 3656629, 3827119, 4132439, 3566591, 2631193,
2500301, 775879, 816879, 38700, 1041600, 1070300, 533768, 558815,
61384492, 60466567, 194595, 196979, 359946, 314446, 106770, 107070,
20242, 1935098, 2099403, 1902125, 1766750, 41991, 41991, 34490,
36290, 589400, 596700, 1549395, 1759440, 854473, 762903, 156366785,
98780287, 2486389, 2635718, 122264, 122292, 25455916, 25458658,
71645490, 71855722, 30969596, 30409838, 2738576, 2814490, 20846605,
20930233, 1148299, 505415, 396388, 385714, 25239923, 24117950,
73465526, 73084616, 8096614, 7595742, 3937930, 3820215, 20884821,
19456342, 2127331, 2188276, 2334515, 2813347, 8267264, 8544084,
783277, 810742, 742048, 512956, 9659658, 9436873, 

Re: [R] Improving data processing efficiency

2008-06-06 Thread Patrick Burns

That is going to be situation dependent, but if you
have a reasonable upper bound, then that will be
much easier and not far from optimal.

If you pick the possibly-too-small route, then increasing
the size in largish chunks is much better than adding
a row at a time.

Pat
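
A rough sketch of the grow-in-chunks route (again editorial, not Pat's code;
'chunk', 'k' and 'add_row' are illustrative names), for the case where no firm
upper bound is available:

chunk  <- 10000                              # initial allocation; just a guess
result <- tfdata[rep(NA_integer_, chunk), ]  # preallocated block of NA rows
k <- 0

add_row <- function(result, k, newrow) {
    if (k > nrow(result)) {
        # out of room: double the allocation in one go
        result <- rbind(result, result[rep(NA_integer_, nrow(result)), ])
    }
    result[k, ] <- newrow
    result
}

# inside the matching loop:
#   k <- k + 1
#   result <- add_row(result, k, bestpeer)
# after the loop, drop the unused tail:
#   result <- result[seq_len(k), ]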


Re: [R] Improving data processing efficiency

2008-06-06 Thread Daniel Folkinshteyn
Cool, I do have an upper bound, so I'll try it and see how much of a 
speed boost it gives me. Thanks for the suggestion!


on 06/06/2008 02:03 PM Patrick Burns said the following:

That is going to be situation dependent, but if you
have a reasonable upper bound, then that will be
much easier and not far from optimal.

If you pick the possibly too small route, then increasing
the size in largish junks is much better than adding
a row at a time.

Pat

Daniel Folkinshteyn wrote:
thanks for the tip! i'll try that and see how big of a difference that 
makes... if i am not sure what exactly the size will be, am i better 
off making it larger, and then later stripping off the blank rows, or 
making it smaller, and appending the missing rows?


on 06/06/2008 11:44 AM Patrick Burns said the following:

One thing that is likely to speed the code significantly
is if you create 'result' to be its final size and then
subscript into it.  Something like:

  result[i, ] - bestpeer

(though I'm not sure if 'i' is the proper index).

Patrick Burns
[EMAIL PROTECTED]
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and A Guide for the Unwilling S User)

Daniel Folkinshteyn wrote:

Anybody have any thoughts on this? Please? :)

on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:

Hi everyone!

I have a question about data processing efficiency.

My data are as follows: I have a data set on quarterly 
institutional ownership of equities; some of them have had recent 
IPOs, some have not (I have a binary flag set). The total dataset 
size is 700k+ rows.


My goal is this: For every quarter since issue for each IPO, I need 
to find a matched firm in the same industry, and close in market 
cap. So, e.g., for firm X, which had an IPO, i need to find a 
matched non-issuing firm in quarter 1 since IPO, then a (possibly 
different) non-issuing firm in quarter 2 since IPO, etc. Repeat for 
each issuing firm (there are about 8300 of these).


Thus it seems to me that I need to be doing a lot of data selection 
and subsetting, and looping (yikes!), but the result appears to be 
highly inefficient and takes ages (well, many hours). What I am 
doing, in pseudocode, is this:


1. for each quarter of data, getting out all the IPOs and all the 
eligible non-issuing firms.
2. for each IPO in a quarter, grab all the non-issuers in the same 
industry, sort them by size, and finally grab a matching firm 
closest in size (the exact procedure is to grab the closest bigger 
firm if one exists, and just the biggest available if all are smaller)
3. assign the matched firm-observation the same quarters since 
issue as the IPO being matched

4. rbind them all into the matching dataset.

The function I currently have is pasted below, for your reference. 
Is there any way to make it produce the same result but much 
faster? Specifically, I am guessing eliminating some loops would be 
very good, but I don't see how, since I need to do some fancy 
footwork for each IPO in each quarter to find the matching firm. 
I'll be doing a few things similar to this, so it's somewhat 
important to up the efficiency of this. Maybe some of you R-fu 
masters can clue me in? :)


I would appreciate any help, tips, tricks, tweaks, you name it! :)

== my function below ===

fcn_create_nonissuing_match_by_quarterssinceissue = 
function(tfdata, quarters_since_issue=40) {


result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix 
is cheaper, so typecast the result to matrix


colnames = names(tfdata)

quarterends = sort(unique(tfdata$DATE))

for (aquarter in quarterends) {
tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]

tfdata_quarter_fitting_nonissuers = tfdata_quarter[ 
(tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) 
& (tfdata_quarter$IPO.Flag == 0), ]
tfdata_quarter_ipoissuers = tfdata_quarter[ 
tfdata_quarter$IPO.Flag == 1, ]


for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
arow = tfdata_quarter_ipoissuers[i,]
industrypeers = tfdata_quarter_fitting_nonissuers[ 
tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
industrypeers = industrypeers[ 
order(industrypeers$Market.Cap.13f), ]

if ( nrow(industrypeers) > 0 ) {
if ( nrow(industrypeers[industrypeers$Market.Cap.13f >= 
arow$Market.Cap.13f, ]) > 0 ) {
bestpeer = industrypeers[industrypeers$Market.Cap.13f >= 
arow$Market.Cap.13f, ][1,]

}
else {
bestpeer = industrypeers[nrow(industrypeers),]
}
bestpeer$Quarters.Since.IPO.Issue = 
arow$Quarters.Since.IPO.Issue


#tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == 
bestpeer$PERMNO] = 1

result = rbind(result, as.matrix(bestpeer))
}
}
#result = rbind(result, tfdata_quarter)
print (aquarter)
}

result = 

Re: [R] Improving data processing efficiency

2008-06-06 Thread Greg Snow
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Patrick Burns
 Sent: Friday, June 06, 2008 12:04 PM
 To: Daniel Folkinshteyn
 Cc: r-help@r-project.org
 Subject: Re: [R] Improving data processing efficiency

 That is going to be situation dependent, but if you have a
 reasonable upper bound, then that will be much easier and not
 far from optimal.

 If you pick the possibly too small route, then increasing the
 size in largish junks is much better than adding a row at a time.

Pat,

I am unfamiliar with the use of the word junk as a unit of measure for data 
objects.  I figure there are a few different possibilities:

1. You are using the term intentionally meaning that you suggest he increases 
the size in terms of old cars and broken pianos rather than used up pens and 
broken pencils.

2. This was a Freudian slip based on your opinion of some datasets you have 
seen.

3. Somewhere between your mind and the final product jumps/chunks became 
junks (possibly a microsoft correction, or just typing too fast combined 
with number 2).

4. junks is an official measure of data/object size that I need to learn more 
about (the history of the term possibly being related to 2 and 3 above).

Please let it be #4, I would love to be able to tell some clients that I have 
received a junk of data from them.


--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111



Re: [R] Improving data processing efficiency

2008-06-06 Thread Gabor Grothendieck
On Fri, Jun 6, 2008 at 2:28 PM, Greg Snow [EMAIL PROTECTED] wrote:
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Patrick Burns
 Sent: Friday, June 06, 2008 12:04 PM
 To: Daniel Folkinshteyn
 Cc: r-help@r-project.org
 Subject: Re: [R] Improving data processing efficiency

 That is going to be situation dependent, but if you have a
 reasonable upper bound, then that will be much easier and not
 far from optimal.

 If you pick the possibly too small route, then increasing the
 size in largish junks is much better than adding a row at a time.

 Pat,

 I am unfamiliar with the use of the word junk as a unit of measure for data 
 objects.  I figure there are a few different possibilities:

 1. You are using the term intentionally meaning that you suggest he increases 
 the size in terms of old cars and broken pianos rather than used up pens and 
 broken pencils.

 2. This was a Freudian slip based on your opinion of some datasets you have 
 seen.

 3. Somewhere between your mind and the final product jumps/chunks became 
 junks (possibly a microsoft correction, or just typing too fast combined 
 with number 2).

 4. junks is an official measure of data/object size that I need to learn 
 more about (the history of the term possibly being related to 2 and 3 above).


5. Chinese sailing vessel.
http://en.wikipedia.org/wiki/Junk_(ship)



Re: [R] Improving data processing efficiency

2008-06-06 Thread Patrick Burns

My guess is that number 2 is closest to the mark.
Typing too fast is unfortunately not one of my
habitual attributes.

Gabor Grothendieck wrote:

On Fri, Jun 6, 2008 at 2:28 PM, Greg Snow [EMAIL PROTECTED] wrote:
  

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Patrick Burns
Sent: Friday, June 06, 2008 12:04 PM
To: Daniel Folkinshteyn
Cc: r-help@r-project.org
Subject: Re: [R] Improving data processing efficiency

That is going to be situation dependent, but if you have a
reasonable upper bound, then that will be much easier and not
far from optimal.

If you pick the possibly too small route, then increasing the
size in largish junks is much better than adding a row at a time.
  

Pat,

I am unfamiliar with the use of the word junk as a unit of measure for data 
objects.  I figure there are a few different possibilities:

1. You are using the term intentionally meaning that you suggest he increases 
the size in terms of old cars and broken pianos rather than used up pens and 
broken pencils.

2. This was a Freudian slip based on your opinion of some datasets you have 
seen.

3. Somewhere between your mind and the final product jumps/chunks became junks 
(possibly a microsoft correction, or just typing too fast combined with number 2).

4. junks is an official measure of data/object size that I need to learn more 
about (the history of the term possibly being related to 2 and 3 above).




5. Chinese sailing vessel.
http://en.wikipedia.org/wiki/Junk_(ship)








Re: [R] Improving data processing efficiency

2008-06-06 Thread Greg Snow
 -Original Message-
 From: Gabor Grothendieck [mailto:[EMAIL PROTECTED]
 Sent: Friday, June 06, 2008 12:33 PM
 To: Greg Snow
 Cc: Patrick Burns; Daniel Folkinshteyn; r-help@r-project.org
 Subject: Re: [R] Improving data processing efficiency

 On Fri, Jun 6, 2008 at 2:28 PM, Greg Snow [EMAIL PROTECTED] wrote:
  -Original Message-
  From: [EMAIL PROTECTED]
  [mailto:[EMAIL PROTECTED] On Behalf Of Patrick Burns
  Sent: Friday, June 06, 2008 12:04 PM
  To: Daniel Folkinshteyn
  Cc: r-help@r-project.org
  Subject: Re: [R] Improving data processing efficiency
 
  That is going to be situation dependent, but if you have a
 reasonable
  upper bound, then that will be much easier and not far
 from optimal.
 
  If you pick the possibly too small route, then increasing
 the size in
  largish junks is much better than adding a row at a time.
 
  Pat,
 
  I am unfamiliar with the use of the word junk as a unit
 of measure for data objects.  I figure there are a few
 different possibilities:
 
  1. You are using the term intentionally meaning that you
 suggest he increases the size in terms of old cars and broken
 pianos rather than used up pens and broken pencils.
 
  2. This was a Freudian slip based on your opinion of some
 datasets you have seen.
 
  3. Somewhere between your mind and the final product
 jumps/chunks became junks (possibly a microsoft
 correction, or just typing too fast combined with number 2).
 
  4. junks is an official measure of data/object size that
 I need to learn more about (the history of the term possibly
 being related to 2 and 3 above).
 

 5. Chinese sailing vessel.
 http://en.wikipedia.org/wiki/Junk_(ship)


Thanks for expanding my vocabulary (hmm, how am I going to use that word in 
context today?).

So, if 5 is the case, then Pat's original statement can be reworded as:

If you pick the possibly too small route, then increasing the size in
largish Chinese sailing vessels is much better than adding a row boat at a 
time.

While that is probably true, I am not sure what that would mean in terms of the 
original data processing question.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111



Re: [R] Improving data processing efficiency

2008-06-06 Thread Daniel Folkinshteyn
Hmm... ok... so i ran the code twice - once with a preallocated result, 
assigning rows to it, and once with a nrow=0 result, rbinding rows to 
it, for the first 20 quarters. There was no speedup. In fact, running 
with a preallocated result matrix was slower than rbinding to the matrix:


for preallocated matrix:
Time difference of 1.59 mins

for rbinding:
Time difference of 1.498628 mins

(the time difference only counts from the start of the loop til the end, 
so the time to allocate the empty matrix was /not/ included in the time 
count).


So, it appears that rbinding to the matrix is not the bottleneck. (That it 
was actually faster than assigning rows could have been a random anomaly, 
e.g. some other process eating a bit of cpu during the run, or not; 
at any rate, it doesn't make an /appreciable/ difference.)


Any other suggestions? :)
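
(For anyone who wants to reproduce the comparison in isolation, a small 
self-contained benchmark of the two strategies; n and k are toy sizes, not 
the real data dimensions:)

n <- 2000; k <- 10
system.time({                                      # grow by rbind(), one row at a time
    r1 <- matrix(nrow = 0, ncol = k)
    for (i in 1:n) r1 <- rbind(r1, rnorm(k))
})
system.time({                                      # preallocate and subscript
    r2 <- matrix(NA_real_, nrow = n, ncol = k)
    for (i in 1:n) r2[i, ] <- rnorm(k)
})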

on 06/06/2008 02:03 PM Patrick Burns said the following:

That is going to be situation dependent, but if you
have a reasonable upper bound, then that will be
much easier and not far from optimal.

If you pick the possibly too small route, then increasing
the size in largish junks is much better than adding
a row at a time.

Pat

Daniel Folkinshteyn wrote:
thanks for the tip! i'll try that and see how big of a difference that 
makes... if i am not sure what exactly the size will be, am i better 
off making it larger, and then later stripping off the blank rows, or 
making it smaller, and appending the missing rows?


on 06/06/2008 11:44 AM Patrick Burns said the following:

One thing that is likely to speed the code significantly
is if you create 'result' to be its final size and then
subscript into it.  Something like:

  result[i, ] <- bestpeer

(though I'm not sure if 'i' is the proper index).

Patrick Burns
[EMAIL PROTECTED]
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and A Guide for the Unwilling S User)

Daniel Folkinshteyn wrote:

Anybody have any thoughts on this? Please? :)

on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:

Hi everyone!

I have a question about data processing efficiency.

My data are as follows: I have a data set on quarterly 
institutional ownership of equities; some of them have had recent 
IPOs, some have not (I have a binary flag set). The total dataset 
size is 700k+ rows.


My goal is this: For every quarter since issue for each IPO, I need 
to find a matched firm in the same industry, and close in market 
cap. So, e.g., for firm X, which had an IPO, i need to find a 
matched non-issuing firm in quarter 1 since IPO, then a (possibly 
different) non-issuing firm in quarter 2 since IPO, etc. Repeat for 
each issuing firm (there are about 8300 of these).


Thus it seems to me that I need to be doing a lot of data selection 
and subsetting, and looping (yikes!), but the result appears to be 
highly inefficient and takes ages (well, many hours). What I am 
doing, in pseudocode, is this:


1. for each quarter of data, getting out all the IPOs and all the 
eligible non-issuing firms.
2. for each IPO in a quarter, grab all the non-issuers in the same 
industry, sort them by size, and finally grab a matching firm 
closest in size (the exact procedure is to grab the closest bigger 
firm if one exists, and just the biggest available if all are smaller)
3. assign the matched firm-observation the same quarters since 
issue as the IPO being matched

4. rbind them all into the matching dataset.

The function I currently have is pasted below, for your reference. 
Is there any way to make it produce the same result but much 
faster? Specifically, I am guessing eliminating some loops would be 
very good, but I don't see how, since I need to do some fancy 
footwork for each IPO in each quarter to find the matching firm. 
I'll be doing a few things similar to this, so it's somewhat 
important to up the efficiency of this. Maybe some of you R-fu 
masters can clue me in? :)


I would appreciate any help, tips, tricks, tweaks, you name it! :)

== my function below ===

fcn_create_nonissuing_match_by_quarterssinceissue = 
function(tfdata, quarters_since_issue=40) {


result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix 
is cheaper, so typecast the result to matrix


colnames = names(tfdata)

quarterends = sort(unique(tfdata$DATE))

for (aquarter in quarterends) {
tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]

tfdata_quarter_fitting_nonissuers = tfdata_quarter[ 
(tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) 
& (tfdata_quarter$IPO.Flag == 0), ]
tfdata_quarter_ipoissuers = tfdata_quarter[ 
tfdata_quarter$IPO.Flag == 1, ]


for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
arow = tfdata_quarter_ipoissuers[i,]
industrypeers = tfdata_quarter_fitting_nonissuers[ 
tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
industrypeers = industrypeers[ 
order(industrypeers$Market.Cap.13f), ]


Re: [R] Improving data processing efficiency

2008-06-06 Thread hadley wickham
On Fri, Jun 6, 2008 at 5:10 PM, Daniel Folkinshteyn [EMAIL PROTECTED] wrote:
 Hmm... ok... so i ran the code twice - once with a preallocated result,
 assigning rows to it, and once with a nrow=0 result, rbinding rows to it,
 for the first 20 quarters. There was no speedup. In fact, running with a
 preallocated result matrix was slower than rbinding to the matrix:

 for preallocated matrix:
 Time difference of 1.59 mins

 for rbinding:
 Time difference of 1.498628 mins

 (the time difference only counts from the start of the loop til the end, so
 the time to allocate the empty matrix was /not/ included in the time count).

 So, it appears that rbinding to the matrix is not the bottleneck. (That it was
 actually faster than assigning rows could have been a random anomaly, e.g.
 some other process eating a bit of cpu during the run, or not; at any
 rate, it doesn't make an /appreciable/ difference.)

Why not try profiling?  The profr package provides an alternative
display that I find more helpful than the default tools:

install.packages("profr")
library(profr)
p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(...))
plot(p)

That should at least help you see where the slow bits are.

Hadley

-- 
http://had.co.nz/



Re: [R] Improving data processing efficiency

2008-06-06 Thread Daniel Folkinshteyn
thanks for the suggestions! I'll play with this over the weekend and see 
what comes out. :)


on 06/06/2008 06:48 PM Don MacQueen said the following:
In a case like this, if you can possibly work with matrices instead of 
data frames, you might get significant speedup.
(More accurately, I have had situations where I obtained speed up by 
working with matrices instead of dataframes.)

Even if you have to code character columns as numeric, it can be worth it.

Data frames have overhead that matrices do not. (Here's where profiling 
might have given a clue) Granted, there has been recent work in reducing 
the overhead associated with dataframes, but I think it's worth a try. 
Carrying along extra columns and doing row subsetting, rbinding, etc, 
means a lot more things happening in memory.


So, for example, if all of your matching is based just on a few columns, 
extract those columns, convert them to a matrix, do all the matching, 
and then based on some sort of row index retrieve all of the associated 
columns.


-Don
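
(A minimal sketch of that idea, reusing the column names from the original 
function but with toy values; the final cutoff is purely illustrative:)

tf <- data.frame(DATE = rep(1:2, each = 3),
                 HSICIG = c(10, 10, 20, 10, 20, 20),
                 Market.Cap.13f = c(5, 9, 2, 7, 3, 8),
                 IPO.Flag = c(0, 1, 0, 0, 1, 0))
m <- as.matrix(tf[, c("DATE", "HSICIG", "Market.Cap.13f", "IPO.Flag")])
# ... do the subsetting / sorting / matching on 'm' (cheap row indexing) ...
keep <- which(m[, "IPO.Flag"] == 0 & m[, "Market.Cap.13f"] >= 5)
tf[keep, ]                                         # retrieve the full rows at the end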

At 2:09 PM -0400 6/5/08, Daniel Folkinshteyn wrote:

Hi everyone!

I have a question about data processing efficiency.

My data are as follows: I have a data set on quarterly institutional 
ownership of equities; some of them have had recent IPOs, some have 
not (I have a binary flag set). The total dataset size is 700k+ rows.


My goal is this: For every quarter since issue for each IPO, I need to 
find a matched firm in the same industry, and close in market cap. 
So, e.g., for firm X, which had an IPO, i need to find a matched 
non-issuing firm in quarter 1 since IPO, then a (possibly different) 
non-issuing firm in quarter 2 since IPO, etc. Repeat for each issuing 
firm (there are about 8300 of these).


Thus it seems to me that I need to be doing a lot of data selection 
and subsetting, and looping (yikes!), but the result appears to be 
highly inefficient and takes ages (well, many hours). What I am doing, 
in pseudocode, is this:


1. for each quarter of data, getting out all the IPOs and all the 
eligible non-issuing firms.
2. for each IPO in a quarter, grab all the non-issuers in the same 
industry, sort them by size, and finally grab a matching firm closest 
in size (the exact procedure is to grab the closest bigger firm if one 
exists, and just the biggest available if all are smaller)
3. assign the matched firm-observation the same quarters since issue 
as the IPO being matched

4. rbind them all into the matching dataset.

The function I currently have is pasted below, for your reference. Is 
there any way to make it produce the same result but much faster? 
Specifically, I am guessing eliminating some loops would be very good, 
but I don't see how, since I need to do some fancy footwork for each 
IPO in each quarter to find the matching firm. I'll be doing a few 
things similar to this, so it's somewhat important to up the 
efficiency of this. Maybe some of you R-fu masters can clue me in? :)


I would appreciate any help, tips, tricks, tweaks, you name it! :)

== my function below ===

fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, 
quarters_since_issue=40) {


result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix is 
cheaper, so typecast the result to matrix


colnames = names(tfdata)

quarterends = sort(unique(tfdata$DATE))

for (aquarter in quarterends) {
tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]

tfdata_quarter_fitting_nonissuers = tfdata_quarter[ 
(tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) & 
(tfdata_quarter$IPO.Flag == 0), ]
tfdata_quarter_ipoissuers = tfdata_quarter[ 
tfdata_quarter$IPO.Flag == 1, ]


for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
arow = tfdata_quarter_ipoissuers[i,]
industrypeers = tfdata_quarter_fitting_nonissuers[ 
tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
industrypeers = industrypeers[ 
order(industrypeers$Market.Cap.13f), ]

if ( nrow(industrypeers) > 0 ) {
if ( nrow(industrypeers[industrypeers$Market.Cap.13f 
>= arow$Market.Cap.13f, ]) > 0 ) {
bestpeer = 
industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1,]

}
else {
bestpeer = industrypeers[nrow(industrypeers),]
}
bestpeer$Quarters.Since.IPO.Issue = 
arow$Quarters.Since.IPO.Issue


#tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == 
bestpeer$PERMNO] = 1

result = rbind(result, as.matrix(bestpeer))
}
}
#result = rbind(result, tfdata_quarter)
print (aquarter)
}

result = as.data.frame(result)
names(result) = colnames
return(result)

}

= end of my function =


Re: [R] Improving data processing efficiency

2008-06-06 Thread Daniel Folkinshteyn

on 06/06/2008 06:55 PM hadley wickham said the following:

Why not try profiling?  The profr package provides an alternative
display that I find more helpful than the default tools:

install.packages("profr")
library(profr)
p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(...))
plot(p)

That should at least help you see where the slow bits are.


i'll give it a try over the weekend! thanks!



Re: [R] Improving data processing efficiency

2008-06-06 Thread Daniel Folkinshteyn

install.packages("profr")
library(profr)
p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(...))
plot(p)

That should at least help you see where the slow bits are.

Hadley

so profiling reveals that '[.data.frame' and '[[.data.frame' and '[' are 
the biggest timesuckers...


i suppose i'll try using matrices and see how that stacks up (since all 
my cols are numeric, should be a problem-free approach).


but i'm really wondering if there isn't some neat vectorized approach i 
could use to avoid at least one of the nested loops...




Re: [R] Improving data processing efficiency

2008-06-06 Thread Horace Tso
Daniel, allow me to step off the party line here for a moment: in a problem 
like this it's better to code your function in C and then call it from R. You 
get a vast amount of performance improvement instantly. (From what I see, the 
process of recoding in C should be quite straightforward.)

H.
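
(A rough sketch of what that route could look like; the file peermatch.c, the 
function, and the toy numbers are illustrative assumptions, and it needs a 
working compiler toolchain:)

# --- peermatch.c (save separately, build with: R CMD SHLIB peermatch.c) ---
# void peermatch(double *caps, int *n, double *target, int *best) {
#     int i;
#     *best = 0;                     /* 1-based index of chosen peer, 0 = none */
#     for (i = 0; i < *n; i++)
#         if (caps[i] >= *target) { *best = i + 1; break; }
#     if (*best == 0 && *n > 0) *best = *n;  /* all smaller: take the biggest */
# }
# --------------------------------------------------------------------------
dyn.load("peermatch.so")             # "peermatch.dll" on Windows
caps <- sort(c(1.2, 3.5, 7.9))       # toy non-issuer caps, sorted ascending
out <- .C("peermatch", as.double(caps), as.integer(length(caps)),
          as.double(4.0), best = integer(1))
out$best                             # index of the matched peer (here: 3)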

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Daniel 
Folkinshteyn
Sent: Friday, June 06, 2008 4:35 PM
To: hadley wickham
Cc: r-help@r-project.org; Patrick Burns
Subject: Re: [R] Improving data processing efficiency

 install.packages("profr")
 library(profr)
 p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(...))
 plot(p)

 That should at least help you see where the slow bits are.

 Hadley

so profiling reveals that '[.data.frame' and '[[.data.frame' and '[' are
the biggest timesuckers...

i suppose i'll try using matrices and see how that stacks up (since all
my cols are numeric, should be a problem-free approach).

but i'm really wondering if there isn't some neat vectorized approach i
could use to avoid at least one of the nested loops...




Re: [R] Improving data processing efficiency

2008-06-06 Thread Esmail Bonakdarian

hadley wickham wrote:




Hi,

I tried this suggestion as I am curious about bottlenecks in my own
R code ...


Why not try profiling?  The profr package provides an alternative
display that I find more helpful than the default tools:

install.packages("profr")


 install.packages("profr")
Warning message:
package ‘profr’ is not available


any ideas?

Thanks,
Esmail



library(profr)
p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(...))
plot(p)

That should at least help you see where the slow bits are.

Hadley





Re: [R] Improving data processing efficiency

2008-06-06 Thread Esmail Bonakdarian

Esmail Bonakdarian wrote:

hadley wickham wrote:




Hi,

I tried this suggestion as I am curious about bottlenecks in my own
R code ...


Why not try profiling?  The profr package provides an alternative
display that I find more helpful than the default tools:

install.packages("profr")


  install.packages("profr")
Warning message:
package ‘profr’ is not available


I selected a different mirror in place of the Iowa one and it
worked. Odd, I just assumed all the same packages are available
on all mirrors.



Re: [R] Improving data processing efficiency

2008-06-06 Thread Charles C. Berry

On Fri, 6 Jun 2008, Daniel Folkinshteyn wrote:


 install.packages("profr")
 library(profr)
 p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(...))
 plot(p)

 That should at least help you see where the slow bits are.

 Hadley

so profiling reveals that '[.data.frame' and '[[.data.frame' and '[' are the 
biggest timesuckers...


i suppose i'll try using matrices and see how that stacks up (since all my 
cols are numeric, should be a problem-free approach).


but i'm really wondering if there isn't some neat vectorized approach i could 
use to avoid at least one of the nested loops...





As far as a vectorized solution, I'll bet you could do ALL the lookups of 
non-issuers for all issuers with a single call to findInterval() (modulo 
some cleanup afterwards) , but the trickery needed to do that would make 
your code a bit opaque.


And in the end I doubt it would beat mapply() (read on...) by enough to 
make it worthwhile.


---

What you are doing is conditional on industry group and quarter.

So using

indus.quarter <- with(tfdat,
paste(as.character(DATE), as.character(HSICIG), sep="."))

and then calls like this:

split( various , indus.quarter[ relevant.subset ] )

you can create:

a list of all issuer market caps according to quarter and group,

a list of all non-issuer caps (that satisfy your 'since quarter'
restriction) according to quarter and group,

a list of all non issuer indexes (i.e. row numbers) that satisfy
that restriction according to quarter and group

Then you write a function that takes the elements of each list for a given 
quarter-industry group, looks up the matching non-issuers for each issuer, 
and returns their indexes.


findInterval() will allow you to do this lookup for all issuers in one 
industry group in a given quarter simultaneously and greatly speed this 
process (but you will need to deal with the possible non-uniqueness of the 
non-issuer caps - perhaps by adding a tiny jitter() to the values).


Then you feed the function and the lists to mapply().

The result is a list of indexes on the original data.frame. You can 
unsplit() this if you like, then use those indexes to build your final 
result data.frame.


HTH,

Chuck


p.s. and if this all seems like too much work, you should at least avoid 
needlessly creating data.frames. Specifically


reorder things so that

   industrypeers = etc

is only done ONCE for each industry group by quarter combination and 
change stuff like


nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0

to

any( industrypeers$Market.Cap.13f >= arow$Market.Cap.13f )
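
(To make the findInterval() idea concrete, a small sketch for one industry group 
in one quarter; the numbers are toy values, and the tie/edge handling follows the 
original rule of taking the closest peer with cap >= the issuer's, else the biggest:)

caps <- sort(c(5, 12, 40, 95))          # non-issuer caps, ascending
issuer_caps <- c(3, 12, 50, 200)        # issuer caps, all matched in one call
n <- length(caps)
pos <- findInterval(issuer_caps, caps)  # largest index with caps[pos] <= issuer cap
idx <- ifelse(pos == 0, 1,                             # all peers bigger: take the smallest
       ifelse(caps[pmax(pos, 1)] == issuer_caps, pos,  # an exact tie is acceptable
              pmin(pos + 1, n)))                       # next bigger peer, else the biggest
caps[idx]                               # matched peer caps: 5 12 95 95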









Charles C. Berry(858) 534-2098
Dept of Family/Preventive Medicine
E mailto:[EMAIL PROTECTED]  UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901



Re: [R] Improving data processing efficiency

2008-06-06 Thread hadley wickham
   install.packages("profr")
 Warning message:
 package 'profr' is not available

 I selected a different mirror in place of the Iowa one and it
 worked. Odd, I just assumed all the same packages are available
 on all mirrors.

The Iowa mirror is rather out of date as the guy who was looking after
it passed away.

Hadley


-- 
http://had.co.nz/



[R] Improving data processing efficiency

2008-06-05 Thread Daniel Folkinshteyn

Hi everyone!

I have a question about data processing efficiency.

My data are as follows: I have a data set on quarterly institutional 
ownership of equities; some of them have had recent IPOs, some have not 
(I have a binary flag set). The total dataset size is 700k+ rows.


My goal is this: For every quarter since issue for each IPO, I need to 
find a matched firm in the same industry, and close in market cap. So, 
e.g., for firm X, which had an IPO, i need to find a matched non-issuing 
firm in quarter 1 since IPO, then a (possibly different) non-issuing 
firm in quarter 2 since IPO, etc. Repeat for each issuing firm (there 
are about 8300 of these).


Thus it seems to me that I need to be doing a lot of data selection and 
subsetting, and looping (yikes!), but the result appears to be highly 
inefficient and takes ages (well, many hours). What I am doing, in 
pseudocode, is this:


1. for each quarter of data, getting out all the IPOs and all the 
eligible non-issuing firms.
2. for each IPO in a quarter, grab all the non-issuers in the same 
industry, sort them by size, and finally grab a matching firm closest in 
size (the exact procedure is to grab the closest bigger firm if one 
exists, and just the biggest available if all are smaller)
3. assign the matched firm-observation the same quarters since issue 
as the IPO being matched

4. rbind them all into the matching dataset.

The function I currently have is pasted below, for your reference. Is 
there any way to make it produce the same result but much faster? 
Specifically, I am guessing eliminating some loops would be very good, 
but I don't see how, since I need to do some fancy footwork for each IPO 
in each quarter to find the matching firm. I'll be doing a few things 
similar to this, so it's somewhat important to up the efficiency of 
this. Maybe some of you R-fu masters can clue me in? :)


I would appreciate any help, tips, tricks, tweaks, you name it! :)

== my function below ===

fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, 
quarters_since_issue=40) {


result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix is 
cheaper, so typecast the result to matrix


colnames = names(tfdata)

quarterends = sort(unique(tfdata$DATE))

for (aquarter in quarterends) {
tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]

tfdata_quarter_fitting_nonissuers = tfdata_quarter[ 
(tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) & 
(tfdata_quarter$IPO.Flag == 0), ]
tfdata_quarter_ipoissuers = tfdata_quarter[ 
tfdata_quarter$IPO.Flag == 1, ]


for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
arow = tfdata_quarter_ipoissuers[i,]
industrypeers = tfdata_quarter_fitting_nonissuers[ 
tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
industrypeers = industrypeers[ 
order(industrypeers$Market.Cap.13f), ]

if ( nrow(industrypeers) > 0 ) {
if ( nrow(industrypeers[industrypeers$Market.Cap.13f >= 
arow$Market.Cap.13f, ]) > 0 ) {
bestpeer = 
industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1,]

}
else {
bestpeer = industrypeers[nrow(industrypeers),]
}
bestpeer$Quarters.Since.IPO.Issue = 
arow$Quarters.Since.IPO.Issue


#tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == 
bestpeer$PERMNO] = 1

result = rbind(result, as.matrix(bestpeer))
}
}
#result = rbind(result, tfdata_quarter)
print (aquarter)
}

result = as.data.frame(result)
names(result) = colnames
return(result)

}

= end of my function =



Re: [R] Improving data processing efficiency

2008-06-05 Thread bartjoosen

Maybe you should provide minimal, working code with data, so that we all
can give it a try.
In the mean time: take a look at the Rprof function to see where your code
can be improved.

Good luck

Bart
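
(Basic Rprof() usage for reference; the output file name and the stand-in 
workload are illustrative:)

Rprof("profile.out")                    # start collecting timing samples
x <- replicate(200, sort(rnorm(1e4)))   # stand-in for the slow code
Rprof(NULL)                             # stop profiling
summaryRprof("profile.out")$by.self     # see where the time went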


Daniel Folkinshteyn-2 wrote:
 
 Hi everyone!
 
 I have a question about data processing efficiency.
 
 My data are as follows: I have a data set on quarterly institutional 
 ownership of equities; some of them have had recent IPOs, some have not 
 (I have a binary flag set). The total dataset size is 700k+ rows.
 
 My goal is this: For every quarter since issue for each IPO, I need to 
 find a matched firm in the same industry, and close in market cap. So, 
 e.g., for firm X, which had an IPO, i need to find a matched non-issuing 
 firm in quarter 1 since IPO, then a (possibly different) non-issuing 
 firm in quarter 2 since IPO, etc. Repeat for each issuing firm (there 
 are about 8300 of these).
 
 Thus it seems to me that I need to be doing a lot of data selection and 
 subsetting, and looping (yikes!), but the result appears to be highly 
 inefficient and takes ages (well, many hours). What I am doing, in 
 pseudocode, is this:
 
 1. for each quarter of data, getting out all the IPOs and all the 
 eligible non-issuing firms.
 2. for each IPO in a quarter, grab all the non-issuers in the same 
 industry, sort them by size, and finally grab a matching firm closest in 
 size (the exact procedure is to grab the closest bigger firm if one 
 exists, and just the biggest available if all are smaller)
 3. assign the matched firm-observation the same quarters since issue 
 as the IPO being matched
 4. rbind them all into the matching dataset.
 
 The function I currently have is pasted below, for your reference. Is 
 there any way to make it produce the same result but much faster? 
 Specifically, I am guessing eliminating some loops would be very good, 
 but I don't see how, since I need to do some fancy footwork for each IPO 
 in each quarter to find the matching firm. I'll be doing a few things 
 similar to this, so it's somewhat important to up the efficiency of 
 this. Maybe some of you R-fu masters can clue me in? :)
 
 I would appreciate any help, tips, tricks, tweaks, you name it! :)
 
 == my function below ===
 
 fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, 
 quarters_since_issue=40) {
 
  result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix is 
 cheaper, so typecast the result to matrix
 
  colnames = names(tfdata)
 
  quarterends = sort(unique(tfdata$DATE))
 
  for (aquarter in quarterends) {
  tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
 
  tfdata_quarter_fitting_nonissuers = tfdata_quarter[ 
 (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) & 
 (tfdata_quarter$IPO.Flag == 0), ]
  tfdata_quarter_ipoissuers = tfdata_quarter[ 
 tfdata_quarter$IPO.Flag == 1, ]
 
  for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
  arow = tfdata_quarter_ipoissuers[i,]
  industrypeers = tfdata_quarter_fitting_nonissuers[ 
 tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
  industrypeers = industrypeers[ 
 order(industrypeers$Market.Cap.13f), ]
  if ( nrow(industrypeers) > 0 ) {
  if ( nrow(industrypeers[industrypeers$Market.Cap.13f >= 
 arow$Market.Cap.13f, ]) > 0 ) {
  bestpeer = 
 industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1,]
  }
  else {
  bestpeer = industrypeers[nrow(industrypeers),]
  }
  bestpeer$Quarters.Since.IPO.Issue = 
 arow$Quarters.Since.IPO.Issue
  
 #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == 
 bestpeer$PERMNO] = 1
  result = rbind(result, as.matrix(bestpeer))
  }
  }
  #result = rbind(result, tfdata_quarter)
  print (aquarter)
  }
 
  result = as.data.frame(result)
  names(result) = colnames
  return(result)
 
 }
 
 = end of my function =
 
 
 




Re: [R] Improving data processing efficiency

2008-06-05 Thread Daniel Folkinshteyn
Thanks, I'll take a look at Rprof... but I think what i'm missing is 
facility with R idiom to get around the looping, and no amount of 
profiling will help me with that :)


also, full working code is provided in my original post (see toward the 
bottom).


on 06/05/2008 03:43 PM bartjoosen said the following:

Maybe you should provide a minimal, working code with data, so that we all
can give it a try.
In the mean time: take a look at the Rprof function to see where your code
can be improved.

Good luck

Bart


Daniel Folkinshteyn-2 wrote:

Hi everyone!

I have a question about data processing efficiency.

My data are as follows: I have a data set on quarterly institutional 
ownership of equities; some of them have had recent IPOs, some have not 
(I have a binary flag set). The total dataset size is 700k+ rows.


My goal is this: For every quarter since issue for each IPO, I need to 
find a matched firm in the same industry, and close in market cap. So, 
e.g., for firm X, which had an IPO, i need to find a matched non-issuing 
firm in quarter 1 since IPO, then a (possibly different) non-issuing 
firm in quarter 2 since IPO, etc. Repeat for each issuing firm (there 
are about 8300 of these).


Thus it seems to me that I need to be doing a lot of data selection and 
subsetting, and looping (yikes!), but the result appears to be highly 
inefficient and takes ages (well, many hours). What I am doing, in 
pseudocode, is this:


1. for each quarter of data, getting out all the IPOs and all the 
eligible non-issuing firms.
2. for each IPO in a quarter, grab all the non-issuers in the same 
industry, sort them by size, and finally grab a matching firm closest in 
size (the exact procedure is to grab the closest bigger firm if one 
exists, and just the biggest available if all are smaller)
3. assign the matched firm-observation the same quarters since issue 
as the IPO being matched

4. rbind them all into the matching dataset.

The function I currently have is pasted below, for your reference. Is 
there any way to make it produce the same result but much faster? 
Specifically, I am guessing eliminating some loops would be very good, 
but I don't see how, since I need to do some fancy footwork for each IPO 
in each quarter to find the matching firm. I'll be doing a few things 
similar to this, so it's somewhat important to up the efficiency of 
this. Maybe some of you R-fu masters can clue me in? :)


I would appreciate any help, tips, tricks, tweaks, you name it! :)

== my function below ===

fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, 
quarters_since_issue=40) {


 result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix is 
cheaper, so typecast the result to matrix


 colnames = names(tfdata)

 quarterends = sort(unique(tfdata$DATE))

 for (aquarter in quarterends) {
 tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]

 tfdata_quarter_fitting_nonissuers = tfdata_quarter[ 
(tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) & 
(tfdata_quarter$IPO.Flag == 0), ]
 tfdata_quarter_ipoissuers = tfdata_quarter[ 
tfdata_quarter$IPO.Flag == 1, ]


 for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
 arow = tfdata_quarter_ipoissuers[i,]
 industrypeers = tfdata_quarter_fitting_nonissuers[ 
tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
 industrypeers = industrypeers[ 
order(industrypeers$Market.Cap.13f), ]

 if ( nrow(industrypeers) > 0 ) {
 if ( nrow(industrypeers[industrypeers$Market.Cap.13f >= 
arow$Market.Cap.13f, ]) > 0 ) {
 bestpeer = 
industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1,]

 }
 else {
 bestpeer = industrypeers[nrow(industrypeers),]
 }
 bestpeer$Quarters.Since.IPO.Issue = 
arow$Quarters.Since.IPO.Issue
 
#tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == 
bestpeer$PERMNO] = 1

 result = rbind(result, as.matrix(bestpeer))
 }
 }
 #result = rbind(result, tfdata_quarter)
 print (aquarter)
 }

 result = as.data.frame(result)
 names(result) = colnames
 return(result)

}

= end of my function =






