Re: [R] Improving data processing efficiency

Charles C. Berry Fri, 06 Jun 2008 17:24:35 -0700

On Fri, 6 Jun 2008, Daniel Folkinshteyn wrote:

 install.packages("profr")
 library(profr)
 p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(...))
 plot(p)
 That should at least help you see where the slow bits are.

 Hadley
so profiling reveals that '[.data.frame' and '[[.data.frame' and '[' are the biggest timesuckers...
i suppose i'll try using matrices and see how that stacks up (since all my cols are numeric, should be a problem-free approach).
but i'm really wondering if there isn't some neat vectorized approach i could use to avoid at least one of the nested loops...

As far as a vectorized solution, I'll bet you could do ALL the lookups of non-issuers for all issuers with a single call to findInterval() (modulo some cleanup afterwards), but the trickery needed to do that would make your code a bit opaque.

And in the end I doubt it would beat mapply() (read on...) by enough to make it worthwhile.


---

What you are doing is conditional on industry group and quarter.

So using

        indus.quarter <- with(tfdat,
                paste(as.character(DATE), as.character(HSICIG), sep="."))

and then calls like this:

        split( <various> , indus.quarter[ relevant.subset ] )

you can create:

        a list of all issuer market caps according to quarter and group,

        a list of all non-issuer caps (that satisfy your 'since quarter'
        restriction) according to quarter and group,

        a list of all non issuer indexes (i.e. row numbers) that satisfy
        that restriction according to quarter and group
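
A minimal sketch of building those three lists, on toy data. DATE and HSICIG are the column names used above; the is.issuer and cap columns and the data values are made up for illustration, and the 'since quarter' restriction is only marked by a comment because its exact form is not shown in this thread.

```r
# Toy data. DATE and HSICIG come from the thread; is.issuer and cap
# are hypothetical stand-ins for the real columns.
tfdat <- data.frame(
  DATE      = c(1, 1, 1, 1, 2, 2),
  HSICIG    = c(10, 10, 10, 20, 10, 10),
  is.issuer = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE),
  cap       = c(50, 40, 60, 55, 30, 35)
)

# One key per row: quarter and industry group pasted together.
indus.quarter <- with(tfdat, paste(DATE, HSICIG, sep = "."))

# A factor keeps every level in every split, so the three lists
# have the same names in the same order (possibly with empty elements).
iq <- factor(indus.quarter)

issuer     <- tfdat$is.issuer
non.issuer <- !tfdat$is.issuer  # the 'since quarter' restriction would be applied here too

issuer.caps    <- split(tfdat$cap[issuer],     iq[issuer])
nonissuer.caps <- split(tfdat$cap[non.issuer], iq[non.issuer])
nonissuer.idx  <- split(which(non.issuer),     iq[non.issuer])
```

Splitting on a factor rather than on the raw character key is a design choice here: it keeps the lists aligned, which matters once they are walked in parallel.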

Then you write a function that takes the elements of each list for a given quarter-industry group, looks up the matching non-issuers for each issuer, and returns their indexes.

findInterval() will allow you to do this lookup for all issuers in one industry group in a given quarter simultaneously and greatly speed this process (but you will need to deal with the possible non-uniqueness of the non-issuer caps - perhaps by adding a tiny jitter() to the values).
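
A rough sketch of such a lookup function for one quarter-industry group. It assumes the goal is, for each issuer, the non-issuer with the closest cap at or below it and the closest cap above it; that matching rule, the name match_peers and the argument names are my assumptions, not something taken from your code.

```r
# Hypothetical helper: for each issuer cap, return the row index of the
# nearest non-issuer at or below it and of the nearest one above it.
match_peers <- function(issuer.cap, peer.cap, peer.idx) {
  ord        <- order(peer.cap)      # findInterval() needs sorted breakpoints
  sorted.cap <- peer.cap[ord]
  sorted.idx <- peer.idx[ord]
  n          <- length(sorted.cap)

  # pos[i] = number of peers whose cap is <= issuer.cap[i]
  pos <- findInterval(issuer.cap, sorted.cap)

  # NA where there is no peer on that side
  below <- sorted.idx[replace(pos,     pos == 0, NA)]
  above <- sorted.idx[replace(pos + 1, pos == n, NA)]
  cbind(below = below, above = above)
}

# Three issuers against two non-issuers sitting in rows 2 and 3
res <- match_peers(issuer.cap = c(50, 10, 70),
                   peer.cap   = c(40, 60),
                   peer.idx   = c(2L, 3L))
```

With tied non-issuer caps, findInterval() lands on the last of the tied values, which is where the jitter() suggestion comes in.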


Then you feed the function and the lists to mapply().

The result is a list of indexes on the original data.frame. You can unsplit() this if you like, then use those indexes to build your final "result" data.frame.
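
Roughly, the mapply() and unsplit() steps could look like the following. The three input lists are hand-made stand-ins for what split() would give you, nearest_below is a deliberately simplified lookup (only the closest non-issuer at or below each issuer), and issuer.group is a hypothetical factor giving the group of each issuer in its original row order.

```r
# Hypothetical per-group inputs, shaped like the output of split()
issuer.caps    <- list("1.10" = c(50, 70), "2.10" = 30)
nonissuer.caps <- list("1.10" = c(40, 60), "2.10" = c(35, 20))
nonissuer.idx  <- list("1.10" = c(2L, 3L), "2.10" = c(6L, 7L))

# Simplified lookup: row index of the largest non-issuer cap <= issuer cap
nearest_below <- function(issuer.cap, peer.cap, peer.idx) {
  ord <- order(peer.cap)
  pos <- findInterval(issuer.cap, peer.cap[ord])
  peer.idx[ord][replace(pos, pos == 0, NA)]
}

# One call per quarter-industry group; SIMPLIFY = FALSE keeps a list
matches <- mapply(nearest_below, issuer.caps, nonissuer.caps, nonissuer.idx,
                  SIMPLIFY = FALSE)

# Put the matches back into the original issuer order
issuer.group <- factor(c("1.10", "2.10", "1.10"))
flat <- unsplit(matches, issuer.group)
```

The flat vector can then be used to index the rows of the original data.frame in a single step.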


HTH,

Chuck

p.s. and if this all seems like too much work, you should at least avoid needlessly creating data.frames. Specifically


reorder things so that

           industrypeers = <etc>

is only done ONCE for each industry group by quarter combination and change stuff like


nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0

to

any( industrypeers$Market.Cap.13f >= arow$Market.Cap.13f )
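
A small check that the two forms agree, on made-up values (only the names industrypeers, arow and Market.Cap.13f come from your code; the other column is a hypothetical filler). The first form builds a whole new data.frame just to count its rows; the second only ever touches one logical vector.

```r
# Toy stand-ins for the objects in the original loop
industrypeers <- data.frame(Market.Cap.13f = c(40, 60, 55),
                            other          = c("a", "b", "c"))
arow <- data.frame(Market.Cap.13f = 50)

# Subsets the data.frame, then counts the rows of the copy
slow <- nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0

# Same answer from the comparison vector alone
fast <- any(industrypeers$Market.Cap.13f >= arow$Market.Cap.13f)

identical(slow, fast)
```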

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:[EMAIL PROTECTED]                  UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
