Oh. In that case, I'd suggest matching on month and year only, not the day. You'll get some false positives (people who were at the same branch in the same month but never on the same day), but the data should be small enough to merge, I guess. It depends on your application whether that's tolerable.
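
A minimal sketch of the month-level check, on toy data. The column names follow the thread; the toy rows, the month_floor() and month_seq() helpers, and the ym, from.month and to.month columns are assumptions made up for illustration. It also assumes a data.table version that supports the on= argument.

```r
library(data.table)

# Toy stand-in for employ.hist (made-up rows; the real table is proprietary).
employ.hist <- data.table(
  job.index = 1:5,
  icrdn     = c(1L, 1L, 2L, 3L, 4L),
  fcrdn     = c(18692L, 500L, 500L, 500L, 500L),
  branch    = c("A", "B", "B", "B", "B"),
  fromdate  = as.Date(c("1995-01-01", "1999-01-01", "1998-01-01",
                        "2000-06-20", "2005-01-01")),
  todate    = as.Date(c("1996-12-31", "2000-06-10", "1999-03-15",
                        "2000-12-31", "2006-12-31"))
)

# Hypothetical helpers: first day of a date's month, and the sequence of
# month starts touched by the interval [from, to].
month_floor <- function(d) as.Date(format(d, "%Y-%m-01"))
month_seq   <- function(from, to) seq(month_floor(from), month_floor(to), by = "month")

stratton.people <- unique(employ.hist[fcrdn == 18692, icrdn])

# One row per (branch, month) in which a Stratton alum was present,
# instead of one row per (branch, day).
meet.month <- unique(
  employ.hist[icrdn %in% stratton.people,
              list(branch, ym = month_seq(fromdate, todate)),
              by = "job.index"][, list(branch, ym)]
)

# Month bounds of every job, so the comparison happens at month resolution.
employ.hist[, `:=`(from.month = month_floor(fromdate),
                   to.month   = month_floor(todate))]

# Join on branch, then keep the jobs whose month range covers a meeting month.
candidates <- employ.hist[meet.month, on = "branch", allow.cartesian = TRUE][
  ym >= from.month & ym <= to.month, unique(icrdn)]
candidates <- setdiff(candidates, stratton.people)
```

In this toy data, broker 3 is a false positive: that job starts at branch B on 2000-06-20, ten days after the Stratton alum left, but both jobs touch June 2000.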
--Frank

On Sun, Mar 15, 2015 at 6:59 PM, Nathaniel Graham <[email protected]> wrote:

> Thanks for the suggestion! Using a data.table of all possible meeting
> dates & branches and joining it to the employment history didn't occur to
> me. Unfortunately, even after tinkering with it for a bit, the join (even
> though it's a temporary structure) isn't feasible due to memory
> usage--meet.stratton and employ.hist joined produce a table of billions of
> rows. So I guess I was wrong about memory not being an issue!
>
> A note about terminology, because I wasn't very clear: I define a Stratton
> 'alum' as someone that actually worked at Stratton-Oakmont; I don't have a
> term for the brokers that Stratton alums later meet, even though they're
> the ones I need to find.
>
> In case someone stumbles across this later:
>
> The meet.stratton table of possible dates and branches is specified as
> (and again, I hope the formatting comes through):
>
> meet.stratton <- unique(employ.hist[icrdn %in% stratton.people,
>                                     list(branch, date = as.Date(fromdate:todate)),
>                                     by = "job.index"],
>                         by = c("branch", "date"))
>
> The unique() call is important to get right. Obviously, Frank didn't have
> the opportunity to experiment with the data (it's too big to pass around,
> and it's built from proprietary data). Also, I use the branch rather than
> the whole firm, as it's not so clear that just working at the same firm is
> meaningful--many broker-dealers have branches all over the country. It's
> also probably easier to drop Stratton people from the final results
> explicitly, doing something like:
>
> met.stratton.people <- met.stratton.people[!(icrdn %in% stratton.people)]
>
> I'm thinking about cooking something up using foverlaps(), although I'll
> need to learn its ins and outs first.
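
For anyone reading later, here is a rough sketch of how the foverlaps() idea mentioned above might look. It is not taken from the thread: the toy rows and the stratton.jobs, hits and met names are assumptions for illustration, and it treats the date ranges as closed intervals.

```r
library(data.table)

# Toy stand-in for employ.hist (made-up rows; the real table is proprietary).
employ.hist <- data.table(
  job.index = 1:5,
  icrdn     = c(1L, 1L, 2L, 3L, 4L),
  fcrdn     = c(18692L, 500L, 500L, 500L, 500L),
  branch    = c("A", "B", "B", "B", "B"),
  fromdate  = as.Date(c("1995-01-01", "1999-01-01", "1998-01-01",
                        "2000-06-20", "2005-01-01")),
  todate    = as.Date(c("1996-12-31", "2000-06-10", "1999-03-15",
                        "2000-12-31", "2006-12-31"))
)

stratton.people <- unique(employ.hist[fcrdn == 18692, icrdn])

# Every (branch, interval) a Stratton alum ever held, kept as intervals
# rather than expanded to one row per day.
stratton.jobs <- unique(
  employ.hist[icrdn %in% stratton.people, list(branch, fromdate, todate)]
)

# foverlaps() needs y keyed, with the interval start and end as the
# last two key columns.
setkey(stratton.jobs, branch, fromdate, todate)

# Overlap join: same branch, date ranges sharing at least one day.
# nomatch = 0L drops jobs that overlap no Stratton alum interval.
hits <- foverlaps(employ.hist, stratton.jobs,
                  by.x = c("branch", "fromdate", "todate"),
                  by.y = key(stratton.jobs),
                  type = "any", nomatch = 0L)

# Brokers who met an alum, excluding the alums themselves.
met <- setdiff(unique(hits$icrdn), stratton.people)
```

Because the intervals are never expanded to individual dates, the intermediate table grows with the number of overlapping job pairs rather than with the number of days, which may help with the memory problem described above.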
>
> -------
> Nathaniel Graham
> [email protected]
> [email protected]
> https://sites.google.com/site/npgraham1/
>
> On Sun, Mar 15, 2015 at 12:16 PM, Frank Erickson <[email protected]>
> wrote:
>
>> I'd suggest:
>>
>> (1) Get a table identifying the condition "meeting someone who at some
>> point works at Stratton." (They aren't really "alums" if they haven't
>> worked there yet, but this is the definition you seem to be looking for.)
>> You can do this by looking at any (firm,date) combinations that involve
>> bumping into such a person:
>>
>> meet.stratton <- unique(employ.hist[icrdn %in% stratton.people,
>>                                     list(fcrdn, date = fromdate:todate)])
>>
>> (2) Find people who meet the conditions:
>>
>> setkey(employ.hist, fcrdn)
>> met.stratton.people <- employ.hist[meet.stratton,
>>                                    any(date >= fromdate & date <= todate),
>>                                    by = "icrdn,fcrdn"][V1 == TRUE, unique(icrdn)]
>>
>> (3) If you want to exclude Stratton folks, then use setdiff()
>>
>> --Frank
>>
>> On Sat, Mar 14, 2015 at 4:19 PM, Nathaniel Graham <[email protected]>
>> wrote:
>>
>>> There's a particular problem I often have, and I'm hoping someone can
>>> tell me how to speed it up in data.table. It seems to involve a sort of
>>> recursion that data.table (as I'm using it) doesn't do well with, where
>>> for each record in a set, I do another search within the same table. I
>>> hope the formatting of the code below is legible--it's a lot easier to
>>> read in the RStudio text editor!
>>>
>>> I have a moderately large (more than 3 million rows) data.table of the
>>> employment histories of brokers in the US. Each row is an employment
>>> record, with a unique individual id (icrdn), a unique firm id (fcrdn), a
>>> branch identifier (branch), start and end dates (fromdate and todate), and
>>> a few other items (each row has a unique id as well, called job.index).
>>> For example, finding all the brokers that ever worked at Stratton Oakmont
>>> (from the Wolf of Wall Street):
>>>
>>> employ.hist[fcrdn == 18692, icrdn]
>>>
>>> where fcrdn is the firm identifier, 18692 is Stratton's ID, and icrdn is
>>> the individual identifier.
>>>
>>> What I want is to find all the individuals that ever met a Stratton
>>> alum. Specifically, every icrdn such that the branch == a branch a
>>> Stratton alum ever worked at and the start and end dates overlap. The only
>>> way I've found to do so involves something like this:
>>>
>>> find_brokers_by_single_branch <- cmpfun(function(sdt, edt, brnch) {
>>>   employ.hist[fromdate <= sdt & todate >= edt & branch == brnch,
>>>               list(icrdn, branch, job.index, fcrdn)]
>>> })
>>>
>>> stratton.people <- employ.hist[fcrdn == 18692, icrdn]
>>> stratton.contacts <- employ.hist[icrdn %in% stratton.people,
>>>                                  find_brokers_by_single_branch(fromdate, todate, branch),
>>>                                  by = "job.index"]
>>>
>>> This works, but effectively means calling the data.table '[' function
>>> thousands of times, once for each job entry a Stratton broker ever had
>>> (which are in the thousands, as many left before the government busted
>>> the place and are still in the industry). It's quite slow, and I'm hoping
>>> someone can show me a way to speed it up, as I have many similar tasks,
>>> some of which are vastly larger. Memory really isn't an issue for me
>>> (32 GB) and CPU shouldn't be either (Intel i7-4770 3.4GHz), in case that
>>> helps.
>>>
>>> -------
>>> Nathaniel Graham
>>> [email protected]
>>> [email protected]
>>> https://sites.google.com/site/npgraham1/
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> [email protected]
>>>
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>
>>
>
> _______________________________________________
> datatable-help mailing list
> [email protected]
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
