Oh. In that case, I'd suggest checking only using the month and year, not
the day. You'll get some false positives, but the data should be small
enough to merge, I guess. It depends on your application whether that's
tolerable.

--Frank

On Sun, Mar 15, 2015 at 6:59 PM, Nathaniel Graham <[email protected]>
wrote:

> Thanks for the suggestion!  Using a data.table of all possible meeting
> dates & branches and joining it to the employment history didn't occur to
> me.  Unfortunately, even after tinkering with it for a bit, the join (even
> though it's a temporary structure) isn't feasible due to memory
> usage--meet.stratton and employ.hist joined produce a table of billions of
> rows.  So I guess I was wrong about memory not being an issue!
>
> A note about terminology, because I wasn't very clear: I define a Stratton
> 'alum' as someone that actually at Stratton-Oakmont; I don't have a term
> for the brokers that Stratton alums later meet, even though they're the
> ones I need to find.
>
> In case someone stumbles across this later:
>
> The meet.stratton table of possible dates and branches is specified as
> (and again, I hope the formatting comes through):
>
> meet.stratton <- unique(employ.hist[icrdn %in% stratton.people,
>                                     list(branch, date =
> as.Date(fromdate:todate)),
>                                     by = "job.index"],
>                         by = c("branch", "date"))
>
> The unique() call is important to get right.  Obviously, Frank didn't have
> the opportunity to experiment with the data (it's too big to pass around,
> and it's built from proprietary data).  Also, I use the branch rather than
> the whole firm, as it's not so clear that just working at the same firm is
> meaningful--many broker-dealers have branches all over the country.  It's
> also probably easier to drop Stratton people from the final results
> explicitly, doing something like:
>
> met.stratton.people <- met.stratton.people[!(icrdn %in% stratton.people)]
>
> I'm thinking about cooking something up using foverlaps(), although I'll
> need to learn its ins and outs first.
>
> -------
> Nathaniel Graham
> [email protected]
> [email protected]
> https://sites.google.com/site/npgraham1/
>
> On Sun, Mar 15, 2015 at 12:16 PM, Frank Erickson <[email protected]>
> wrote:
>
>> I'd suggest:
>>
>> (1) Get a table identifying the condition "meeting someone who at some
>> point works at Stratton." (They aren't really "alums" if they haven't
>> worked there yet, but this is the definition you seem to be looking for.)
>> You can do this by looking at any (firm,date) combinations that involve
>> bumping into such a person:
>>
>> meet.stratton <- unique(employ.hist[icrdn %in%
>> stratton.people,list(fcrdn,date=fromdate:todate)])
>>
>> (2) Find people who meet the conditions:
>>
>> setkey(employ.hist,fcrdn)
>> met.stratton.people <- employ.hist[meet.stratton,any(date>= startdate &
>> date <= todate),by="icrdn,fcrdn"][V1==TRUE,unique(icrdn)]
>>
>> (3) If you want to exclude Stratton folks, then use setdiff()
>>
>> --Frank
>>
>> On Sat, Mar 14, 2015 at 4:19 PM, Nathaniel Graham <[email protected]>
>> wrote:
>>
>>> There's particular problem I often have, and I'm hoping someone can tell
>>> me how to speed it up in data.table.  It seems to involve a sort of
>>> recursion that data.table (as I'm using it) doesn't do well with, where for
>>> each record in a set, I do a another search within the same table.  I hope
>>> the formatting of the code below is legible--it's a lot easier to read in
>>> the RStudio text editor!
>>>
>>> I have a moderately large (more than 3 million rows) data.table of the
>>> employment histories of brokers in the US.  Each row is an employment
>>> record, with a unique individual id (icrdn), a unique firm id (fcrdn), a
>>> branch identifier (branch), start and end dates (fromdate and todate), and
>>> a few other items (each row has a unique id as well, called job.index).
>>> For example, finding all the brokers that ever worked at Stratton Oakmont
>>> (from the Wolf of Wall Street):
>>>
>>> employ.hist[fcrdn == 18692, icrdn]
>>>
>>> where fcrdn is the firm identifier, 18692 is Stratton's ID, and icrdn is
>>> the individual identifier.
>>>
>>> What I want is to find all the individuals that ever met a Stratton
>>> alum.  Specifically, every icrdn such that the branch == a branch a
>>> Stratton alum ever worked at and the start and end dates overlap.  The only
>>> way I've found to do so involves something like this:
>>>
>>> find_brokers_by_single_branch <- cmpfun(function(sdt, edt, brnch) {
>>>   employ.hist[fromdate <= sdt & todate >= edt & branch == brnch,
>>>               list(icrdn, branch, job.index, fcrdn)]
>>> })
>>>
>>> stratton.people <- employ.hist[fcrdn == 18692, icrdn]
>>> stratton.contacts <- employ.hist[icrdn %in% stratton.people,
>>>                                  find_brokers_by_single_branch(fromdate,
>>> todate, branch),
>>>                                  by = "job.index"]
>>>
>>> This works, but effectively means calling the data.table '[' function
>>> thousands of times, once for each job entry
>>> a Stratton broker ever had (which are in the thousands, as many left
>>> before the government busted the place
>>> and are still in the industry).  It's quite slow, and I'm hoping someone
>>> can show me a way to speed it up, as I have
>>> many similar tasks, some of which are vastly larger.  Memory really
>>> isn't an issue for me (32 GB) and CPU shouldn't be either (Intel i7-4770
>>> 3.4GHz), in case that helps.
>>>
>>> -------
>>> Nathaniel Graham
>>> [email protected]
>>> [email protected]
>>> https://sites.google.com/site/npgraham1/
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> [email protected]
>>>
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>
>>
>
> _______________________________________________
> datatable-help mailing list
> [email protected]
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Reply via email to