Hi Jake,

Thank you for the feedback. Yeah, working without replacement, certain cases are going to get more appropriate matches than others. I proposed the idea of using replacement and compensating for the re-use of controls with frequency weighting, but you gotta do what your PI tells you sometimes! :P
Best,
Randy

On Mon, Apr 2, 2018 at 2:15 PM, Jacob Vanderplas <jake...@cs.washington.edu> wrote:

> Hi Randy,
> I think that approach is probably a good heuristic, but it will not
> necessarily find the optimal result. That said, if you don't care about
> having guarantees that you're finding the optimal pairing, but only that
> you can find a reasonable set of pairs, it will probably work out fine.
> Jake
>
> Jake VanderPlas
> Senior Data Science Fellow
> Director of Open Software
> University of Washington eScience Institute
>
> On Mon, Apr 2, 2018 at 10:47 AM, Randy Ellis <randalljel...@gmail.com>
> wrote:
>
>> Hi Jake,
>>
>> Thanks for the reply. Yes, trying this out resulted from looking for ways
>> in python to implement propensity score matching. I found a package,
>> pscore_match (http://www.kellieottoboni.com/pscore_match/), but the
>> matching was really terrible. Specifically, I'm matching based on age,
>> race, gender, HIV status, hepatitis C status, and sickle-cell disease
>> status. Using NearestNeighbors for matching performed WAY better, I was so
>> surprised at how well every factor was matched for. The only issue is that
>> it uses replacement.
>>
>> Here's what I'm currently testing. I need each case to match to 20
>> controls, so since NearestNeighbors uses replacement, I'm matching each
>> case to many controls (15000), taking all of the distances for all of the
>> pairs, and retaining only the smallest distances for each control. Since
>> many controls are re-used (since the algorithm uses replacement), the hope
>> is that enough controls are matched to many different cases so that each
>> case ends up being matched to 20 unique controls. Does this method make
>> sense??
>>
>> Best,
>>
>> Randy
>>
>> On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas <jake...@cs.washington.edu> wrote:
>>
>>> On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis <randalljel...@gmail.com>
>>> wrote:
>>>
>>>> Hello to the Scikit-learn community!
>>>>
>>>> I am doing case-control matching for an electronic health records
>>>> study. My question is, is it possible to run Sklearn's NearestNeighbors
>>>> function without replacement? As in, match the treated group to the
>>>> untreated group without re-using any of the untreated group data points? If
>>>> so, how? By default, it uses replacement. I know this because I tested it
>>>> on some data of mine.
>>>>
>>>> The code I used is in the confirmed answer here:
>>>> https://stats.stackexchange.com/questions/206832/matched-pairs-in-python-propensity-score-matching
>>>>
>>>> Thanks so much in advance,
>>>>
>>>
>>> No, pairwise matching without replacement is not implemented within
>>> scikit-learn's nearest neighbors routines.
>>>
>>> It seems like an algorithm you would have to think carefully about
>>> because the number of potential pairs grows exponentially with the number
>>> of points, and I don't think it's true that choosing the nearest available
>>> neighbor of points in sequence will guarantee you to find the optimal
>>> configuration. You'd also have to carefully define what you mean by
>>> "optimal"... are you seeking to minimize the sum of all distances? The sum
>>> of squared distances? The maximum distance? The results would change
>>> depending on the metric you define. And you'd probably have to figure out
>>> some way to reduce the exponential search space in order to calculate the
>>> result in a reasonable amount of time for your data.
>>>
>>> You might look into the literature on propensity score matching; I think
>>> that's one area where this kind of neighbors-without-replacement algorithm
>>> is often used.
>>>
>>> Best,
>>> Jake

--
*Randall J. Ellis, B.S.*
PhD Student, Biomedical Science, Mount Sinai
Special Volunteer, http://www.michaelideslab.org/, NIDA IRP
Cell: (954)-260-9891
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
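Editorial sketch, not code from the thread: the point above that nearest-available matching is a reasonable heuristic but is not guaranteed to be optimal can be seen on a toy example. Everything here is an assumption made for illustration: the function names (greedy_match, optimal_match, total_distance), the one-dimensional integer scores, and the choice of "sum of absolute distances" as the objective. The greedy function is one possible without-replacement variant and may differ from what Randy actually ran; the exhaustive function only works on tiny inputs.

```python
from itertools import permutations


def greedy_match(cases, controls, k=1):
    """Greedy 1:k matching without replacement (illustrative only).

    Sort every (case, control) pair by distance, then accept a pair only
    if the control is still unused and the case still needs controls.
    """
    pairs = sorted(
        (abs(c - t), i, j)
        for i, c in enumerate(cases)
        for j, t in enumerate(controls)
    )
    matches = {i: [] for i in range(len(cases))}
    used = set()
    for _dist, i, j in pairs:
        if j not in used and len(matches[i]) < k:
            matches[i].append(j)
            used.add(j)
    return matches


def optimal_match(cases, controls):
    """Exhaustive 1:1 matching that minimizes the sum of distances.

    Tries every ordered selection of controls, so the work grows
    factorially; usable on toy inputs only.
    """
    best = min(
        permutations(range(len(controls)), len(cases)),
        key=lambda perm: sum(
            abs(cases[i] - controls[j]) for i, j in enumerate(perm)
        ),
    )
    return {i: [j] for i, j in enumerate(best)}


def total_distance(cases, controls, matches):
    """Sum of absolute differences over all matched pairs."""
    return sum(
        abs(cases[i] - controls[j])
        for i, js in matches.items()
        for j in js
    )


cases = [0, 11]      # hypothetical scores for two cases
controls = [10, 25]  # hypothetical scores for two controls

greedy = greedy_match(cases, controls)
optimal = optimal_match(cases, controls)
print(total_distance(cases, controls, greedy))   # 26
print(total_distance(cases, controls, optimal))  # 24
```

Greedy grabs the closest pair first (11 with 10) and leaves the other case with a poor control, which is the "certain cases get more appropriate matches than others" effect. If the objective really is the sum of distances for 1:1 matching, that is the classic assignment problem, and a dedicated solver such as scipy.optimize.linear_sum_assignment is probably a better starting point than brute force for realistic sizes; other objectives (maximum distance, 1:20 matching) would need a different formulation.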