Thanks Dr. Varoquax, it’s awesome you’re on this list, I’m a fan of your work!
Will look into this strategy. Best, Randy On Tue, Apr 3, 2018 at 8:57 AM Gael Varoquaux <gael.varoqu...@normalesup.org> wrote: > Matching to minimize a cost is known as the linear assignment problem, > can be solved in n^3 cost, and is implemented in scikit-learn in > sklearn.utils.linear_assignment_.linear_assignment or in recent versions > of scipy as scipy.optimize.linear_sum_assignment > > Of course, this problem will require much more coding (you need to build > your pairwise cost matrix) and much more computing cost (n^3 instead of > n^2) than a standard nearest-neighbor. > > Gaël > > On Mon, Apr 02, 2018 at 01:47:51PM -0400, Randy Ellis wrote: > > Hi Jake, > > > Thanks for the reply. Yes, trying this out resulted from looking for > ways in > > python to implement propensity score matching. I found a package, > pscore_match > > (http://www.kellieottoboni.com/pscore_match/), but the matching was > really > > terrible. Specifically, I'm matching based on age, race, gender, HIV > status, > > hepatitis C status, and sickle-cell disease status. Using > NearestNeighbors for > > matching performed WAY better, I was so surprised at how well every > factor was > > matched for. The only issue is that it uses replacement. > > > Here's what I'm currently testing. I need each case to match to 20 > controls, so > > since NearestNeighbors uses replacement, I'm matching each case to many > > controls (15000), taking all of the distances for all of the pairs, and > > retaining only the smallest distances for each control. Since many > controls are > > re-used (since the algorithm uses replacement), the hope is that enough > > controls are matched to many different cases so that each case ends up > being > > matched to 20 unique controls. Does this method make sense?? > > > Best, > > > Randy > > > On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas < > jake...@cs.washington.edu> > > wrote: > > > On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis <randalljel...@gmail.com > > > > wrote: > > > Hello to the Scikit-learn community! > > > I am doing case-control matching for an electronic health records > > study. My question is, is it possible to run Sklearn's > NearestNeighbors > > function without replacement? As in, match the treated group to > the > > untreated group without re-using any of the untreated group data > > points? If so, how? By default, it uses replacement. I know this > > because I tested it on some data of mine. > > > The code I used is in the confirmed answer here: https:// > > stats.stackexchange.com/questions/206832/matched-pai > > rs-in-python-propensity-score-matching > > > Thanks so much in advance, > > > > No, pairwise matching without replacement is not implemented within > > scikit-learn's nearest neighbors routines. > > > It seems like an algorithm you would have to think carefully about > because > > the number of potential pairs grows exponentially with the number of > > points, and I don't think it's true that choosing the nearest > available > > neighbor of points in sequence will guarantee you to find the optimal > > configuration. You'd also have to carefully define what you mean by > > "optimal"... are you seeking to minimize the sum of all distances? > The sum > > of squared distances? The maximum distance? The results would change > > depending on the metric you define. And you'd probably have to > figure out > > some way to reduce the exponential search space in order to > calculate the > > result in a reasonable amount of time for your data. > > > You might look into the literature on propensity score matching; I > think > > that's one area where this kind of neighbors-without-replacement > algorithm > > is often used. > > > Best, > > Jake > > > -- > Gael Varoquaux > Senior Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn@python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- *Randall J. Ellis, B.S.* PhD Student, Biomedical Science, Mount Sinai Special Volunteer, http://www.michaelideslab.org/, NIDA IRP Cell: (954)-260-9891
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn