So, I created a WIP PR dedicated to NCA: https://github.com/scikit-learn/scikit-learn/pull/4789
As suggested by Michael, I refactored "the meat" into a function. I also rewrote it as a first order oracle, so I can (and I do) use scipy's optimizers. I've seen scipy.optimize.minimize (apparently, with BFGS) sometimes stopping at some weird point (a local minimum / saddle point?), whereas gradient descent seems to always converge. Though, I didn't test either of them extensively. I also fully vectorized function and gradient calculations, no loops involved. So, what's the consensus on benchmarks? I can share ipython notebooks via gist, for example. On Fri, May 29, 2015 at 10:51 AM, Michael Eickenberg < michael.eickenb...@gmail.com> wrote: > Hi Aurélien, > > thanks for these very good pointers! > (Now we also know who else to bug periodically for opinions ;)) > > Michael > > > On Fri, May 29, 2015 at 12:05 AM, Aurélien Bellet < > aurelien.bel...@telecom-paristech.fr> wrote: > >> Hi everyone, >> >> A few additional things to consider for scaling-up NCA to large datasets: >> >> - Take a look at the t-SNE (technique for visualization/dim reduction >> very similar to NCA) implementations, I think they have a few speed-up >> tricks that you could potentially re-use: >> http://lvdmaaten.github.io/tsne/ >> >> - Like you said, SGD can help reduce the computational cost - you could >> also consider recent improvements of SGD, such as SAG/SAGA, SVRG, etc. >> >> - Similarly to what was suggested in previous replies, a general idea is >> to only consider a neighborhood around each point (either fixed in >> advance, or updated every now and then during the course of >> optimization), since the probabilities decrease very fast with the >> distance so farther points can be safely ignored in the computation. >> This is explored for instance in: >> http://dl.acm.org/citation.cfm?id=2142432 >> >> - Another related idea is to construct class representatives (for >> instance using k-means), and to model the distribution only wrt these >> points instead of the entire dataset. This is especially useful if some >> classes are very large. An extreme version of this is to reframe NCA for >> a Nearest Class Mean classifier, where each class is only modeled by its >> center: >> >> https://hal.archives-ouvertes.fr/file/index/docid/722313/filename/mensink12eccv.final.pdf >> >> Hope this helps. >> >> Aurelien >> >> Le 5/28/15 11:20 PM, Andreas Mueller a écrit : >> > >> > >> > On 05/28/2015 05:11 PM, Michael Eickenberg wrote: >> >> >> >> Code-wise, I would attack the problem as a function first. Write a >> >> function that takes X and y (plus maybe some options) and gives back >> >> L. You can put a skeleton of a sklearn estimator around it by calling >> >> this function from fit. >> >> Please keep your code either in a sklearn WIP PR or a public gist, so >> >> it can be reviewed. Writing benchmarks can be framed as writing >> >> examples, i.e. plot_* functions (maybe Andy or Olivier have a comment >> >> on how benchmarks have been handled in the past?). >> >> >> > There is a "benchmark" folder, which is in a horrible shape. >> > Basically there are three ways to do it: examples (with or without plot >> > depending on the runtime), a script in the benchmark folder, or a gist. >> > Often we just use a gist and the PR person posts the output. Not that >> > great for reproducibility, though. >> > >> > >> ------------------------------------------------------------------------------ >> > _______________________________________________ >> > Scikit-learn-general mailing list >> > Scikit-learn-general@lists.sourceforge.net >> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general >> > >> >> >> ------------------------------------------------------------------------------ >> _______________________________________________ >> Scikit-learn-general mailing list >> Scikit-learn-general@lists.sourceforge.net >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general >> > > > > ------------------------------------------------------------------------------ > > _______________________________________________ > Scikit-learn-general mailing list > Scikit-learn-general@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general > >
------------------------------------------------------------------------------
_______________________________________________ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general