Thanks for the update.

> So, what's the consensus on benchmarks? I can share IPython notebooks
> via gist, for example.

My (weak) preference would be to have a script within the sklearn repo,
just to keep stuff in one place for easy future reference.

Michael
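For reference, a repo benchmark script along those lines can stay very small. The sketch below just times the O(n^2) softmax computation that dominates each NCA objective evaluation; the file name bench_nca.py, the dimensions, and the sizes are made up for illustration:

    # Hypothetical benchmarks/bench_nca.py: time the O(n^2) softmax that
    # dominates each NCA objective and gradient evaluation.
    from time import time

    import numpy as np

    rng = np.random.RandomState(0)
    for n_samples in [250, 500, 1000]:
        Z = rng.randn(n_samples, 10)          # stand-in for embedded points
        tic = time()
        dist = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
        np.fill_diagonal(dist, np.inf)        # exclude self-matches
        P = np.exp(-dist)
        P /= P.sum(axis=1, keepdims=True)     # softmax over neighbors
        print("n_samples=%4d: %.3f s" % (n_samples, time() - tic))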
On Fri, May 29, 2015 at 6:24 PM, Artem <barmaley....@gmail.com> wrote:

> So, I created a WIP PR dedicated to NCA:
> https://github.com/scikit-learn/scikit-learn/pull/4789
>
> As suggested by Michael, I refactored "the meat" into a function. I also
> rewrote it as a first-order oracle, so I can (and do) use scipy's
> optimizers. I've seen scipy.optimize.minimize (apparently with BFGS)
> sometimes stop at a weird point (a local minimum / saddle point?),
> whereas gradient descent seems to always converge. I haven't tested
> either of them extensively, though.
>
> I also fully vectorized the function and gradient calculations; no loops
> are involved.
>
> So, what's the consensus on benchmarks? I can share IPython notebooks
> via gist, for example.
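For concreteness, here is a minimal sketch of the kind of first-order oracle Artem describes: one fully vectorized function returning the (negated) NCA objective and its gradient, which scipy.optimize.minimize can consume with jac=True. The gradient follows the formula from the original NCA paper (Goldberger et al., 2004); all names and details are illustrative assumptions, not the actual PR code:

    import numpy as np
    from scipy.optimize import minimize

    def nca_oracle(L_flat, X, y, n_components):
        # Negated NCA objective and gradient, fully vectorized.
        n_samples, n_features = X.shape
        L = L_flat.reshape(n_components, n_features)
        Z = X @ L.T                                  # embedded points, (n, k)

        diff = Z[:, None, :] - Z[None, :, :]         # (n, n, k) pairwise diffs
        dist = (diff ** 2).sum(axis=-1)              # squared distances
        np.fill_diagonal(dist, np.inf)               # a point never picks itself

        logits = -dist
        logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)            # p_ij, rows sum to one

        same = y[:, None] == y[None, :]              # same-class indicator
        p_i = (P * same).sum(axis=1)                 # P(i classified correctly)
        f = p_i.sum()                                # objective to maximize

        # Gradient (Goldberger et al., eq. 5), with w_ij = p_i * p_ij minus
        # p_ij for same-class j, vectorized via the identity
        # sum_ij w_ij (x_i - x_j)(x_i - x_j)^T = X^T (diag(r + c) - W - W^T) X
        # where r and c are the row and column sums of W.
        W = p_i[:, None] * P - P * same
        s = W.sum(axis=1) + W.sum(axis=0)
        M = (X * s[:, None]).T @ X - X.T @ (W + W.T) @ X
        grad = 2 * L @ M

        return -f, -grad.ravel()                     # minimize the negative

    # Usage sketch: learn a 2D metric on iris.
    from sklearn.datasets import load_iris
    iris = load_iris()
    X, y = iris.data, iris.target
    rng = np.random.RandomState(0)
    L0 = rng.randn(2, X.shape[1]).ravel() * 0.1
    res = minimize(nca_oracle, L0, args=(X, y, 2), jac=True, method="L-BFGS-B")
    L_learned = res.x.reshape(2, X.shape[1])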
> On Fri, May 29, 2015 at 10:51 AM, Michael Eickenberg
> <michael.eickenb...@gmail.com> wrote:
>
>> Hi Aurélien,
>>
>> thanks for these very good pointers!
>> (Now we also know who else to bug periodically for opinions ;))
>>
>> Michael
>>
>> On Fri, May 29, 2015 at 12:05 AM, Aurélien Bellet
>> <aurelien.bel...@telecom-paristech.fr> wrote:
>>
>>> Hi everyone,
>>>
>>> A few additional things to consider for scaling up NCA to large
>>> datasets:
>>>
>>> - Take a look at the t-SNE implementations (a technique for
>>> visualization / dimensionality reduction that is very similar to NCA);
>>> I think they have a few speed-up tricks you could potentially reuse:
>>> http://lvdmaaten.github.io/tsne/
>>>
>>> - Like you said, SGD can help reduce the computational cost. You could
>>> also consider recent improvements on SGD, such as SAG/SAGA, SVRG, etc.
>>>
>>> - Similarly to what was suggested in previous replies, a general idea
>>> is to only consider a neighborhood around each point (either fixed in
>>> advance, or updated every now and then during optimization). Since the
>>> probabilities decrease very fast with distance, farther points can be
>>> safely ignored in the computation. This is explored, for instance, in:
>>> http://dl.acm.org/citation.cfm?id=2142432
>>>
>>> - Another related idea is to construct class representatives (for
>>> instance using k-means) and to model the distribution only with
>>> respect to these points instead of the entire dataset. This is
>>> especially useful if some classes are very large. An extreme version
>>> of this reframes NCA as a Nearest Class Mean classifier, where each
>>> class is modeled only by its center:
>>> https://hal.archives-ouvertes.fr/file/index/docid/722313/filename/mensink12eccv.final.pdf
>>>
>>> Hope this helps.
>>>
>>> Aurelien
>>>
>>> On 5/28/15 11:20 PM, Andreas Mueller wrote:
>>> >
>>> > On 05/28/2015 05:11 PM, Michael Eickenberg wrote:
>>> >>
>>> >> Code-wise, I would attack the problem as a function first. Write a
>>> >> function that takes X and y (plus maybe some options) and gives
>>> >> back L. You can put a skeleton of a sklearn estimator around it by
>>> >> calling this function from fit.
>>> >> Please keep your code either in a sklearn WIP PR or a public gist,
>>> >> so it can be reviewed. Writing benchmarks can be framed as writing
>>> >> examples, i.e. plot_* functions (maybe Andy or Olivier have a
>>> >> comment on how benchmarks have been handled in the past?).
>>> >>
>>> > There is a "benchmark" folder, but it is in horrible shape.
>>> > Basically there are three ways to do it: examples (with or without a
>>> > plot, depending on the runtime), a script in the benchmark folder,
>>> > or a gist. Often we just use a gist and the PR author posts the
>>> > output. Not that great for reproducibility, though.
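To make Aurélien's neighborhood suggestion concrete: restrict the softmax to each point's k nearest neighbors, so the negligible far-away terms are dropped entirely. A rough sketch follows; fixing k once in the input space and using sklearn's NearestNeighbors are illustrative choices, not necessarily what the cited paper does:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def truncated_softmax_probs(Z, neighbors):
        # NCA's p_ij restricted to precomputed neighbor sets: O(n * k) time
        # and memory instead of O(n^2); far points get probability exactly 0.
        d = ((Z[:, None, :] - Z[neighbors]) ** 2).sum(axis=-1)  # (n, k)
        E = np.exp(-(d - d.min(axis=1, keepdims=True)))         # stabilized
        return E / E.sum(axis=1, keepdims=True)                 # rows sum to 1

    rng = np.random.RandomState(0)
    X = rng.randn(500, 10)

    # Fix the neighborhoods once in the input space; they could instead be
    # refreshed every few iterations as the learned metric changes.
    k = 50
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    neighbors = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop self
    P = truncated_softmax_probs(X, neighbors)                   # here Z = X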
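Finally, tying this back to Michael's earlier advice (a standalone function that takes X and y and gives back L, with a skeleton of a sklearn estimator around it calling that function from fit), the wrapper might look roughly like the following. The class name, parameters, and attributes are invented for illustration, and it assumes the nca_oracle sketch above is in scope:

    import numpy as np
    from scipy.optimize import minimize
    from sklearn.base import BaseEstimator, TransformerMixin

    class NCA(BaseEstimator, TransformerMixin):
        # Skeleton only: all the actual work lives in the standalone
        # nca_oracle function, which fit hands to a scipy optimizer.

        def __init__(self, n_components=2, random_state=0):
            self.n_components = n_components
            self.random_state = random_state

        def fit(self, X, y):
            X = np.asarray(X, dtype=float)
            y = np.asarray(y)
            rng = np.random.RandomState(self.random_state)
            L0 = rng.randn(self.n_components, X.shape[1]).ravel() * 0.1
            res = minimize(nca_oracle, L0, args=(X, y, self.n_components),
                           jac=True, method="L-BFGS-B")
            self.components_ = res.x.reshape(self.n_components, X.shape[1])
            return self

        def transform(self, X):
            return np.asarray(X, dtype=float) @ self.components_.T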