Hi Aurélien, thanks for these very good pointers! (Now we also know who else to bug periodically for opinions ;))
Michael

On Fri, May 29, 2015 at 12:05 AM, Aurélien Bellet <aurelien.bel...@telecom-paristech.fr> wrote:

> Hi everyone,
>
> A few additional things to consider for scaling up NCA to large datasets:
>
> - Take a look at the t-SNE implementations (a visualization/dimensionality
>   reduction technique very similar to NCA); I think they have a few
>   speed-up tricks that you could potentially re-use:
>   http://lvdmaaten.github.io/tsne/
>
> - As you said, SGD can help reduce the computational cost. You could also
>   consider recent improvements on SGD, such as SAG/SAGA, SVRG, etc.
>
> - Similarly to what was suggested in previous replies, a general idea is
>   to consider only a neighborhood around each point (either fixed in
>   advance, or updated every now and then during the course of
>   optimization). Since the probabilities decrease very fast with
>   distance, farther points can be safely ignored in the computation.
>   This is explored, for instance, in:
>   http://dl.acm.org/citation.cfm?id=2142432
>
> - Another related idea is to construct class representatives (for
>   instance using k-means) and to model the distribution only with
>   respect to these points instead of the entire dataset. This is
>   especially useful when some classes are very large. An extreme version
>   of this is to reframe NCA for a Nearest Class Mean classifier, where
>   each class is modeled only by its center:
>   https://hal.archives-ouvertes.fr/file/index/docid/722313/filename/mensink12eccv.final.pdf
>
> Hope this helps.
>
> Aurelien
>
> On 5/28/15 11:20 PM, Andreas Mueller wrote:
> >
> > On 05/28/2015 05:11 PM, Michael Eickenberg wrote:
> >> Code-wise, I would attack the problem as a function first. Write a
> >> function that takes X and y (plus maybe some options) and gives back
> >> L. You can put a skeleton of a sklearn estimator around it by calling
> >> this function from fit.
> >> Please keep your code either in a sklearn WIP PR or a public gist, so
> >> it can be reviewed.
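To make the neighborhood-truncation idea above concrete, here is a minimal sketch of the NCA softmax probabilities restricted to each point's k nearest neighbors (function name and signature are illustrative, not from scikit-learn; farther points are simply given probability zero):

```python
import numpy as np

def nca_probabilities_truncated(X, L, k=10):
    """Softmax-over-distances p_ij used in NCA, restricted to the k
    nearest neighbors of each point in the projected space."""
    Z = X @ L.T  # project into the learned metric space
    # pairwise squared Euclidean distances in the projected space
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # p_ii = 0 by definition
    n = X.shape[0]
    P = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]  # keep only the k closest points
        # shift by the row minimum before exponentiating, for stability
        w = np.exp(-(d2[i, nbrs] - d2[i, nbrs].min()))
        P[i, nbrs] = w / w.sum()
    return P
```

Each row then sums to one over at most k entries instead of n-1, which is where the savings come from; in a real implementation the argsort would be replaced by a k-d tree or `argpartition` query.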
> >> Writing benchmarks can be framed as writing examples, i.e. plot_*
> >> functions (maybe Andy or Olivier have a comment on how benchmarks
> >> have been handled in the past?).
> >
> > There is a "benchmarks" folder, which is in horrible shape.
> > Basically there are three ways to do it: examples (with or without a
> > plot, depending on the runtime), a script in the benchmarks folder,
> > or a gist. Often we just use a gist and the PR author posts the
> > output. Not that great for reproducibility, though.
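Michael's "function first, estimator skeleton around it" suggestion might look like the sketch below. The optimization itself is stubbed out with an identity transform; all names here are illustrative, not an actual scikit-learn API proposal:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

def fit_nca(X, y, n_components=None):
    """Stand-in for the real optimization: should learn and return the
    linear transformation L. Here it just returns an identity-like map."""
    n_features = X.shape[1]
    if n_components is None:
        n_components = n_features
    return np.eye(n_components, n_features)

class NCA(BaseEstimator, TransformerMixin):
    def __init__(self, n_components=None):
        self.n_components = n_components

    def fit(self, X, y):
        # all the actual work lives in the plain function above
        self.components_ = fit_nca(X, y, self.n_components)
        return self

    def transform(self, X):
        return X @ self.components_.T
```

Keeping the heavy lifting in a plain `X, y -> L` function makes it easy to benchmark and review independently of the estimator wrapper.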
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general