Hi everyone,

A few additional things to consider for scaling up NCA to large datasets:
- Take a look at the t-SNE implementations (t-SNE is a visualization/dimensionality-reduction technique very similar to NCA); they include a few speed-up tricks that you could potentially reuse: http://lvdmaaten.github.io/tsne/

- Like you said, SGD can help reduce the computational cost. You could also consider recent variance-reduced variants of SGD, such as SAG/SAGA and SVRG (a rough sketch of a stochastic NCA step is in the P.S. at the end of this message).

- Similarly to what was suggested in previous replies, a general idea is to only consider a neighborhood around each point (either fixed in advance, or refreshed now and then during optimization): the probabilities decay very quickly with distance, so farther points can safely be ignored in the computation (second sketch in the P.S.). This is explored, for instance, in: http://dl.acm.org/citation.cfm?id=2142432

- Another related idea is to construct class representatives (for instance with k-means) and to model the distribution only with respect to these points instead of the entire dataset (third sketch in the P.S.). This is especially useful when some classes are very large. An extreme version is to reframe NCA as a Nearest Class Mean classifier, where each class is modeled only by its center: https://hal.archives-ouvertes.fr/file/index/docid/722313/filename/mensink12eccv.final.pdf

Hope this helps.

Aurelien

On 5/28/15 11:20 PM, Andreas Mueller wrote:
>
> On 05/28/2015 05:11 PM, Michael Eickenberg wrote:
>>
>> Code-wise, I would attack the problem as a function first. Write a
>> function that takes X and y (plus maybe some options) and gives back
>> L. You can put a skeleton of a sklearn estimator around it by calling
>> this function from fit.
>> Please keep your code either in a sklearn WIP PR or a public gist, so
>> it can be reviewed. Writing benchmarks can be framed as writing
>> examples, i.e. plot_* functions (maybe Andy or Olivier have a comment
>> on how benchmarks have been handled in the past?).
>>
> There is a "benchmarks" folder, which is in a horrible shape.
> Basically there are three ways to do it: examples (with or without a
> plot, depending on the runtime), a script in the benchmarks folder, or
> a gist. Often we just use a gist and the PR author posts the output.
> Not that great for reproducibility, though.
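P.S. A few rough sketches of the ideas above. First, the stochastic gradient: a minimal per-anchor ascent step on the NCA objective in plain NumPy. The function name, learning rate, and update scheme are just illustrative assumptions; SAG/SAGA and SVRG would wrap this same per-point gradient with a memory of past gradients, which I haven't sketched.

    import numpy as np

    def nca_sgd_step(L, X, y, i, lr=0.01):
        """One stochastic ascent step on p_i, the probability that anchor
        point i is correctly classified (Goldberger et al., 2005)."""
        diffs = X[i] - X                            # (n, d) differences to all points
        dist2 = ((diffs @ L.T) ** 2).sum(axis=1)    # squared distances in the embedding
        dist2[i] = np.inf                           # exclude the anchor itself
        logits = -dist2
        p = np.exp(logits - logits[np.isfinite(logits)].max())  # stable softmax
        p /= p.sum()
        same = (y == y[i])
        same[i] = False
        p_i = p[same].sum()
        # Gradient of p_i w.r.t. L (Goldberger et al.):
        #   2 L (p_i * sum_k p_k d_k d_k^T - sum_{j in class(i)} p_j d_j d_j^T)
        outer_all = diffs.T @ (p[:, None] * diffs)
        outer_same = diffs[same].T @ (p[same][:, None] * diffs[same])
        grad = 2 * L @ (p_i * outer_all - outer_same)
        return L + lr * grad                        # ascent: we maximize p_i

A full epoch would loop over a shuffled permutation of the anchors with a decaying learning rate; averaging the per-anchor gradients over a mini-batch works the same way.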
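Second, the neighborhood truncation: a sketch of restricting the softmax to each point's nearest neighbors in the current embedding (the function name and n_neighbors=50 are my own choices, not from the paper linked above). Since the embedding moves during optimization, the neighbor lists would need to be refreshed every few iterations.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def truncated_nca_probs(X_emb, y, n_neighbors=50):
        """NCA-style probabilities over the k nearest neighbors only;
        contributions from farther points are treated as exactly zero."""
        nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X_emb)
        dist, idx = nn.kneighbors(X_emb)      # column 0 is each point itself
        dist, idx = dist[:, 1:], idx[:, 1:]   # drop the self-matches
        logits = -dist ** 2
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        same = (y[idx] == y[:, None])         # which neighbors share the label
        p_i = (p * same).sum(axis=1)          # truncated version of NCA's p_i
        return p, idx, p_i

With k neighbors instead of all n points, each objective/gradient evaluation touches O(nk) pairs rather than O(n^2), plus the cost of the occasional neighbor search.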
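Third, the class representatives: a sketch of compressing each class with k-means before running the NCA computation against the centroids only (per_class and the helper name are assumptions on my part). With per_class=1 this degenerates to class means, i.e. the Nearest Class Mean setting of the Mensink et al. paper linked above.

    import numpy as np
    from sklearn.cluster import KMeans

    def class_representatives(X, y, per_class=5, random_state=0):
        """Summarize each class by a few k-means centroids; the NCA softmax
        then runs against these representatives instead of all points."""
        centers, labels = [], []
        for c in np.unique(y):
            Xc = X[y == c]
            k = min(per_class, len(Xc))       # small classes keep fewer centroids
            km = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit(Xc)
            centers.append(km.cluster_centers_)
            labels.append(np.full(k, c, dtype=y.dtype))
        return np.vstack(centers), np.concatenate(labels)

The softmax then runs over at most (n_classes * per_class) points per anchor instead of n - 1, which helps most when a few classes dominate the dataset.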