Hi Aurélien,

Thanks for these very good pointers!
(Now we also know who else to bug periodically for opinions ;))

Michael


On Fri, May 29, 2015 at 12:05 AM, Aurélien Bellet <
aurelien.bel...@telecom-paristech.fr> wrote:

> Hi everyone,
>
> A few additional things to consider for scaling-up NCA to large datasets:
>
> - Take a look at t-SNE implementations (t-SNE is a visualization /
> dimensionality reduction technique very similar to NCA); I think they
> include a few speed-up tricks that you could potentially re-use:
> http://lvdmaaten.github.io/tsne/
>
> - Like you said, SGD can help reduce the computational cost - you could
> also consider recent variants of SGD such as SAG/SAGA and SVRG.
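For concreteness, one stochastic ascent step on the NCA objective (sampling a single anchor point per step, as plain SGD would) might look like the NumPy sketch below; the function name, learning rate, and layout are illustrative, not an existing API:

```python
import numpy as np

def nca_sgd_step(L, X, y, i, lr=0.01):
    """One stochastic gradient *ascent* step on the NCA objective,
    using only the term for anchor point i.

    L : (d_out, d) linear transform, X : (n, d) data, y : (n,) labels.
    The per-anchor objective is p_i = sum_{j in class(i)} p_ij, where
    p_ij is the softmax of negative squared distances in L-space.
    """
    diffs = X[i] - X                          # (n, d) vectors x_i - x_k
    dist2 = ((diffs @ L.T) ** 2).sum(axis=1)  # squared projected distances
    dist2[i] = np.inf                         # exclude the point itself
    logits = -dist2
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits)
    p /= p.sum()                              # p[k] = p_ik
    same = (y == y[i])
    same[i] = False
    p_i = p[same].sum()                       # prob. of classifying i correctly
    # Standard NCA gradient of p_i w.r.t. L:
    # 2 L ( p_i * sum_k p_ik d_k d_k^T  -  sum_{j in class(i)} p_ij d_j d_j^T )
    term_all = (diffs * p[:, None]).T @ diffs
    term_same = (diffs[same] * p[same][:, None]).T @ diffs[same]
    grad = 2.0 * L @ (p_i * term_all - term_same)
    return L + lr * grad                      # ascend: we maximize p_i
```

Cycling `i` over a shuffled index permutation gives one epoch; SAG/SAGA/SVRG would additionally store or periodically recompute per-anchor gradients to reduce the variance of these updates.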
>
> - Similarly to what was suggested in previous replies, a general idea is
> to only consider a neighborhood around each point (either fixed in
> advance, or updated every now and then during the course of the
> optimization): since the probabilities decay very quickly with
> distance, farther points can safely be ignored in the computation.
> This is explored for instance in:
> http://dl.acm.org/citation.cfm?id=2142432
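The truncation idea above can be sketched in a few lines of NumPy (illustrative only; a real implementation would use a tree structure for the neighbor search and refresh the neighbor lists every now and then during optimization):

```python
import numpy as np

def truncated_softmax_neighbors(L, X, i, k):
    """Restrict the NCA softmax for anchor i to its k nearest neighbors
    in the projected space; the ignored far-away terms are near zero."""
    proj = X @ L.T                          # project once: (n, d_out)
    d2 = ((proj[i] - proj) ** 2).sum(axis=1)
    d2[i] = np.inf                          # never pick the point itself
    nn = np.argpartition(d2, k)[:k]         # indices of the k nearest points
    logits = -d2[nn]
    logits -= logits.max()                  # numerical stability
    p = np.exp(logits)
    p /= p.sum()                            # renormalize over neighbors only
    return nn, p
```

The gradient computation then only touches these k rows of X instead of all n, which is where the speed-up comes from.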
>
> - Another related idea is to construct class representatives (for
> instance using k-means) and to model the distribution only with respect
> to these points instead of the entire dataset. This is especially
> useful if some classes are very large. An extreme version of this is to
> reframe NCA as a Nearest Class Mean classifier, where each class is
> modeled only by its center:
>
> https://hal.archives-ouvertes.fr/file/index/docid/722313/filename/mensink12eccv.final.pdf
>
> Hope this helps.
>
> Aurelien
>
> On 5/28/15 11:20 PM, Andreas Mueller wrote:
> >
> >
> > On 05/28/2015 05:11 PM, Michael Eickenberg wrote:
> >>
> >> Code-wise, I would attack the problem as a function first. Write a
> >> function that takes X and y (plus maybe some options) and gives back
> >> L. You can put a skeleton of a sklearn estimator around it by calling
> >> this function from fit.
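Michael's function-first layout might be sketched like this (a bare skeleton with a placeholder fitting function; a real scikit-learn PR would inherit from BaseEstimator/TransformerMixin and add input validation):

```python
import numpy as np

def fit_nca(X, y, n_components):
    """Placeholder for the actual optimization: takes X and y (plus
    options) and gives back L. Here it just returns a trivial axis-aligned
    projection; the real NCA solver would go inside this function."""
    n_features = X.shape[1]
    return np.eye(n_components, n_features)

class NCA:
    """Thin estimator skeleton around the fitting function."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X, y):
        self.components_ = fit_nca(np.asarray(X), np.asarray(y),
                                   self.n_components)
        return self

    def transform(self, X):
        return np.asarray(X) @ self.components_.T
```

Keeping the optimization in a standalone function makes it easy to benchmark and review independently of the estimator plumbing.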
> >> Please keep your code either in a sklearn WIP PR or a public gist, so
> >> it can be reviewed. Writing benchmarks can be framed as writing
> >> examples, i.e. plot_* functions (maybe Andy or Olivier have a comment
> >> on how benchmarks have been handled in the past?).
> >>
> > There is a "benchmark" folder, which is in horrible shape.
> > Basically there are three ways to do it: examples (with or without a
> > plot, depending on the runtime), a script in the benchmark folder, or a
> > gist. Often we just use a gist and the PR author posts the output. Not
> > that great for reproducibility, though.
> >
> >
> ------------------------------------------------------------------------------
> > _______________________________________________
> > Scikit-learn-general mailing list
> > Scikit-learn-general@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >
>
>
