Hi everyone,

A few additional things to consider for scaling up NCA to large datasets:

- Take a look at the t-SNE implementations (t-SNE is a 
visualization/dimensionality-reduction technique very similar to NCA); 
I think they have a few speed-up tricks that you could potentially 
reuse:
http://lvdmaaten.github.io/tsne/

- Like you said, SGD can help reduce the computational cost. You could 
also consider recent variance-reduced variants of SGD, such as 
SAG/SAGA, SVRG, etc.

- Similar to what was suggested in previous replies, a general idea is 
to only consider a neighborhood around each point (either fixed in 
advance, or updated every now and then during the course of 
optimization). Since the probabilities decrease very fast with the 
distance, farther points can be safely ignored in the computation. 
This is explored, for instance, in:
http://dl.acm.org/citation.cfm?id=2142432

- Another related idea is to construct class representatives (for 
instance using k-means), and to model the distribution only with 
respect to these points instead of the entire dataset. This is 
especially useful if some classes are very large. An extreme version 
of this is to reframe NCA as a Nearest Class Mean classifier, where 
each class is modeled only by its center:
https://hal.archives-ouvertes.fr/file/index/docid/722313/filename/mensink12eccv.final.pdf
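
To make the SGD suggestion concrete, here is a rough NumPy sketch of 
per-anchor-point stochastic gradient ascent on the NCA objective 
(gradient in the Goldberger et al. form). All function names here are 
made up for this example; a real implementation would need step-size 
tuning, minibatching, convergence checks, etc.

```python
import numpy as np

def nca_sgd_step(A, X, y, i, lr=0.01):
    """One stochastic gradient-ascent step on the NCA objective,
    using a single anchor point i instead of the full sum over all
    points.  A is the (n_components, n_features) linear map."""
    diffs = X[i] - X                          # x_i - x_k, row i is zero
    proj = diffs @ A.T
    d2 = np.einsum('ij,ij->i', proj, proj)    # squared embedded distances
    d2[i] = np.inf                            # exclude the anchor itself
    w = np.exp(-(d2 - d2[np.isfinite(d2)].min()))
    p = w / w.sum()                           # p_ik, softmax over -d2
    same = (y == y[i])
    same[i] = False
    p_i = p[same].sum()                       # prob. of the correct class
    # gradient of f_i = sum_{j in class(i)} p_ij
    coef = p_i * p - np.where(same, p, 0.0)
    W = diffs.T @ (coef[:, None] * diffs)     # sum_k coef_k x_ik x_ik^T
    return A + lr * 2.0 * A @ W               # ascent step

def nca_sgd(X, y, n_components=2, n_iter=100, lr=0.01, seed=0):
    """Plain SGD; SAG/SAGA or SVRG would reuse stored per-point
    gradients to reduce the variance of these steps."""
    rng = np.random.default_rng(seed)
    A = 0.1 * rng.standard_normal((n_components, X.shape[1]))
    for _ in range(n_iter):
        A = nca_sgd_step(A, X, y, rng.integers(len(X)), lr)
    return A
```

Note the gradient vanishes once every anchor's probability mass sits 
on a same-class neighbor, so the steps self-limit as training 
converges.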
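
For the truncated-neighborhood idea, a sketch of what the per-point 
probabilities look like when restricted to the k nearest neighbors of 
each point (names made up; the neighbor search here is brute force for 
clarity, which a real large-scale version would replace with a tree or 
approximate index, refreshed every now and then as the metric changes):

```python
import numpy as np

def truncated_nca_probabilities(X, A, k=10):
    """NCA soft-assignment probabilities p_ij, restricted to the k
    nearest neighbors of each point (neighbors chosen in the input
    space).  Points outside the neighborhood get probability exactly
    zero; since p_ij decays exponentially with distance, this is
    usually a cheap and accurate approximation."""
    n = X.shape[0]
    Z = X @ A.T
    # brute-force neighbor search, O(n^2), for the sketch only
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)              # never a neighbor of self
    nbrs = np.argpartition(d2, k, axis=1)[:, :k]
    P = np.zeros((n, n))
    for i in range(n):
        dz = ((Z[i] - Z[nbrs[i]]) ** 2).sum(-1)
        w = np.exp(-(dz - dz.min()))          # shift for stability
        P[i, nbrs[i]] = w / w.sum()
    return P
```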
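
And the extreme class-representative case: a minimal Nearest Class 
Mean sketch in the spirit of the Mensink et al. paper (function names 
made up for this example). Per-query cost is O(#classes) instead of 
O(n); for the more general variant you would replace each mean with 
several k-means centroids per class.

```python
import numpy as np

def ncm_fit(X, y):
    """One representative per class: the class mean."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, means

def ncm_predict(X, A, classes, means):
    """Classify by the nearest projected class mean, i.e. the argmax
    of p(c | x) proportional to exp(-||A x - A mu_c||^2)."""
    Zx, Zm = X @ A.T, means @ A.T
    d2 = ((Zx[:, None, :] - Zm[None, :, :]) ** 2).sum(-1)
    return classes[np.argmin(d2, axis=1)]
```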

Hope this helps.

Aurelien

On 5/28/15 11:20 PM, Andreas Mueller wrote:
>
>
> On 05/28/2015 05:11 PM, Michael Eickenberg wrote:
>>
>> Code-wise, I would attack the problem as a function first. Write a
>> function that takes X and y (plus maybe some options) and gives back
>> L. You can put a skeleton of a sklearn estimator around it by calling
>> this function from fit.
>> Please keep your code either in a sklearn WIP PR or a public gist, so
>> it can be reviewed. Writing benchmarks can be framed as writing
>> examples, i.e. plot_* functions (maybe Andy or Olivier have a comment
>> on how benchmarks have been handled in the past?).
>>
> There is a "benchmark" folder, which is in horrible shape.
> Basically there are three ways to do it: examples (with or without plot
> depending on the runtime), a script in the benchmark folder, or a gist.
> Often we just use a gist and the PR person posts the output. Not that
> great for reproducibility, though.
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
