So, I created a WIP PR dedicated to NCA:
https://github.com/scikit-learn/scikit-learn/pull/4789

As suggested by Michael, I refactored "the meat" into a function. I also
rewrote it as a first-order oracle (it returns the objective value and its
gradient), so I can (and I do) use scipy's optimizers. I've seen
scipy.optimize.minimize (apparently, with BFGS) sometimes stop at some
weird point (a local minimum / saddle point?), whereas gradient descent
seems to always converge. I haven't tested either of them extensively,
though.

I also fully vectorized the function and gradient calculations; no loops
are involved.
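
For readers of the thread, here is a minimal sketch of what a vectorized
first-order oracle for NCA could look like. It is not the code from the PR:
the names nca_oracle and fit_nca, the flattened parameter layout, the small
random initialization and the choice of method="L-BFGS-B" are all
assumptions made for illustration. It follows the standard NCA objective
(sum over i of p_i, with p_ij a softmax over negative squared distances in
the embedded space) and returns the negated value and gradient, since
scipy minimizes.

```python
import numpy as np
from scipy.optimize import minimize


def nca_oracle(flat_L, X, y, n_components):
    """Return (-f(L), -grad f(L)) for the NCA objective f(L) = sum_i p_i.

    flat_L is L of shape (n_components, n_features), flattened in C order.
    """
    n_features = X.shape[1]
    L = flat_L.reshape(n_components, n_features)
    Z = X @ L.T  # embedded points, shape (n_samples, n_components)

    # Pairwise squared distances in the embedded space.
    sq_norms = (Z ** 2).sum(axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (Z @ Z.T)
    np.fill_diagonal(dist, np.inf)  # enforces p_ii = 0

    # Row-wise softmax of -dist, shifted by the row minimum for stability.
    dist -= dist.min(axis=1, keepdims=True)
    P = np.exp(-dist)
    P /= P.sum(axis=1, keepdims=True)

    same_class = y[:, None] == y[None, :]
    p = (P * same_class).sum(axis=1)  # p_i: prob. that i is correctly classified

    # grad f = 2 L sum_ij W_ij (x_i - x_j)(x_i - x_j)^T
    # with W_ij = p_ij * (p_i - [y_i == y_j]); the double sum is written
    # as X^T (D - S) X, where S = W + W^T and D = diag(row sums of S).
    W = P * (p[:, None] - same_class)
    S = W + W.T
    lap = np.diag(S.sum(axis=1)) - S
    grad = 2.0 * (Z.T @ lap @ X)

    return -p.sum(), -grad.ravel()


def fit_nca(X, y, n_components, seed=0):
    """Hypothetical helper: run a scipy optimizer on the oracle above."""
    rng = np.random.RandomState(seed)
    x0 = 0.1 * rng.randn(n_components * X.shape[1])
    res = minimize(nca_oracle, x0, args=(X, y, n_components),
                   jac=True, method="L-BFGS-B")
    return res.x.reshape(n_components, X.shape[1]), -res.fun
```

Passing jac=True tells scipy.optimize.minimize that the callable returns
the value and the gradient together, which is what makes the oracle form
convenient. Comparing the returned gradient against finite differences is
a cheap sanity check before blaming the optimizer for stopping early.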

So, what's the consensus on benchmarks? I can share ipython notebooks via
gist, for example.

On Fri, May 29, 2015 at 10:51 AM, Michael Eickenberg <
michael.eickenb...@gmail.com> wrote:

> Hi Aurélien,
>
> thanks for these very good pointers!
> (Now we also know who else to bug periodically for opinions ;))
>
> Michael
>
>
> On Fri, May 29, 2015 at 12:05 AM, Aurélien Bellet <
> aurelien.bel...@telecom-paristech.fr> wrote:
>
>> Hi everyone,
>>
>> A few additional things to consider for scaling-up NCA to large datasets:
>>
>> - Take a look at the t-SNE (technique for visualization/dim reduction
>> very similar to NCA) implementations, I think they have a few speed-up
>> tricks that you could potentially re-use:
>> http://lvdmaaten.github.io/tsne/
>>
>> - Like you said, SGD can help reduce the computational cost - you could
>> also consider recent improvements of SGD, such as SAG/SAGA, SVRG, etc.
>>
>> - Similarly to what was suggested in previous replies, a general idea is
>> to only consider a neighborhood around each point (either fixed in
>> advance, or updated every now and then during the course of
>> optimization): the probabilities decrease very fast with distance,
>> so farther points can be safely ignored in the computation.
>> This is explored for instance in:
>> http://dl.acm.org/citation.cfm?id=2142432
>>
>> - Another related idea is to construct class representatives (for
>> instance using k-means), and to model the distribution only wrt these
>> points instead of the entire dataset. This is especially useful if some
>> classes are very large. An extreme version of this is to reframe NCA for
>> a Nearest Class Mean classifier, where each class is only modeled by its
>> center:
>>
>> https://hal.archives-ouvertes.fr/file/index/docid/722313/filename/mensink12eccv.final.pdf
>>
>> Hope this helps.
>>
>> Aurelien
>>
>> On 5/28/15 11:20 PM, Andreas Mueller wrote:
>> >
>> >
>> > On 05/28/2015 05:11 PM, Michael Eickenberg wrote:
>> >>
>> >> Code-wise, I would attack the problem as a function first. Write a
>> >> function that takes X and y (plus maybe some options) and gives back
>> >> L. You can put a skeleton of a sklearn estimator around it by calling
>> >> this function from fit.
>> >> Please keep your code either in a sklearn WIP PR or a public gist, so
>> >> it can be reviewed. Writing benchmarks can be framed as writing
>> >> examples, i.e. plot_* functions (maybe Andy or Olivier have a comment
>> >> on how benchmarks have been handled in the past?).
>> >>
>> > There is a "benchmark" folder, which is in horrible shape.
>> > Basically there are three ways to do it: examples (with or without plot
>> > depending on the runtime), a script in the benchmark folder, or a gist.
>> > Often we just use a gist and the PR person posts the output. Not that
>> > great for reproducibility, though.
>> >
>> >
>> ------------------------------------------------------------------------------
>> > _______________________________________________
>> > Scikit-learn-general mailing list
>> > Scikit-learn-general@lists.sourceforge.net
>> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> >
