On Sun, May 31, 2015 at 7:25 PM, Artem <barmaley....@gmail.com> wrote:
> I added a simple benchmark
> <https://github.com/Barmaley-exe/scikit-learn/blob/metric-learning/benchmarks/bench_nca.py>
> that compares NCA-assisted 1NN with the default one (with Euclidean
> distance) on the Wine dataset (it was one of the datasets reported in the
> NCA paper). See my output here
> <https://gist.github.com/Barmaley-exe/a713f23f74eb53a2f2bd>.
>
> It also compares the semi-vectorized and vectorized implementations:
> surprisingly, the semi-vectorized one is about 2 times faster. I think
> this might be a reason to throw the fully vectorized implementation
> (nca_vectorized_oracle) away.

This is very interesting. I like the way you made the semi-vectorized
version, and this seems to show that you are still benefiting from the
remaining vectorizations you have. However, I am surprised to see such a
difference between the two methods. Do you have any idea why this could be
the case? Where do you think the vectorized version loses all this time?

> On Sat, May 30, 2015 at 12:33 AM, Michael Eickenberg <
> michael.eickenb...@gmail.com> wrote:
>
>> Thanks for the update.
>>
>> > So, what's the consensus on benchmarks? I can share IPython notebooks
>> > via gist, for example.
>>
>> My (weak) preference would be to have a script within the sklearn repo,
>> just to keep stuff in one place for easy future reference.
>>
>> Michael
>>
>> On Fri, May 29, 2015 at 6:24 PM, Artem <barmaley....@gmail.com> wrote:
>>
>>> So, I created a WIP PR dedicated to NCA:
>>> https://github.com/scikit-learn/scikit-learn/pull/4789
>>>
>>> As suggested by Michael, I refactored "the meat" into a function. I
>>> also rewrote it as a first-order oracle, so I can (and do) use scipy's
>>> optimizers. I've seen scipy.optimize.minimize (apparently with BFGS)
>>> sometimes stop at a weird point (a local minimum / saddle point?),
>>> whereas gradient descent seems to always converge. I didn't test
>>> either of them extensively, though.
>>>
>>> I also fully vectorized the function and gradient calculations; no
>>> loops involved.
>>>
>>> So, what's the consensus on benchmarks? I can share IPython notebooks
>>> via gist, for example.
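For concreteness, a minimal sketch of the first-order-oracle pattern Artem
describes: one function returns the NCA objective and its gradient (both
negated, since scipy minimizes), flattened so scipy.optimize.minimize can
consume them directly with jac=True. The name nca_oracle and the random
smoke-test data are illustrative assumptions, not the PR's actual API.

import numpy as np
from scipy.optimize import minimize

def nca_oracle(A_flat, X, y, n_components):
    """Return (-f(A), -grad f(A)) for the NCA objective f."""
    n_samples, n_features = X.shape
    A = A_flat.reshape(n_components, n_features)
    Z = X @ A.T                                  # embedded points
    # pairwise squared distances in the embedded space
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)                 # enforce p_ii = 0
    # row-wise softmax over negative distances, shifted for stability
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    same = y[:, None] == y[None, :]              # same-class mask
    p = (P * same).sum(axis=1)                   # p_i per sample
    f = p.sum()                                  # objective to maximize
    # grad f = 2 A * sum_ij W_ij (x_i - x_j)(x_i - x_j)^T,
    # with W_ij = p_i P_ij - [y_i = y_j] P_ij, via the Laplacian identity
    W = p[:, None] * P - P * same
    W_sym = W + W.T
    lap = np.diag(W_sym.sum(axis=1)) - W_sym
    grad = 2 * A @ (X.T @ lap @ X)
    return -f, -grad.ravel()

# quick smoke test on random data (the Wine data from the benchmark
# would work the same way)
rng = np.random.RandomState(0)
X = rng.randn(60, 5)
y = rng.randint(0, 3, size=60)
res = minimize(nca_oracle, rng.randn(2, 5).ravel(), args=(X, y, 2),
               jac=True, method="L-BFGS-B")
A_learned = res.x.reshape(2, 5)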
>>> On Fri, May 29, 2015 at 10:51 AM, Michael Eickenberg <
>>> michael.eickenb...@gmail.com> wrote:
>>>
>>>> Hi Aurélien,
>>>>
>>>> thanks for these very good pointers!
>>>> (Now we also know who else to bug periodically for opinions ;))
>>>>
>>>> Michael
>>>>
>>>> On Fri, May 29, 2015 at 12:05 AM, Aurélien Bellet <
>>>> aurelien.bel...@telecom-paristech.fr> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> A few additional things to consider for scaling up NCA to large
>>>>> datasets:
>>>>>
>>>>> - Take a look at the t-SNE implementations (a visualization /
>>>>> dimensionality reduction technique very similar to NCA); I think they
>>>>> have a few speed-up tricks that you could potentially re-use:
>>>>> http://lvdmaaten.github.io/tsne/
>>>>>
>>>>> - Like you said, SGD can help reduce the computational cost - you
>>>>> could also consider recent improvements of SGD, such as SAG/SAGA,
>>>>> SVRG, etc.
>>>>>
>>>>> - Similarly to what was suggested in previous replies, a general idea
>>>>> is to only consider a neighborhood around each point (either fixed in
>>>>> advance, or updated every now and then during the course of
>>>>> optimization): since the probabilities decrease very fast with
>>>>> distance, farther points can be safely ignored in the computation
>>>>> (see the sketch below). This is explored for instance in:
>>>>> http://dl.acm.org/citation.cfm?id=2142432
>>>>>
>>>>> - Another related idea is to construct class representatives (for
>>>>> instance using k-means), and to model the distribution only with
>>>>> respect to these points instead of the entire dataset. This is
>>>>> especially useful if some classes are very large. An extreme version
>>>>> of this is to reframe NCA as a Nearest Class Mean classifier, where
>>>>> each class is modeled only by its center:
>>>>> https://hal.archives-ouvertes.fr/file/index/docid/722313/filename/mensink12eccv.final.pdf
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Aurelien
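A rough sketch of the neighborhood idea Aurélien mentions, assuming a fixed
set of k nearest neighbors per point (the value k_neighbors=20 and the
helper name are arbitrary choices; sklearn's NearestNeighbors does the
search). Restricting the softmax to each point's neighbors brings the cost
of computing the p_i down from O(n^2) to roughly O(nk):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def truncated_softmax_weights(Z, y, k_neighbors=20):
    """Approximate p_i using only each embedded point's k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(Z)
    dist, idx = nn.kneighbors(Z)            # first column is the point itself
    d2, idx = dist[:, 1:] ** 2, idx[:, 1:]  # drop self, square the distances
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)       # each row sums to 1 over neighbors
    same = y[idx] == y[:, None]             # does the neighbor share the label?
    return (P * same).sum(axis=1)           # approximate p_i per sample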
>>>>> On 5/28/15 11:20 PM, Andreas Mueller wrote:
>>>>> > On 05/28/2015 05:11 PM, Michael Eickenberg wrote:
>>>>> >>
>>>>> >> Code-wise, I would attack the problem as a function first. Write a
>>>>> >> function that takes X and y (plus maybe some options) and gives
>>>>> >> back L. You can put a skeleton of a sklearn estimator around it by
>>>>> >> calling this function from fit.
>>>>> >> Please keep your code either in a sklearn WIP PR or a public gist,
>>>>> >> so it can be reviewed. Writing benchmarks can be framed as writing
>>>>> >> examples, i.e. plot_* functions (maybe Andy or Olivier have a
>>>>> >> comment on how benchmarks have been handled in the past?).
>>>>> >>
>>>>> > There is a "benchmark" folder, which is in horrible shape.
>>>>> > Basically there are three ways to do it: examples (with or without
>>>>> > a plot, depending on the runtime), a script in the benchmark folder,
>>>>> > or a gist. Often we just use a gist and the PR person posts the
>>>>> > output. Not that great for reproducibility, though.
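A bare-bones version of the structure Michael suggests above: the numerical
work lives in a plain function that takes X and y and returns L, and a thin
estimator skeleton just calls it from fit. The names (_fit_nca, NCA) are
placeholders, not the PR's final API:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

def _fit_nca(X, y, n_components):
    """Stand-in for the real optimization; returns the learned map L."""
    # ... run the first-order oracle through scipy.optimize here ...
    return np.eye(n_components, X.shape[1])  # placeholder map

class NCA(BaseEstimator, TransformerMixin):
    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X, y):
        self.components_ = _fit_nca(X, y, self.n_components)
        return self

    def transform(self, X):
        return X @ self.components_.T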
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general