Apparently, the problem was with multiplying big 4D tensors:

    grad = np.sum(proba[:, :, np.newaxis, np.newaxis] * outer, axis=1).sum(axis=0)

I don't know the details, but the line profiler showed that most of the time was spent here (and on the other similar line), probably due to the temporary copies created by broadcasting. I recalled your words on np.einsum, replaced the sums with it, and got quite a speedup (about 4x)!
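For reference, here is a minimal sketch of the rewrite (shapes and data are made up purely for illustration; the real proba and outer arrays come from the NCA oracle):

    import numpy as np

    n, d = 50, 5
    rng = np.random.RandomState(0)
    proba = rng.rand(n, n)          # pairwise probabilities p_ij, shape (n, n)
    outer = rng.rand(n, n, d, d)    # outer products (x_i - x_j)(x_i - x_j)^T, shape (n, n, d, d)

    # old version: broadcasting materializes a full (n, n, d, d) temporary array
    grad_old = np.sum(proba[:, :, np.newaxis, np.newaxis] * outer, axis=1).sum(axis=0)

    # einsum version: the same contraction, without the large intermediate copy
    grad_new = np.einsum('ij,ijkl->kl', proba, outer)

    assert np.allclose(grad_old, grad_new)

Both compute sum_ij p_ij * outer_ij, i.e. the (d, d) gradient matrix.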
New output log: <https://gist.github.com/Barmaley-exe/d5ca99d7b94fb665ad03>. The fully vectorized version now takes ~2s per iteration and is faster than the semivectorized one.

P.S. I made the same optimization in the semivectorized version, but its speedup isn't as significant.

On Sun, May 31, 2015 at 9:29 PM, Michael Eickenberg <michael.eickenb...@gmail.com> wrote:

> On Sun, May 31, 2015 at 7:25 PM, Artem <barmaley....@gmail.com> wrote:
>
>> I added a simple benchmark <https://github.com/Barmaley-exe/scikit-learn/blob/metric-learning/benchmarks/bench_nca.py> that compares NCA-assisted 1NN with the default one (Euclidean distance) on the Wine dataset (one of the datasets reported in the NCA paper). See my output here <https://gist.github.com/Barmaley-exe/a713f23f74eb53a2f2bd>.
>>
>> It also compares the semivectorized and vectorized implementations: surprisingly, the semivectorized one is about 2 times faster. I think this might be a reason to throw the fully vectorized (nca_vectorized_oracle) implementation away.
>
> This is very interesting. I like the way you made the semi-vectorized version, and this seems to show that you are still benefitting from the remaining vectorizations you have.
>
> However, I am surprised to see such a difference between the two methods. Do you have any idea why this could be the case? Where do you think the vectorized version loses all this time?
>
>> On Sat, May 30, 2015 at 12:33 AM, Michael Eickenberg <michael.eickenb...@gmail.com> wrote:
>>
>>> Thanks for the update.
>>>
>>> > So, what's the consensus on benchmarks? I can share ipython notebooks via gist, for example.
>>>
>>> My (weak) preference would be to have a script within the sklearn repo, just to keep stuff in one place for easy future reference.
>>>
>>> Michael
>>>
>>> On Fri, May 29, 2015 at 6:24 PM, Artem <barmaley....@gmail.com> wrote:
>>>
>>>> So, I created a WIP PR dedicated to NCA: https://github.com/scikit-learn/scikit-learn/pull/4789
>>>>
>>>> As suggested by Michael, I refactored "the meat" into a function. I also rewrote it as a first-order oracle, so I can (and do) use scipy's optimizers. I've seen scipy.optimize.minimize (apparently with BFGS) sometimes stop at some weird point (a local minimum / saddle point?), whereas gradient descent seems to always converge. I didn't test either of them extensively, though.
>>>>
>>>> I also fully vectorized the function and gradient calculations, with no loops involved.
>>>>
>>>> So, what's the consensus on benchmarks? I can share ipython notebooks via gist, for example.
>>>>
>>>> On Fri, May 29, 2015 at 10:51 AM, Michael Eickenberg <michael.eickenb...@gmail.com> wrote:
>>>>
>>>>> Hi Aurélien,
>>>>>
>>>>> thanks for these very good pointers!
>>>>> (Now we also know who else to bug periodically for opinions ;))
>>>>>
>>>>> Michael
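[Side note on the first-order oracle mentioned above: the calling convention I mean is roughly the following. This is a hypothetical stub, not the actual code from the PR; the real objective and gradient computation live there, and the names nca_oracle / n_components are made up for illustration:

    import numpy as np
    from scipy.optimize import minimize

    def nca_oracle(L_flat, X, y, n_components):
        # Stub oracle: returns (loss, gradient) for a flattened
        # transformation matrix L; the real NCA objective and its
        # gradient are just placeholders here.
        L = L_flat.reshape(n_components, X.shape[1])
        loss = 0.0
        grad = np.zeros_like(L)
        return loss, grad.ravel()

    # minimize() accepts a callable returning (value, gradient) when jac=True
    X = np.random.rand(20, 4)
    y = np.random.randint(0, 2, size=20)
    L0 = np.random.rand(2, 4).ravel()
    res = minimize(nca_oracle, L0, args=(X, y, 2), jac=True, method='BFGS')
    L_opt = res.x.reshape(2, 4)

Plain gradient descent just uses the same (loss, grad) pair in a hand-written update loop.]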
>>>>> On Fri, May 29, 2015 at 12:05 AM, Aurélien Bellet <aurelien.bel...@telecom-paristech.fr> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> A few additional things to consider for scaling up NCA to large datasets:
>>>>>>
>>>>>> - Take a look at the t-SNE implementations (a visualization / dimensionality reduction technique very similar to NCA); I think they have a few speed-up tricks that you could potentially re-use: http://lvdmaaten.github.io/tsne/
>>>>>>
>>>>>> - Like you said, SGD can help reduce the computational cost - you could also consider recent improvements of SGD, such as SAG/SAGA, SVRG, etc.
>>>>>>
>>>>>> - Similarly to what was suggested in previous replies, a general idea is to only consider a neighborhood around each point (either fixed in advance, or updated every now and then during the course of optimization): since the probabilities decrease very quickly with distance, farther points can safely be ignored in the computation. This is explored, for instance, in: http://dl.acm.org/citation.cfm?id=2142432
>>>>>>
>>>>>> - Another related idea is to construct class representatives (for instance using k-means) and to model the distribution only with respect to these points instead of the entire dataset. This is especially useful if some classes are very large. An extreme version of this is to reframe NCA for a Nearest Class Mean classifier, where each class is modeled only by its center: https://hal.archives-ouvertes.fr/file/index/docid/722313/filename/mensink12eccv.final.pdf
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>> Aurelien
>>>>>>
>>>>>> On 5/28/15 11:20 PM, Andreas Mueller wrote:
>>>>>> >
>>>>>> > On 05/28/2015 05:11 PM, Michael Eickenberg wrote:
>>>>>> >>
>>>>>> >> Code-wise, I would attack the problem as a function first. Write a function that takes X and y (plus maybe some options) and gives back L. You can put a skeleton of a sklearn estimator around it by calling this function from fit.
>>>>>> >> Please keep your code either in a sklearn WIP PR or a public gist, so it can be reviewed. Writing benchmarks can be framed as writing examples, i.e. plot_* functions (maybe Andy or Olivier have a comment on how benchmarks have been handled in the past?).
>>>>>> >>
>>>>>> > There is a "benchmark" folder, which is in horrible shape.
>>>>>> > Basically there are three ways to do it: examples (with or without a plot, depending on the runtime), a script in the benchmark folder, or a gist. Often we just use a gist and the PR person posts the output. Not that great for reproducibility, though.
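[Re Aurélien's neighborhood idea above: a rough, purely illustrative sketch of what truncating the p_ij computation to each point's k nearest neighbors could look like. This is not code from the PR, and the function/variable names are made up:

    import numpy as np

    def truncated_softmax_probs(X_embedded, k=10):
        # Compute NCA's softmax probabilities p_ij only over each point's
        # k nearest neighbors; farther points get p_ij = 0, since their
        # contribution is negligible anyway.
        n = X_embedded.shape[0]
        diff = X_embedded[:, None, :] - X_embedded[None, :, :]
        d2 = np.square(diff).sum(axis=-1)
        np.fill_diagonal(d2, np.inf)               # NCA defines p_ii = 0
        neighbors = np.argsort(d2, axis=1)[:, :k]  # k closest points per row
        proba = np.zeros((n, n))
        for i in range(n):
            w = np.exp(-d2[i, neighbors[i]])
            proba[i, neighbors[i]] = w / w.sum()
        return proba

A real implementation would of course find the neighbors with a KD-tree / ball tree instead of the full pairwise distance matrix, which is the whole point of the truncation.]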
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general