Hi Artem.
Thanks for sharing the post.
For 1.: Always start without Cython. Then do profiling to identify the bottlenecks. Then we can talk about whether the bottleneck can be helped with Cython. For this kind of algorithm, I'd expect the bottleneck to be in some matrix multiplication (though I don't know the actual details), so Cython is unlikely to help.
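
For illustration, the profile-first workflow could look roughly like this (the function below is just a stand-in for whatever the hot spot in your prototype turns out to be, not your actual code):

import cProfile
import pstats

import numpy as np

def candidate_bottleneck(X, A):
    # stand-in: pairwise squared distances in the transformed space
    Z = np.dot(X, A.T)
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(axis=-1)

X = np.random.randn(500, 10)
A = np.random.randn(10, 10)

cProfile.run('candidate_bottleneck(X, A)', 'nca.prof')
pstats.Stats('nca.prof').sort_stats('cumulative').print_stats(10)

If the top entries are numpy calls like dot, Cython won't buy you much; if they are Python-level loops, that's where Cython (or more vectorization) can pay off.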

For the rest, I have to look into it more.

Cheers,
Andy



On 05/28/2015 06:25 AM, Artem wrote:
Hello everyone

I finally published my API blogpost <http://barmaley-exe.blogspot.ru/2015/05/api-design.html>. It mostly summarizes ideas discussed during the project proposal stage.

For now I'd like to discuss the NCA implementation. I already wrote <http://barmaley-exe.blogspot.ru/2015/05/nca.html> a small prototype that works, but there are issues I want to talk about:

 1. Efficiency. Should I aim for a Cython implementation, or is having
    everything vectorized fine? In general, some rule of thumb for when
    to use Cython would be nice.
 2. Memory issues. If we cache the outer products, there might not be
    enough memory. What's your opinion on my idea (see the blogpost) of
    a cache of fixed size? There's a cache_size parameter for SVC, but I
    haven't looked into that yet. Is it of any help for this problem?
 3. Optimization. At the moment I use plain batch gradient ascent with a
    fixed learning rate.

  * I think it'd be better to use something more advanced like Momentum /
    AdaGrad (line search seems too impractical).
  * Another point to consider is stochastic (minibatch?) gradient ascent.
  * What do you think about "cell-wise" optimization (fixing all cells
    of the matrix but one)?
  * Or maybe it's enough to use scipy's conjugate gradients optimizer?
    (A rough sketch of the vectorized gradient and the scipy option
    follows below.)
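
To make this concrete, here's a rough, untested sketch of how I imagine the fully vectorized objective and gradient (points 1 and 2) could look without caching all n^2 outer products. The names are mine, not from the prototype; the gradient follows the formula in Goldberger et al.'s NCA paper:

import numpy as np

def nca_objective_grad(A, X, y):
    """Return (objective, gradient) for NCA; we maximize the objective."""
    n = X.shape[0]
    Z = np.dot(X, A.T)                           # transformed points
    D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(D, np.inf)                  # exclude self-matches
    P = np.exp(-D)
    P /= P.sum(axis=1, keepdims=True)            # row-softmax: p_ij
    same = (y[:, None] == y[None, :]) & ~np.eye(n, dtype=bool)
    p = (P * same).sum(axis=1)                   # p_i, per-point soft accuracy
    W = P * p[:, None] - P * same                # pairwise weights
    diff = X[:, None, :] - X[None, :, :]         # x_i - x_j, shape (n, n, d)
    # sum_ij W_ij (x_i - x_j)(x_i - x_j)^T via einsum: diff is n^2 * d,
    # still sizeable, but far smaller than an n^2 * d * d outer-product cache
    G = np.einsum('ij,ijk,ijl->kl', W, diff, diff)
    return p.sum(), 2 * np.dot(A, G)

And for the last bullet, plugging it into scipy's CG optimizer (scipy minimizes, hence the sign flips):

from scipy.optimize import minimize

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                            # toy data
y = rng.randint(0, 3, 100)
d = X.shape[1]

def neg_nca(a_flat):
    f, g = nca_objective_grad(a_flat.reshape(-1, d), X, y)
    return -f, -g.ravel()

res = minimize(neg_nca, np.eye(d).ravel(), jac=True, method='CG')
A_opt = res.x.reshape(-1, d)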


On Mon, May 4, 2015 at 2:02 PM, Michael Eickenberg <michael.eickenb...@gmail.com> wrote:

    Dear Artem,

    congratulations on the acceptance of your GSoC proposal! I am
    certain there will be a very interesting summer ahead of us. Kyle
    and I are excited to be mentors and will do our best to provide
    all the guidance necessary for your project to succeed. It is very
    rich and will be a great addition to the codebase.

    Your blog post
    <http://barmaley-exe.blogspot.ru/2015/05/introduction.html> on the
    gists of the methods is written in a very understandable way and
    provides a good overview of the topics you are going to address in
    depth. It shows that you have the right intuitions, and are ready
    to delve into the intricacies of the methods [1]. Take advantage
    of the next weeks to do so! Let's make sure we hit the ground
    running at the end of this warm-up phase.

    As for your next plans, sketching the algorithms in very high-level
    pseudo-code is of course an excellent idea and could be the next
    blog post.
    After this, you can zoom in on the details of how each pseudo-code
    step can be implemented. If you get the level of detail right, I
    recommend the Python language to describe your algorithms ;) --
    what I mean is that getting a minimal version of the algorithm to
    work, just as a function, not a sklearn estimator, is a valuable
    baseline to have, and it usually deepens the understanding as well.
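
    To give a purely illustrative idea of the scale I have in mind --
    assuming some hypothetical objective/gradient helper, say
    objective_grad(A, X, y) returning (value, gradient), such a baseline
    can fit in a dozen lines:

    import numpy as np

    def nca_fit(X, y, objective_grad, n_iter=100, learning_rate=0.01):
        # plain gradient ascent on the NCA objective,
        # no estimator machinery at all
        A = np.eye(X.shape[1])          # start from the identity transform
        for _ in range(n_iter):
            _, grad = objective_grad(A, X, y)
            A += learning_rate * grad   # ascent: we maximize the objective
        return A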

    As for the API questions, it is of course quite essential to
    remain conscious at all times of the issues that have been
    identified in prior discussion and to think of ways to add a
    metric learning module without succumbing to excessive feature
    creep. My hunch is that given some working minimal versions of the
    algorithms, we can perhaps crystallize out what is absolutely
    necessary in terms of additions, so I would prefer that order of
    priorities. There is also some work to be done in identifying
    other parts of scikit-learn that already deal with
    (dis-)similarity-type data (cf. e.g. the kernels defined in the PR
    for Gaussian processes) and see how these can be made to work in a
    consistent way.

    A crucial aspect that we need to figure out is "what is a good
    outcome?": Of course we would like to have some PRs merged at the
    end of summer, yes. But what makes a concrete implementation good?
    Speed (with respect to what)? Readability? Maintainability (yes
    please!)? Elegance (what does that even mean?)?

    It may be helpful if you could devise a more fine-grained timeline
    <https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Metric-Learning-module#timeline>
    for the community bonding period than what is currently stated on
    the wiki page. How about blogging your progress in understanding?
    Writing things down for others to understand is a very good way of
    identifying any doubts you may have on particular aspects. A mini
    blog-entry at the end of each week simply recounting what has been
    done and also what has been discussed will also yield an effective
    overview of ongoing topics.

    In the meantime, please don't hesitate to bombard us, the other
    mentors, and the list with any questions you may have.

    Michael

    [1] http://i.imgur.com/at3vIHh.png



