I recently read about the approximation and I think it would be a great addition. Do you think it makes sense to include it via an ``algorithm`` parameter to ``TSNE``? I totally agree with what Kyle said about demonstrating speedups and approximation accuracy (if possible).

I haven't used Cython.parallel before. Does it degrade gracefully if OpenMP is not available?
If so, we should really be using it much more.



On 12/24/2014 02:28 PM, Kyle Kastner wrote:
Sounds like an excellent improvement for usability!

If you could benchmark the time spent and show that it is a noticeable improvement, that will be crucial. Also important is showing how far the approximation deviates from exact t-SNE - though there comes a point where you can't really compare, because vanilla t-SNE simply never finishes! If it is a huge speedup for a small approximation cost, that is exactly the kind of engineering tradeoff that has been made in the past for other things like randomized SVD, etc.
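To make the scaling argument concrete, here is a toy operation-count sketch (not a real benchmark - constants, the theta accuracy knob, and implementation details all matter) contrasting the O(N^2) pairwise gradient with an ~O(N log N) tree-based proxy. The function names are illustrative, not from the PR:

```python
import math

def exact_interactions(n):
    """Pairwise terms the exact t-SNE gradient must touch."""
    return n * (n - 1) // 2

def bh_interactions(n):
    """Rough O(N log N) proxy for tree-cell visits in Barnes-Hut."""
    return n * max(1, math.ceil(math.log2(n)))

# Illustrative only: real speedups depend on theta, dimensionality,
# and constant factors, but the asymptotic gap is the point.
for n in (1_000, 10_000, 100_000, 1_000_000):
    ratio = exact_interactions(n) / bh_interactions(n)
    print(f"N={n:>9,}  exact/approx interaction ratio ~ {ratio:,.0f}x")
```

At a million points the exact pairwise count is several orders of magnitude larger, which is why "vanilla t-SNE just never finishes" at that scale.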

Making approximation an option of the current solver seems OK to me, that way people can always choose which version to use if exact methods are required.

I would start the PR fairly early, once you have a working version that others can try along with a minimal test suite. There will probably be lots of tweaks for consistency, etc. but that is easy to do once the core is there and the tests cover it. I know lots of great people helped me with this in the past.

Looking forward to checking it out!

On Wed, Dec 24, 2014 at 2:13 PM, Christopher Moody <chrisemo...@gmail.com <mailto:chrisemo...@gmail.com>> wrote:

    Hi folks,
    Nick Travers and I have been working steadily
    <https://github.com/cemoody/scikit-learn/tree/cemoody/bhtsne/sklearn/manifold>
    on a Barnes-Hut approximation of the t-SNE algorithm currently
    implemented as a manifold learning technique. This version makes
    the gradient calculation much faster, changing the computational
    time from O(N^2) to O(N log N). This effectively extends t-SNE
    from being usable on thousands of examples to millions of examples
    while only losing a little bit of precision. t-SNE was first
    published in 2008 <http://lvdmaaten.github.io/tsne/>, and the
    Barnes-Hut approximation was introduced
    <http://arxiv.org/abs/1301.3342> only two years ago and published
    in JMLR just this year [pdf
    <http://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf>],
    making it relatively new.
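    The core trick can be sketched in a few dozen lines of pure Python
    (the actual PR is Cython): points go into a quadtree, and any cell
    far enough from a query point (roughly, size/distance < theta) is
    summarized by its centre of mass. The class and function names
    below are illustrative, not the PR's API, and only the unnormalized
    repulsive term of the gradient is shown:

```python
import random

class Cell:
    """Square quadtree cell tracking a running centre of mass."""
    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size  # lower-left corner, side
        self.n = 0                              # points in this subtree
        self.cx = self.cy = 0.0                 # centre of mass
        self.children = None
        self.point = None                       # payload when n == 1

    def insert(self, px, py):
        # Update the running centre of mass, then push the point down.
        self.cx = (self.cx * self.n + px) / (self.n + 1)
        self.cy = (self.cy * self.n + py) / (self.n + 1)
        self.n += 1
        if self.n == 1:
            self.point = (px, py)
            return
        if self.children is None:
            h = self.size / 2
            self.children = [Cell(self.x + dx * h, self.y + dy * h, h)
                             for dx in (0, 1) for dy in (0, 1)]
            self._child_for(*self.point).insert(*self.point)
            self.point = None
        self._child_for(px, py).insert(px, py)

    def _child_for(self, px, py):
        h = self.size / 2
        return self.children[(2 if px >= self.x + h else 0)
                             + (1 if py >= self.y + h else 0)]

def repulsive_force(cell, px, py, theta=0.5):
    """Approximate the (unnormalized) t-SNE repulsion at (px, py).

    Cells satisfying size < theta * distance are treated as a single
    point of mass n at their centre of mass; theta=0 recovers the
    exact per-point sum.
    """
    if cell.n == 0:
        return 0.0, 0.0
    dx, dy = px - cell.cx, py - cell.cy
    d2 = dx * dx + dy * dy
    if cell.n == 1 and d2 == 0.0:
        return 0.0, 0.0                     # the query point itself
    if cell.children is None or cell.size * cell.size < theta * theta * d2:
        q = 1.0 / (1.0 + d2)                # Student-t kernel, as in t-SNE
        return cell.n * q * q * dx, cell.n * q * q * dy
    fx = fy = 0.0
    for child in cell.children:
        cfx, cfy = repulsive_force(child, px, py, theta)
        fx, fy = fx + cfx, fy + cfy
    return fx, fy

# Tiny demo on random points in the unit square (distinct w.p. 1).
random.seed(0)
points = [(random.random(), random.random()) for _ in range(500)]
root = Cell(0.0, 0.0, 1.0)
for p in points:
    root.insert(*p)
fx, fy = repulsive_force(root, *points[0], theta=0.5)
print("approximate repulsion on point 0:", fx, fy)
```

    Each query then visits O(log N) cells instead of all N points,
    which is where the O(N^2) -> O(N log N) drop comes from; theta
    controls the speed/precision tradeoff.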

    Ahead of submitting a PR, I wanted to start the process for
    thinking about its inclusion (or exclusion!). It's a performance
    improvement to an established and widely-used algorithm but it's
    also quite new.

    - What do you all think?
    - Is this too new to be incorporated, or does its usefulness merit
    inclusion?
    - We've read the Contributing
    <http://scikit-learn.org/stable/developers/> docs -- what else can
    Nick & I do to make the PR as easy as we can for reviewers? And
    looking forward, what can we do to minimize code rot and ease the
    maintenance burden?

    Many thanks & looking forward to being part of the scikit-learn community!
    chris & nick

    
------------------------------------------------------------------------------
    Dive into the World of Parallel Programming! The Go Parallel Website,
    sponsored by Intel and developed in partnership with Slashdot
    Media, is your
    hub for all things parallel software development, from weekly thought
    leadership blogs to news, videos, case studies, tutorials and
    more. Take a
    look and join the conversation now. http://goparallel.sourceforge.net
    _______________________________________________
    Scikit-learn-general mailing list
    Scikit-learn-general@lists.sourceforge.net
    <mailto:Scikit-learn-general@lists.sourceforge.net>
    https://lists.sourceforge.net/lists/listinfo/scikit-learn-general




