Sounds like an excellent improvement for usability!
If you can benchmark the time spent and show a noticeable speedup, that
will be crucial. Showing how much accuracy the approximation gives up
relative to base t-SNE will also be important - though there comes a
point where you can't really compare, because vanilla t-SNE just never
finishes! If it is a huge speedup for a small approximation cost, that
is exactly the kind of engineering tradeoff that has been accepted in
the past for methods like randomized SVD.
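As a sketch of what that benchmark could look like (the `method`
argument below is hypothetical -- it stands in for whatever switch the
branch exposes to select the exact vs. Barnes-Hut gradient):

    import time
    import numpy as np
    from scipy.spatial.distance import pdist
    from sklearn.datasets import make_blobs
    from sklearn.manifold import TSNE

    X, _ = make_blobs(n_samples=2000, n_features=50, centers=10,
                      random_state=0)

    embeddings = {}
    for method in ("exact", "barnes_hut"):  # hypothetical parameter
        t0 = time.time()
        embeddings[method] = TSNE(n_components=2, method=method,
                                  random_state=0).fit_transform(X)
        print("%-10s %6.1f s" % (method, time.time() - t0))

    # Embeddings are only defined up to rotation/reflection, so compare
    # the geometry they induce (pairwise distances), not coordinates.
    r = np.corrcoef(pdist(embeddings["exact"]),
                    pdist(embeddings["barnes_hut"]))[0, 1]
    print("pairwise-distance correlation: %.4f" % r)

Scaling n_samples up from there would also show the point where the
exact solver stops being usable at all.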
Making the approximation an option of the current solver seems OK to
me; that way people can still choose the exact version when it is
required.
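Concretely, the option could look something like this (the names here
are guesses, not the branch's actual API):

    from sklearn.manifold import TSNE

    # Keep the exact solver as the default so existing code is
    # unaffected, and let users opt in to the approximation:
    tsne_exact = TSNE(method="exact")       # current O(N^2) behaviour
    tsne_fast  = TSNE(method="barnes_hut")  # proposed O(N log N) solver

    # A theta-style knob (called `angle` here, again hypothetically)
    # would then expose the speed/accuracy tradeoff directly:
    tsne_coarse = TSNE(method="barnes_hut", angle=0.8)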
I would start the PR fairly early, once you have a working version that
others can try, along with a minimal test suite. There will probably be
lots of tweaks for consistency, etc., but that is easy to do once the
core is there and the tests cover it. I know lots of great people
helped me with this in the past.
Looking forward to checking it out!
On Wed, Dec 24, 2014 at 2:13 PM, Christopher Moody <chrisemo...@gmail.com>
wrote:
> Hi folks,
> Nick Travers and I have been working steadily
> <https://github.com/cemoody/scikit-learn/tree/cemoody/bhtsne/sklearn/manifold>
> on a Barnes-Hut approximation of the t-SNE algorithm currently implemented
> as a manifold learning technique. This version makes the gradient
> calculation much faster, reducing the computational cost from O(N^2) to
> O(N log N). That effectively extends t-SNE from being usable on thousands
> of examples to millions of examples, while losing only a little precision.
> t-SNE was first published in 2008
> <http://lvdmaaten.github.io/tsne/> and the Barnes-Hut approximation was
> introduced <http://arxiv.org/abs/1301.3342> only two years ago, but it
> was only accepted for publication this year [pdf
> <http://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf>], making it
> relatively new.
>
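(For anyone new to the trick, the core idea fits in a few lines:
faraway groups of points are summarised by their centre of mass, so
each point interacts with roughly O(log N) cells instead of all N-1
others. The sketch below is purely illustrative -- a median split
standing in for the quadtree, and only the unnormalised repulsive term
of the gradient -- not the branch's actual implementation:

    import numpy as np

    THETA = 0.5  # accuracy knob: 0 recovers the exact O(N^2) sum

    class Cell(object):
        def __init__(self, points):
            self.center = points.mean(axis=0)
            self.size = np.ptp(points, axis=0).max()  # cell extent
            self.n = len(points)
            self.children = []
            if self.n > 1:
                # Median split on the widest axis: a kd-tree-style
                # stand-in for the quadtree in the real algorithm.
                axis = np.ptp(points, axis=0).argmax()
                order = points[:, axis].argsort()
                half = self.n // 2
                self.children = [Cell(points[order[:half]]),
                                 Cell(points[order[half:]])]

    def repulsive_force(p, cell):
        """Approximate sum_j (1 + d_j^2)^-2 * (p - y_j) over all j."""
        d = p - cell.center
        dist = np.sqrt((d * d).sum()) + 1e-12
        if cell.n == 1 or cell.size / dist < THETA:
            # Cell is small/far enough: treat its n points as one
            # body at the centre of mass (the O(log N) shortcut).
            w = 1.0 / (1.0 + dist ** 2)
            return cell.n * w * w * d
        return sum(repulsive_force(p, c) for c in cell.children)

    Y = np.random.RandomState(0).randn(10000, 2)  # a 2-D embedding
    tree = Cell(Y)
    f = repulsive_force(Y[0], tree)  # touches only ~O(log N) cells

The real algorithm accumulates the normalisation term from the same
tree traversal and keeps the attractive forces exact over a sparse
nearest-neighbour graph.)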
> Ahead of submitting a PR, I wanted to start the discussion about its
> inclusion (or exclusion!). It's a performance improvement to an
> established and widely used algorithm, but it's also quite new.
>
> - What do you all think?
> - Is this too new to be incorporated, or does its usefulness merit
> inclusion?
> - We've read the Contributing docs
> <http://scikit-learn.org/stable/developers/>
> -- what else can Nick & I do to make the PR as easy as possible for
> reviewers? And looking forward, what can we do to minimize code rot and
> ease the maintenance burden?
>
> Many thanks & looking forward to being part of the scikit-learn community!
> chris & nick