Hi Chris,

That's great! I have always wanted to implement Barnes-Hut-SNE but never 
found the time. It has already been mentioned in a comment in the code 
(https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/manifold/t_sne.py#L4) 
and was discussed in the original PR 
(https://github.com/scikit-learn/scikit-learn/pull/2822, see ogrisel's 
comments and mine). So I don't think it would be difficult to justify 
that we need this.

Computing the embedding of the first 10,000 MNIST images takes 
approximately 1 hour on my computer. Embedding all 70,000 images with 
Barnes-Hut t-SNE should take less than 15 minutes (see 
http://arxiv.org/pdf/1301.3342v2.pdf, p. 7). It would be great if one 
of our examples could generate an image similar to the one on page 6 
of that paper, but we cannot expect users to wait 10,000 seconds for a 
plot. Limiting the example to 3,000-5,000 images should be sufficient 
to show that Barnes-Hut-SNE is much faster and still accurate enough.
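
Roughly what I have in mind for the example is sketched below. The 
``algorithm='barnes_hut'`` switch is only a placeholder for whatever 
we end up calling the new option, so treat this as a sketch rather 
than a working script:

    # Sketch of the proposed MNIST example; ``algorithm`` is a placeholder
    # name for the not-yet-merged Barnes-Hut switch.
    import matplotlib.pyplot as plt
    from sklearn.datasets import fetch_mldata
    from sklearn.manifold import TSNE
    from sklearn.utils import shuffle

    mnist = fetch_mldata('MNIST original')
    X, y = shuffle(mnist.data, mnist.target, random_state=0)
    X, y = X[:5000] / 255.0, y[:5000]  # 3,000-5,000 images keep it fast

    X_embedded = TSNE(n_components=2, algorithm='barnes_hut',
                      random_state=0).fit_transform(X)

    plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, s=5,
                cmap=plt.cm.Spectral)
    plt.colorbar()
    plt.title("Barnes-Hut t-SNE on 5,000 MNIST digits")
    plt.show()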

Best regards,

Alexander

On 2014-12-24 20:59, Andy wrote:
> I recently read about the approximation and I think it would be a
> great addition.
>  Do you think it makes sense to include it via an ``algorithm``
> parameter to TSNE?
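
[To make the suggestion concrete, the switch could look roughly like 
this -- the parameter name, its values and the default are entirely 
open:]

    # Hypothetical API sketch; ``algorithm`` and its values are placeholders.
    from sklearn.manifold import TSNE

    tsne_exact = TSNE(n_components=2, algorithm='exact')        # current O(N^2) gradient
    tsne_bh = TSNE(n_components=2, algorithm='barnes_hut')      # proposed O(N log N) gradient
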
>  I totally agree with what Kyle said about demonstrating speedups
> and approximation accuracy (if possible).
> 
>  I haven't used Cython.parallel before. Does it degrade gracefully if
> OpenMP is not available?
>  If so, we should really be using it much more.
> 
>  On 12/24/2014 02:28 PM, Kyle Kastner wrote:
> 
>> Sounds like an excellent improvement for usability!
>> 
>> Benchmarking the time spent, and showing that the speedup is
>> noticeable, will be crucial. Showing how much accuracy the
>> approximation loses compared to base t-SNE will also be important -
>> though there comes a point where you can't really compare, because
>> vanilla t-SNE just never finishes! If it is a huge improvement for a
>> small approximation cost, that is exactly the kind of engineering
>> tradeoff that has been made in the past for other things like
>> randomized SVD.
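
[A small timing script along these lines could accompany the PR; 
``algorithm='barnes_hut'`` is again a placeholder and the sizes are 
only suggestions:]

    # Rough benchmark sketch comparing the exact and the proposed
    # Barnes-Hut gradient on the same random data.
    import time
    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.RandomState(0)
    X = rng.randn(2000, 50)  # small enough for the exact O(N^2) version

    for algorithm in ('exact', 'barnes_hut'):
        t0 = time.time()
        TSNE(n_components=2, algorithm=algorithm,
             random_state=0).fit_transform(X)
        print("%-11s %.1f s" % (algorithm, time.time() - t0))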
>> 
>> Making the approximation an option of the current solver seems OK to
>> me; that way people can still choose the exact version when it is
>> required.
>> 
>> I would start the PR fairly early, once you have a working version
>> that others can try along with a minimal test suite. There will
>> probably be lots of tweaks for consistency, etc., but that is easy to
>> do once the core is there and the tests cover it. I know lots of
>> great people helped me with this in the past.
>> 
>> Looking forward to checking it out!
>> 
>> On Wed, Dec 24, 2014 at 2:13 PM, Christopher Moody
>> <chrisemo...@gmail.com> wrote:
>> 
>>> Hi folks,
>>> Nick Travers and I have been working steadily [1] on a Barnes-Hut
>>> approximation of the t-SNE algorithm that is currently implemented
>>> as a manifold learning technique. This version makes the gradient
>>> calculation much faster, reducing the computational complexity from
>>> O(N^2) to O(N log N). This effectively extends t-SNE from being
>>> usable on thousands of examples to millions of examples while only
>>> losing a little bit of precision. t-SNE was first published in
>>> 2008 [2] and the Barnes-Hut approximation was introduced [3] only
>>> two years ago, but it was only accepted for journal publication
>>> this year [pdf [4]], making it relatively new.
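
[For reviewers who have not read the paper: the speedup comes from 
building a quad-tree over the 2-D embedding and summarizing distant 
groups of points by their centre of mass. A schematic sketch of that 
criterion, assuming a toy quad-tree ``node`` interface rather than the 
actual implementation:]

    import numpy as np

    def repulsive_force(node, y_i, theta=0.5, force=None):
        """Toy Barnes-Hut traversal for one embedded point ``y_i``.

        ``node`` is assumed to expose ``size`` (cell side length),
        ``center_of_mass``, ``n_points`` and ``children`` (empty for a
        leaf). Normalization by Z is omitted for brevity.
        """
        if force is None:
            force = np.zeros_like(y_i)
        if node.n_points == 0:
            return force
        d = y_i - node.center_of_mass
        dist = np.sqrt(np.dot(d, d)) + 1e-12
        # Barnes-Hut criterion: a sufficiently distant cell is treated
        # as a single point located at its center of mass.
        if not node.children or node.size / dist < theta:
            q = 1.0 / (1.0 + dist ** 2)            # Student-t kernel
            force += node.n_points * q ** 2 * d    # unnormalized repulsion
            return force
        for child in node.children:
            repulsive_force(child, y_i, theta, force)
        return force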
>>> 
>>> Ahead of submitting a PR, I wanted to start the process for
>>> thinking about its inclusion (or exclusion!). It's a performance
>>> improvement to an established and widely-used algorithm, but it's
>>> also quite new. 
>>> 
>>> - What do you all think? 
>>> - Is this too new to be incorporated, or does its usefulness merit
>>> inclusion?
>>> - We've read the Contributing [5] docs -- what else can Nick & I
>>> do to make the PR as easy as possible for reviewers? And looking
>>> forward, what can we do to minimize code rot and ease the
>>> maintenance burden?
>>> 
>>> Many thanks & looking forward to being part of scikit-learn
>>> community!
>>> chris & nick
>>> 
>>> 
>> 
> 
> 
> 
> Links:
> ------
> [1] 
> https://github.com/cemoody/scikit-learn/tree/cemoody/bhtsne/sklearn/manifold
> [2] http://lvdmaaten.github.io/tsne/
> [3] http://arxiv.org/abs/1301.3342
> [4] http://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf
> [5] http://scikit-learn.org/stable/developers/
> 

