Oh wow, very cool. Thank you very much for the assistance and info, Alexander!

-----Original Message-----
From: afabisch [mailto:afabi...@mailhost.informatik.uni-bremen.de] 
Sent: Saturday, April 18, 2015 9:15 AM
To: scikit-learn-general@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] TSNE Memory Error

Hi Jason,

memory is a problem in our implementation of t-SNE. I sent a detailed list of 
the required memory to this mailing list a few months ago. You can find it here:

http://sourceforge.net/p/scikit-learn/mailman/message/33090573/

The number of features is irrelevant; only the number of samples matters. 
You have too many samples because the algorithm requires O(n^2) space (in your 
case probably about 30 GB). I would not use the original t-SNE algorithm for 
this dataset anyway because its time complexity is O(n^2) as well, which means 
that you would have to wait days or weeks for the result.
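
As a rough sanity check of the 30 GB figure, here is a small sketch. It assumes 
a single dense n x n matrix of 64-bit floats; the function name is made up for 
illustration, and the actual implementation may hold more than one such array 
at a time, so treat the result as a lower bound.

```python
def dense_matrix_gb(n_samples, bytes_per_value=8):
    """Size in GB (10**9 bytes) of one dense n_samples x n_samples matrix.

    Assumption: 8 bytes per entry (float64). This is only an estimate for a
    single pairwise matrix, not a measurement of the t-SNE implementation.
    """
    return n_samples ** 2 * bytes_per_value / 1e9


full = 61878          # number of observations in the dataset from this thread
half = full // 2      # taking half of the dataset

print(f"{dense_matrix_gb(full):.1f} GB")  # 30.6 GB
print(f"{dense_matrix_gb(half):.1f} GB")  # 7.7 GB
```

Because the memory grows with n^2, halving the number of samples cuts the 
estimate to a quarter.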

There is a new pull request that implements Barnes-Hut t-SNE here:

https://github.com/scikit-learn/scikit-learn/pull/4025

The advantage of Barnes-Hut t-SNE over the original t-SNE is a time complexity 
of O(n log n). However, at the moment the full distance matrix is still 
computed, so it would not fix your original problem. I think the memory 
problem should be solved soon.

In your case you could take half of the dataset. The number of features is not 
critical at all. You can take all 93 features without any dimensionality 
reduction.
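
One way to take half of the dataset is sketched below. It is only an 
illustration under some assumptions: X is a 2-D array of shape (n_samples, 93), 
the helper names subsample_indices and embed_subsample are hypothetical, and 
the TSNE parameters are the ones mentioned in the original question.

```python
import random


def subsample_indices(n_samples, fraction=0.5, seed=0):
    """Return sorted row indices of a random subsample (without replacement)."""
    k = int(n_samples * fraction)
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_samples), k))


def embed_subsample(X, fraction=0.5, seed=0):
    """Run t-SNE on a random subset of the rows of X, keeping all features."""
    # Imported here so the index helper above works without these packages.
    import numpy as np
    from sklearn.manifold import TSNE

    X = np.asarray(X)
    idx = subsample_indices(X.shape[0], fraction, seed)
    tsne = TSNE(n_components=2, perplexity=40, random_state=seed)
    # Only the selected rows are embedded; idx maps them back to the full data.
    return idx, tsne.fit_transform(X[idx])
```

Keeping the returned indices makes it possible to look up labels or other 
columns for the embedded rows afterwards.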

Best regards,

Alexander

On 2015-04-18 01:48, Jason Wolosonovich wrote:
> Hello All,
> 
> My dataset has 93 features and just under 62,000 observations (61,878 
> to be exact). I'm running out of memory right after the mean sigma 
> value is computed/displayed. I've tried using dimensionality reduction 
> via TruncatedSVD with n_components set at different levels (78, 50 and
> 2 respectively) prior to sending the data to TSNE but I still run out 
> of memory. For TSNE, n_components=2 and perplexity=40 (I've also tried 
> 20). I've got 24 GB of RAM on my 64-bit Windows 7 machine. Should I try 
> a subsample of the dataset and if so, does anyone have a 
> recommendation on the size? Thanks!
> 
> -Jason
> ----------------------------------------------------------------------
> -------- BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT 
> Develop your own process in accordance with the BPMN 2 standard Learn 
> Process modeling best practices with Bonita BPM through live exercises
> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
> event?utm_
> source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
> 
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

