Hi Team,
>
> Kindly help me with the following memory problem so that I can continue with my research.
>
> Problem Statement
>
> I am working with a corpus of 1,600,000 lines and ~66k features. I am using a bag-of-words approach to build a decision tree. The following code works fine for a 1,000-line document, but throws a MemoryError for the actual 1,600,000-line corpus. My server has 64 GB of RAM.
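> A quick back-of-the-envelope calculation (assuming the default float64 dtype of the TF-IDF matrix) shows why densifying cannot fit in 64 GB of RAM:

```python
# Rough memory estimate for densifying the TF-IDF matrix.
n_rows = 1600000          # documents (lines in the corpus)
n_cols = 66000            # features
bytes_per_float64 = 8     # scipy.sparse default dtype

dense_bytes = n_rows * n_cols * bytes_per_float64
print(dense_bytes)            # 844800000000 bytes
print(dense_bytes / 2**30)    # ~787 GiB -- far beyond 64 GB of RAM
```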
>
> Instead of using .todense() or .toarray(), is there any way to use the sparse matrix directly? Or is there an option to reduce the default float64 dtype? Kindly help me with this.
>
> Code:
>
> from sklearn.feature_extraction.text import TfidfVectorizer
> from sklearn import tree
>
> # Build TF-IDF bag-of-words features from the corpus.
> vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
> X_train = vectorizer.fit_transform(corpus)
>
> # Densifying the sparse matrix here is what triggers the MemoryError.
> clf = tree.DecisionTreeClassifier()
> clf = clf.fit(X_train.todense(), corpus2)
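> A minimal sketch of both ideas on toy data (assumptions: the demo shape and random labels are placeholders for the real corpus; support for sparse input in DecisionTreeClassifier depends on the scikit-learn version, and even a float32 dense matrix would still need ~400 GB at this scale):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the real TF-IDF matrix (hypothetical small shape).
X_sparse = csr_matrix(np.random.rand(100, 20))
y = np.random.randint(0, 2, size=100)

# Idea 1: halve memory by casting float64 -> float32 while still sparse.
X_small = X_sparse.astype(np.float32)

# Idea 2: recent scikit-learn versions accept sparse input in fit()
# directly (converted internally to CSC), so .todense() is never needed.
clf = DecisionTreeClassifier()
clf.fit(X_small, y)
print(clf.score(X_small, y))
```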
>
> Error:
>
> Traceback (most recent call last):
>   File "test123.py", line 103, in <module>
>     clf = clf.fit(X_train.todense(),corpus2)
>   File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 458, in todense
>     return np.asmatrix(self.toarray())
>   File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 550, in toarray
>     return self.tocoo(copy=False).toarray()
>   File "/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 219, in toarray
>     B = np.zeros(self.shape, dtype=self.dtype)
> MemoryError
>
>
>
> Thanks and regards,
> Anand Viswanathan.
> Student Assistant - DMKD lab
> Student Id : 004422334
> Access Id  : fo0111
> Wireless   : (313)655-2520
>

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
