To use Decision Trees, you'll first have to reduce the number of features
so that the dense array fits in memory, for example with feature selection:
http://scikit-learn.org/stable/modules/feature_selection.html
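
For instance, here is a minimal sketch of that idea. It assumes a chi-squared
filter (SelectKBest with chi2, picked only for illustration; other selectors
from the page above would work too) and uses toy data standing in for your
corpus and corpus2. The value k=3 is arbitrary and would need tuning on the
real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins (assumptions) for the `corpus` documents and `corpus2` labels.
corpus = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes attached",
]
corpus2 = [1, 1, 0, 0]

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
X_train = vectorizer.fit_transform(corpus)  # scipy sparse matrix

# Keep only the k best-scoring columns; chi2 works on sparse input,
# so nothing is densified at this stage.
selector = SelectKBest(chi2, k=3)
X_small = selector.fit_transform(X_train, corpus2)

# Only the reduced matrix is converted to a dense array for the tree.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_small.toarray(), corpus2)
predictions = clf.predict(X_small.toarray())
```

The point is that .toarray() is only called after the column count has been
cut down, so the dense array has k columns instead of ~66k.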

You can also use a classifier that handles sparse matrices, such as Naive
Bayes: http://scikit-learn.org/stable/modules/naive_bayes.html
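
As a rough sketch of this option, assuming MultinomialNB (one of several
Naive Bayes variants on that page) and the same kind of toy data in place of
your corpus and corpus2, the sparse TF-IDF matrix can be passed to fit()
directly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins (assumptions) for the `corpus` documents and `corpus2` labels.
corpus = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes attached",
]
corpus2 = [1, 1, 0, 0]

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
X_train = vectorizer.fit_transform(corpus)  # stays sparse

clf = MultinomialNB()
clf.fit(X_train, corpus2)  # no .todense() / .toarray() needed

# New documents go through the same fitted vectorizer before prediction.
X_new = vectorizer.transform(["cheap watches"])
predicted = clf.predict(X_new)
```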

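Finally, as you suggested, you can try to reduce the dtype of the matrix.
Note that fit_transform returns a scipy sparse matrix, not a numpy array, and
that TF-IDF weights are fractional values between 0 and 1, so casting them to
8-bit integers would truncate nearly all of them to 0. float32 is a safer
choice:
import numpy as np
[...]
X_train = vectorizer.fit_transform(corpus)
X_train = X_train.astype(np.float32)
This only shrinks the sparse matrix itself; calling .todense() on it would
still allocate the full dense array.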
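
To get a feel for the sizes involved, here is a back-of-the-envelope sketch
using the figures from your message (1600000 rows, ~66k columns). The helper
dense_gigabytes and the random test matrix are made up for illustration:

```python
import numpy as np
from scipy import sparse

n_rows, n_cols = 1600000, 66000  # figures quoted in the question


def dense_gigabytes(dtype):
    """Size in GB (10**9 bytes) of a dense n_rows x n_cols array of `dtype`."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 1e9


# Dense footprint for a few candidate dtypes.
for dtype in (np.float64, np.float32, np.int8):
    print(np.dtype(dtype).name, round(dense_gigabytes(dtype), 1), "GB")

# On a sparse matrix, astype() only touches the stored non-zero values.
X = sparse.random(1000, 500, density=0.01, format="csr",
                  dtype=np.float64, random_state=0)
X32 = X.astype(np.float32)
print(X.data.nbytes, X32.data.nbytes)  # value storage is halved
```

Even at one byte per value the dense array comes out above 64 GB, so a
smaller dtype helps the sparse matrix but does not by itself make .todense()
feasible.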


Yours,
Felipe


On Mon, Apr 7, 2014 at 8:35 PM, Anand Viswanathan <
[email protected]> wrote:

> Hi Team,
>
> Kindly help me in the following memory problem and to continue with the
> research.
>
> Problem Statement
>
> I am using a document of 1600000 lines and ~66k features. I am using the
> bag of words approach to build a decision tree. Following code is working
> fine for 1000 line document. But throws memory error for the actual 1600000
> line document. My Server has a 64GB of RAM.
>
> Instead of using .todense() or .toarray(), is there any way to use the
> sparse matrix directly ? OR Is there any options to reduce the default type
> float64? Kindly help me on this.
>
> Code:
>
> vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,stop_words='english')
> X_train = vectorizer.fit_transform(corpus)
>
> clf = tree.DecisionTreeClassifier()
> clf = clf.fit(X_train.todense(),corpus2)
>
> Error:
>
> Traceback (most recent call last):
>   File "test123.py", line 103, in <module>
>     clf = clf.fit(X_train.todense(),corpus2)
>   File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 458, in todense
>     return np.asmatrix(self.toarray())
>   File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 550, in toarray
>     return self.tocoo(copy=False).toarray()
>   File "/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 219, in toarray
>     B = np.zeros(self.shape, dtype=self.dtype)
> MemoryError
>
>
>
> Thanks and regards,
> Anand Viswanathan.
> Student Assistant - DMKD lab
> Student Id : 004422334
> Access Id  : fo0111
> Wireless   : (313)655-2520
>
>
>
> ------------------------------------------------------------------------------
> Put Bad Developers to Shame
> Dominate Development with Jenkins Continuous Integration
> Continuously Automate Build, Test & Deployment
> Start a new project now. Try Jenkins in the cloud.
> http://p.sf.net/sfu/13600_Cloudbees
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>