Hello all,

I have a two-class classification problem with 13 features, most of
them categorical. Some features have many distinct values (2000, 700,
etc.), so a 1-of-N encoding expands the data set to about 4.5k
features. The data has around 1.5 million samples.

When I try to transform the data using DictVectorizer(sparse=False), I
get "ValueError: array is too big". If I omit the sparse=False option,
I get a scipy sparse matrix, which the fit() method of
RandomForestClassifier does not accept. Calling .toarray() on that
matrix does not work either, since it also produces a huge dense array.
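
To put rough numbers on the problem, here is a back-of-the-envelope
estimate of the dense and sparse sizes. It is only a sketch: the sample
and feature counts come from the figures above, while float64 values,
32-bit CSR indices, and exactly one non-zero per original feature per
row are my assumptions (only most of the features are categorical).

```python
# Rough memory estimate for the 1-of-N encoded data set.
# Assumptions (not measured): float64 values, int32 CSR indices, and
# one non-zero entry per original feature in every row.

N_SAMPLES = 1_500_000
N_ENCODED_FEATURES = 4_500
N_ORIGINAL_FEATURES = 13

BYTES_FLOAT64 = 8
BYTES_INT32 = 4


def dense_bytes(rows, cols):
    """Size of a dense float64 array with shape (rows, cols)."""
    return rows * cols * BYTES_FLOAT64


def csr_bytes(rows, nnz_per_row):
    """Size of a CSR matrix: data + column indices + row pointers."""
    nnz = rows * nnz_per_row
    data = nnz * BYTES_FLOAT64
    indices = nnz * BYTES_INT32
    indptr = (rows + 1) * BYTES_INT32
    return data + indices + indptr


dense = dense_bytes(N_SAMPLES, N_ENCODED_FEATURES)
sparse = csr_bytes(N_SAMPLES, N_ORIGINAL_FEATURES)

print(f"dense:  {dense / 1e9:.1f} GB")   # dense:  54.0 GB
print(f"sparse: {sparse / 1e6:.1f} MB")  # sparse: 240.0 MB
```

Under these assumptions the dense array needs about 54 GB, while the
sparse representation of the same data fits in roughly 240 MB.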

What is the way out of this?

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
