Hello all, I have a two-class classification problem with 13 features, mostly categorical. Some features have as many as 2000 or 700 distinct values, so a 1-of-N encoding expands the data set to about 4.5k features. The data has around 1.5 million samples.
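For scale, here is a rough back-of-the-envelope estimate of what those dimensions imply for memory. It is only a sketch under stated assumptions: 8-byte float64 values (the DictVectorizer default dtype), 4-byte column indices in a CSR matrix, and at most one non-zero entry per original feature per row.

```python
# Rough memory estimate for the encoded data set (assumptions noted inline).
n_samples = 1_500_000
n_encoded_features = 4_500   # columns after 1-of-N encoding
n_original_features = 13     # at most one non-zero per original feature per row

bytes_per_value = 8          # assumption: float64 values
bytes_per_index = 4          # assumption: int32 column indices in CSR

# Dense array: every cell is stored, including all the zeros.
dense_bytes = n_samples * n_encoded_features * bytes_per_value

# Sparse CSR matrix: only non-zero values, their column indices,
# and one row pointer per row (plus one) are stored.
nnz = n_samples * n_original_features
sparse_bytes = nnz * (bytes_per_value + bytes_per_index) + (n_samples + 1) * bytes_per_index

print(f"dense:  {dense_bytes / 1e9:.1f} GB")
print(f"sparse: {sparse_bytes / 1e9:.2f} GB")
```

Under these assumptions the dense form needs tens of gigabytes while the sparse form stays well under one gigabyte, which is why densifying the matrix is the step that fails.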
When I try to transform the data using DictVectorizer(sparse=False), I get a ValueError: array is too big. If I omit the sparse=False option, I get a scipy sparse matrix, which the fit() method of RandomForestClassifier does not accept. Calling .toarray() on it does not work either, since that also produces a huge array. What is the way out of this?
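One possible way out, sketched on a tiny made-up data set: keep the DictVectorizer output sparse and pass it straight to the estimator. This assumes a scikit-learn release in which the tree ensembles accept scipy sparse input (added in later releases, 0.16 to my knowledge); on an older version the fit() call below would fail as described above. The records, labels, and feature names here are hypothetical and only illustrate the shape of the data.

```python
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy records: two categorical features and one numeric one.
records = [
    {"city": "Paris", "browser": "firefox", "age": 31},
    {"city": "Oslo", "browser": "chrome", "age": 45},
    {"city": "Lima", "browser": "firefox", "age": 27},
    {"city": "Paris", "browser": "chrome", "age": 52},
]
labels = [0, 1, 0, 1]

# sparse=True is the default: string values are 1-of-N encoded,
# numeric values are passed through as a single column.
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(records)

print(sparse.issparse(X), X.shape)

# Assumption: the installed scikit-learn accepts sparse input here.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, labels)

# New records go through the same fitted vectorizer, staying sparse.
predictions = clf.predict(vectorizer.transform(records))
```

If the installed version does not accept sparse input for random forests, estimators that have long supported it, such as SGDClassifier or LogisticRegression, could be fitted on the same X instead.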
