Which classifier have you tried? Are you sure you selected the best hyper-parameters with GridSearchCV? Have you tried to normalize the dataset? For instance, have a look at:
http://scikit-learn.org/dev/modules/preprocessing.html

For very sparse data with large variance in the features, you should try the sklearn.feature_extraction.text.TfidfTransformer:

http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

Also, have you tried to fit ExtraTrees?

Is this a multiclass problem or a multilabel problem (i.e. is the target for each example unique and exclusive, or can a single sample be assigned several labels)? For the latter case, multilabel support was recently introduced and is documented here:

http://scikit-learn.org/stable/modules/multiclass.html

If you can cheaply collect unsupervised data that looks similar to your training set (albeit without the labels and in much larger amounts), it might be interesting to compute cluster centers using MiniBatchKMeans, then project your data onto the space of those centers using a non-linear transform (e.g. an RBF kernel), add these additional features to the original features (horizontal concatenation of the 2 datasets), and then fit the classifier with the labels on the result.

You can also do the same with a linear feature extractor such as sklearn.decomposition.RandomizedPCA or a non-linear dictionary learner such as sklearn.decomposition.MiniBatchDictionaryLearning:

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.MiniBatchDictionaryLearning.html#sklearn.decomposition.MiniBatchDictionaryLearning

Whatever feature extractor you use, the goal is to extract latent components that each have wide feature coverage, so as to cast a large net that can activate those components even if the raw input has very few active features.

There are also 2 incoming pull requests to add real support for semi-supervised learning algorithms in the scikit, but they are not yet merged into master.
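The cluster-center suggestion above can be sketched roughly as follows. This is only an illustration under assumptions: the data is random toy data, the helper name add_cluster_features is made up for this example, and n_clusters, gamma and the choice of LogisticRegression are arbitrary values that you would want to tune (for instance with GridSearchCV).

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel


def add_cluster_features(X, centers, gamma):
    """Append the RBF similarity to each cluster center as extra columns.

    Hypothetical helper written for this example, not a scikit-learn API.
    """
    similarities = rbf_kernel(X, centers, gamma=gamma)  # (n_samples, n_centers)
    # Horizontal concatenation of the original and the new features.
    return sparse.hstack(
        [sparse.csr_matrix(X), sparse.csr_matrix(similarities)]
    ).tocsr()


rng = np.random.RandomState(0)

# Toy stand-ins (assumptions): a large unlabeled pool and a small labeled set.
X_unlabeled = rng.rand(2000, 20)
X_train = rng.rand(100, 20)
y_train = (X_train[:, 0] > 0.5).astype(int)

# 1. Learn the cluster centers on the unlabeled data only.
kmeans = MiniBatchKMeans(n_clusters=10, n_init=3, random_state=0)
kmeans.fit(X_unlabeled)

# 2. Project the labeled data on the centers and concatenate with the raw features.
X_train_aug = add_cluster_features(X_train, kmeans.cluster_centers_, gamma=1.0)

# 3. Fit the classifier with the labels on the augmented features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_aug, y_train)
```

At prediction time the same add_cluster_features call, with the same centers and gamma, has to be applied to the new samples before calling clf.predict.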
With such semi-supervised algorithms you could implement some kind of active learning: enrich your training set by manually annotating the most promising examples from your unsupervised dataset.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
