Which classifiers have you tried? Are you sure you selected the best
hyper-parameters with GridSearchCV? Have you tried to normalize the
dataset? For instance, have a look at:

  http://scikit-learn.org/dev/modules/preprocessing.html
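
As a rough sketch of what I mean by combining normalization with a
hyper-parameter search: the snippet below assumes a recent scikit-learn
(where GridSearchCV lives in sklearn.model_selection), uses a synthetic
dataset as a stand-in for yours, and picks LogisticRegression and the C
grid purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (hypothetical sizes).
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Putting the scaler inside the pipeline means it is re-fit on each
# training fold, so no statistics leak from the validation folds.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid over the regularization strength only.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

If your data is a scipy.sparse matrix, StandardScaler(with_mean=False)
keeps it sparse by skipping the centering step.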

For very sparse data with large variance in the features, you should
try the sklearn.feature_extraction.text.TfidfTransformer.

  
http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer
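
A minimal sketch of applying it to a sparse count matrix. The toy counts
are made up; the last row has one very large count to mimic the "large
variance" case.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfTransformer

# Toy sparse count matrix: 3 samples, 4 features (made-up values).
counts = sparse.csr_matrix(np.array([
    [3, 0, 0, 1],
    [0, 2, 0, 1],
    [0, 0, 40, 1],
]))

# Default settings: idf re-weighting followed by L2 row normalization.
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(counts)

# With the default norm="l2", every non-empty row ends up with unit length,
# which tames the sample with the very large raw count.
row_norms = np.sqrt(np.asarray(X_tfidf.multiply(X_tfidf).sum(axis=1))).ravel()
print(row_norms)
```

The output stays sparse, so it can be fed directly to linear models
that accept scipy.sparse input.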

Also, have you tried fitting ExtraTrees (sklearn.ensemble.ExtraTreesClassifier)?
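
Something along these lines, assuming a recent scikit-learn and a
synthetic dataset in place of yours (n_estimators=100 is an arbitrary
choice, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real dataset (hypothetical sizes).
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# Extremely randomized trees; the number of trees is illustrative.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy.
scores = cross_val_score(forest, X, y, cv=5)
print(scores.mean())
```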

Is this a multiclass problem or a multilabel problem (i.e. is the
target for each example unique and exclusive, or can a single sample be
assigned several labels)?

For the latter case, multilabel support was recently introduced and is
documented here:

  http://scikit-learn.org/stable/modules/multiclass.html
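
To make the multilabel case concrete, here is a small sketch using
OneVsRestClassifier. It assumes a recent scikit-learn (MultiLabelBinarizer
is used to build the indicator matrix); the toy data, the label names "a"
and "b", and the choice of LogisticRegression are all made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data: each sample can carry one or several labels.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
              [0.1, 0.9], [1.0, 1.0], [0.9, 0.8]])
labels = [("a",), ("a",), ("b",), ("b",), ("a", "b"), ("a", "b")]

# Turn the label sets into a binary indicator matrix, one column per label.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

# One independent binary classifier is fit per label column.
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)

pred = clf.predict(X)
print(binarizer.inverse_transform(pred))
```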

If you can cheaply collect unlabeled data that looks similar to
your training set (albeit without the labels and in much larger
amounts), it might be interesting to compute cluster centers using
MiniBatchKMeans, then project your data onto the space of those centers
using a non-linear transform (e.g. an RBF kernel), add these additional
features to the original features (horizontal concatenation of the 2
datasets), and finally fit the classifier with the labels on the result.
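
The steps above could be sketched as follows. This is only an
illustration under assumptions: random dense data stands in for your
datasets, and the number of clusters, the gamma value, the helper name
add_cluster_features and the final LogisticRegression are all arbitrary
choices of mine.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X_unlabeled = rng.rand(1000, 10)   # large unlabeled set (made up)
X_train = rng.rand(50, 10)         # small labeled set (made up)
y_train = (X_train[:, 0] > 0.5).astype(int)

# Step 1: learn cluster centers on the unlabeled data only.
kmeans = MiniBatchKMeans(n_clusters=20, n_init=3, random_state=0)
kmeans.fit(X_unlabeled)


def add_cluster_features(X, centers, gamma=1.0):
    """Append RBF similarities to each center to the original features."""
    similarities = rbf_kernel(X, centers, gamma=gamma)
    return np.hstack([X, similarities])


# Step 2: project the labeled data and concatenate horizontally.
X_aug = add_cluster_features(X_train, kmeans.cluster_centers_)

# Step 3: fit the supervised model on the augmented features.
clf = LogisticRegression(max_iter=1000).fit(X_aug, y_train)
print(X_aug.shape)
```

For sparse input you would use scipy.sparse.hstack instead of np.hstack,
and the same transform has to be applied to the test data before
predicting.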

You can also do the same with a linear feature extractor such as
sklearn.decomposition.RandomizedPCA or a non-linear dictionary learner
such as sklearn.decomposition.MiniBatchDictionaryLearning:

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.MiniBatchDictionaryLearning.html#sklearn.decomposition.MiniBatchDictionaryLearning
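
A rough sketch of the same idea with these two extractors. Note that in
recent scikit-learn releases RandomizedPCA no longer exists as a separate
class, so this uses PCA(svd_solver="randomized") instead; the data and
the numbers of components are made up.

```python
import numpy as np
from sklearn.decomposition import PCA, MiniBatchDictionaryLearning

rng = np.random.RandomState(0)
X_unlabeled = rng.rand(500, 10)   # large unlabeled set (made up)
X_train = rng.rand(40, 10)        # small labeled set (made up)

# Linear extractor: randomized PCA fit on the unlabeled data.
pca = PCA(n_components=5, svd_solver="randomized", random_state=0)
pca.fit(X_unlabeled)

# Non-linear extractor: dictionary atoms learned on the unlabeled data;
# transform() computes a sparse code for each sample.
dico = MiniBatchDictionaryLearning(n_components=8, random_state=0)
dico.fit(X_unlabeled)

# Concatenate the original features with both sets of extracted features.
X_aug = np.hstack([X_train, pca.transform(X_train), dico.transform(X_train)])
print(X_aug.shape)
```

In practice you would probably pick one extractor and tune the number of
components by cross-validation rather than stacking both.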

Whichever feature extractor you use, the goal is to extract latent
components that each have wide feature coverage, so as to cast a
wide net that can activate those components even if the raw
input has very few active features.

There are also 2 incoming pull requests that add real support for
semi-supervised learning algorithms to scikit-learn, but they are not
yet merged into master. With them, you could implement some kind of
active learning, enriching your training set by manually annotating
the most promising examples from your unlabeled dataset.
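
Until then, a simple form of this can be sketched with any probabilistic
classifier. The snippet below does not use the pull requests mentioned
above; it assumes "most promising" means "the examples the current model
is least confident about" (uncertainty sampling), which is only one
possible criterion, and it runs on made-up data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_labeled = rng.randn(40, 5)                    # small labeled set (made up)
y_labeled = (X_labeled[:, 0] + X_labeled[:, 1] > 0).astype(int)
X_pool = rng.randn(500, 5)                      # unlabeled pool (made up)

# Fit on what is labeled so far.
clf = LogisticRegression().fit(X_labeled, y_labeled)

# Confidence = probability of the predicted class for each pool sample.
proba = clf.predict_proba(X_pool)
confidence = proba.max(axis=1)

# Pick the samples with the lowest confidence for manual annotation.
n_queries = 10
query_idx = np.argsort(confidence)[:n_queries]
print(query_idx)
```

After annotating X_pool[query_idx] by hand, you would append those
samples to the labeled set, refit, and repeat.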

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
