2012/1/25 Paolo Losi <[email protected]>:
> Hi Olivier,
>
> your reply is very informative (as always :-) ).
> I've got a couple of questions for you. See below...
>
> On Tue, Jan 24, 2012 at 1:57 PM, Olivier Grisel <[email protected]> wrote:
>>
>> If you can cheaply collect unsupervised data that looks similar to
>> your training set (albeit without the labels and in much larger
>> amounts), it might be interesting to compute cluster centers using
>> MiniBatchKMeans, then project your data onto that space using a
>> non-linear transform (e.g. an RBF kernel), add these additional
>> features to the original features (horizontal concatenation of the
>> two datasets), and then fit the classifier with the labels on the
>> result.
>
> Once you have clustered the unlabeled samples,
> you can add, as extra features on the labeled samples,
> the distance from each cluster center (e.g. computed
> via an RBF kernel).
> Is that what you are suggesting?
They are more similarities than distances once they go through the RBF
function, but yes :)

> Is that effective? Can you point to any paper discussing
> the effectiveness of the approach?

For image classification, yes (along with patch extraction and pooling):

  http://www.stanford.edu/~acoates/papers/coatesng_icml_2011.pdf

The sparse coding idea in general is based on this kind of pipeline
architecture as well.

For text applications I have no reference, but I would intuitively guess
that it works as a correction for sparse inputs in high-dimensional
spaces: it is a kind of feature completion with topical features. When
you use PCA + a linear projection instead of k-means + an RBF kernel,
the scheme is called Latent Semantic Indexing, although it is usually
used for performing euclidean nearest neighbours search rather than
semi-supervised text classification.

> I've never had a chance to master semi-supervised learning...
> Any pointer on where to start is really appreciated.

I don't know semi-supervised learning well in general. What I described
is usually better known as "unsupervised feature extraction", which can
be viewed as a sub-field of semi-supervised learning when the extracted
features are used as input for a supervised model.

For semi-supervised learning itself, this book looks like a good
reference:

  http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11015

(at least the chapter on label propagation / spreading is interesting; I
have not read the other chapters).

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
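[Editor's sketch of the pipeline described above, using scikit-learn. The
random data, n_clusters=50, and gamma=0.1 are illustrative placeholders,
not values from the thread:]

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X_unlabeled = rng.randn(1000, 20)   # large, cheap unlabeled pool
X_train = rng.randn(100, 20)        # small labeled set
y_train = rng.randint(0, 2, 100)

# 1. Cluster the unlabeled data to find representative centers.
km = MiniBatchKMeans(n_clusters=50, random_state=0)
km.fit(X_unlabeled)

# 2. RBF similarity of each labeled sample to each cluster center
#    (similarities, not distances: values lie in (0, 1]).
extra = rbf_kernel(X_train, km.cluster_centers_, gamma=0.1)

# 3. Horizontal concatenation of original and cluster-based features.
X_aug = np.hstack([X_train, extra])

# 4. Fit a supervised classifier on the augmented representation.
clf = LogisticRegression()
clf.fit(X_aug, y_train)
```

The same augmentation (steps 2-3) must of course be applied to any test
data before calling `clf.predict`.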
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
