2012/1/25 Paolo Losi <[email protected]>:
> Hi Olivier,
>
> your reply is very informative (as always :-) ).
> I've got a couple of questions for you. See below...
>
> On Tue, Jan 24, 2012 at 1:57 PM, Olivier Grisel <[email protected]> wrote:
>>
>> If you can cheaply collect unsupervised data that looks similar to
>> your training set (albeit without the labels and in much larger
>> amounts), it might be interesting to compute cluster centers using
>> MiniBatchKMeans, then project your data onto that space using a
>> non-linear transform (e.g. an RBF kernel), add these additional
>> features to the original features (horizontal concatenation of the
>> two datasets), and then fit the classifier with the labels on the
>> result.
>
> Once you have clustered the unlabeled samples,
> you can add, as extra features on the labeled samples,
> the distance from each cluster center (e.g. computed
> via an RBF kernel).
> Is that what you are suggesting?
They are more similarities than distances once they go through the RBF
function, but yes :)

> Is that effective? Can you point to any paper discussing
> the effectiveness of the approach?

For image classification, yes (along with patch extraction and pooling):

  http://www.stanford.edu/~acoates/papers/coatesng_icml_2011.pdf

The sparse coding idea in general is based on this kind of pipeline
architecture as well.

For text applications I have no reference, but I would intuitively guess
that it works as a correction for sparse inputs in high-dimensional
spaces: it is a kind of feature completion with topical features. When
you use PCA + a linear projection instead of k-means + an RBF kernel,
the scheme is called Latent Semantic Indexing, although it is usually
used for performing euclidean nearest neighbours search rather than
semi-supervised text classification.

> I've never had a chance to master semi-supervised learning...
> Any pointer on where to start is really appreciated.

I don't know semi-supervised learning well in general. What I described
is usually better known as "unsupervised feature extraction", which can
be viewed as a sub-field of semi-supervised learning when the extracted
features are used as input for a supervised model.

For semi-supervised learning itself, this book looks like a good
reference:

  http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11015

(at least the chapter on label propagation / spreading is interesting; I
have not read the other chapters).

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
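[Editor's sketch of the pipeline described above, using scikit-learn. The
random data, n_clusters=50, and gamma=0.1 are illustrative placeholders,
not values from the thread:]

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X_unlabeled = rng.randn(1000, 20)   # large, cheap unlabeled pool
X_train = rng.randn(100, 20)        # small labeled set
y_train = rng.randint(0, 2, 100)

# 1. Cluster the unlabeled data to find representative centers.
km = MiniBatchKMeans(n_clusters=50, random_state=0)
km.fit(X_unlabeled)

# 2. RBF similarity of each labeled sample to each cluster center
#    (similarities, not distances: values lie in (0, 1]).
extra = rbf_kernel(X_train, km.cluster_centers_, gamma=0.1)

# 3. Horizontal concatenation of original and cluster-based features.
X_aug = np.hstack([X_train, extra])

# 4. Fit a supervised classifier on the augmented representation.
clf = LogisticRegression()
clf.fit(X_aug, y_train)
```

The same augmentation (steps 2-3) must of course be applied to any test
data before calling `clf.predict`.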
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
