2012/1/30 Dimitrios Pritsos <[email protected]>:
> Hello,
>
> Still working on several approaches related to AGI (Automated Genre
> Identification). I tried SGD with partial_fit() and it was quite
> impressive that it can fit a huge amount of data without worrying too much
> about fitting your data into RAM, plus it works fine with PyTables EArray, etc.
>
> Now I am testing OneClassSVM on the same problem, on which it performs
> very badly.

OneClassSVM is not a binary classifier; it's an unsupervised model
that can be used to perform novelty detection or density estimation.
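To illustrate the novelty-detection use case, here is a minimal sketch
with synthetic data (the dataset and the nu / gamma values are
illustrative assumptions, not tuned settings):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# Train only on "normal" samples drawn around the origin.
X_train = rng.randn(100, 2)
# Test on a mix of inliers and obvious outliers far from the training data.
X_test = np.r_[rng.randn(10, 2),
               rng.uniform(4, 5, size=(10, 2))]

clf = OneClassSVM(nu=0.1, gamma=0.5).fit(X_train)
pred = clf.predict(X_test)  # +1 for inliers, -1 for outliers
```

Note that there is no `y` anywhere: the model only learns the support
of the training distribution, which is why it cannot act as a binary
classifier for genre labels.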

If you want to do non-linear binary classification, you should try SVC
with an RBF kernel on normalized data (very important), on a subsample
of your data (say 5000 data points sampled randomly from your data).
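That recipe could look like the following sketch (the synthetic dataset
stands in for your feature matrix, and the scaler / SVC parameters use
the current scikit-learn API and are only illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in for your full feature matrix.
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

# Random subsample of ~5000 points to keep SVC training tractable.
rng = np.random.RandomState(0)
idx = rng.permutation(len(X))[:5000]
X_sub, y_sub = X[idx], y[idx]

X_train, X_test, y_train, y_test = train_test_split(
    X_sub, y_sub, test_size=0.2, random_state=0)

# Normalizing the features before the RBF kernel is the important part.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
```

SVC training is roughly quadratic in the number of samples, which is
why the subsampling step matters for large datasets.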

> So I would like to test a few other things, like PCA as a classifier.

First, PCA is not a classifier; it's an unsupervised data
transformation (linear dimensionality reduction). You could use it as
a preprocessing step for a non-linear classifier. If you use it as a
preprocessing step for a linear classifier, it won't improve
performance over applying the linear model directly to the raw data
(PCA can truncate noise, but the regularizer of the linear model
should be able to deal with noisy features directly as well).

> I thought that performing a simple PCA and then keeping the
> transformed data could work as my model, and then I could use the
> Mahalanobis distance to find how close my new data point was to my
> model. However, the score() function of ProbabilisticPCA() is not
> returning such a metric. Any ideas how I could do that?

If you want to find which component your data is closest to, you
could use the `pairwise_distances` function from the
`sklearn.metrics` package with your new data point (wrapped as a 2D
array with a single row) and pca_model.components_ . However, I don't
see how that would help you classify your data: as I mentioned
previously, PCA is an unsupervised model and its components are
unrelated to your genre classes.
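Concretely, that computation could look like this (the iris dataset
and the choice of 2 components are just placeholders for your data):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

X = load_iris().data
pca = PCA(n_components=2).fit(X)

new_point = X[0]  # pretend this is a previously unseen sample
# Wrap the point as a 2D array with a single row, then compare it
# against each principal component (one row per component).
dists = pairwise_distances(new_point.reshape(1, -1), pca.components_)
```

`dists` has one column per component, but again: which component is
closest tells you nothing about class membership.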

PCA could be used as a preprocessing step for SVC with an RBF kernel,
though. In that case you'd better use RandomizedPCA, which is much
more scalable than the vanilla implementation. However, I don't expect
such a pipeline to improve upon a linear model for text classification.
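A sketch of that pipeline, using the digits dataset as a stand-in
(note that in recent scikit-learn releases the RandomizedPCA class has
been folded into `PCA(svd_solver="randomized")`; the component count
and kernel settings below are assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale, reduce with randomized PCA, then classify with an RBF SVC.
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=30, svd_solver="randomized", random_state=0),
    SVC(kernel="rbf"))
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
```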

IMHO, rather than picking statistical models at random, you'd better
focus on improving the way you extract your features (e.g. using
bi-grams of consecutive words and maybe window-based co-occurrences).
You should also review the literature to find out which kinds of
features work best for genre identification. I am not sure this is a
problem where you can reach high accuracy. Even humans would have a
hard time beating chance on such a task.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
