2012/1/30 Dimitrios Pritsos <[email protected]>:
> Hello,
>
> Still working on several approaches related to AGI (Automated Genre
> Identification). I tried SGD with partial_fit() and it was quite
> impressive that it can fit a huge amount of data without worrying too
> much about fitting your data into RAM, plus it works fine with PyTables
> EArray, etc.
>
> Now I am testing OneClassSVM on the same problem, and it performs very
> badly.
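For reference, the out-of-core partial_fit pattern you describe looks roughly like this (a minimal sketch; the synthetic blobs below stand in for data streamed from disk, e.g. a PyTables EArray):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for a large on-disk dataset: two well-separated
# Gaussian blobs. In practice each mini-batch would be read from disk
# so the full dataset never needs to fit in RAM at once.
rng = np.random.RandomState(42)
X = np.vstack([rng.randn(500, 20) + 2, rng.randn(500, 20) - 2])
y = np.array([0] * 500 + [1] * 500)

# Shuffle so each mini-batch contains a mix of both classes.
order = rng.permutation(X.shape[0])
X, y = X[order], y[order]

clf = SGDClassifier(loss="hinge", random_state=42)
classes = np.array([0, 1])  # all classes must be declared on the first call

batch_size = 100
for start in range(0, X.shape[0], batch_size):
    batch = slice(start, start + batch_size)
    clf.partial_fit(X[batch], y[batch], classes=classes)

print(clf.score(X, y))
```

The key constraint is that `classes` must be passed to the first `partial_fit` call, since later mini-batches may not contain every class.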
OneClassSVM is not a binary classifier; it's an unsupervised model that can
be used to perform novelty detection or density estimation. If you want to
do non-linear binary classification you should try SVC with an RBF kernel
on normalized data (very important) on a sub-sample of your data (say 5000
data points sampled at random).

> So I would like to test a few other things, like PCA as a classifier.

First, PCA is not a classifier: it's an unsupervised data transformation
(linear dimensionality reduction). You could use it as a preprocessing step
for a non-linear classifier. If you use it as a preprocessing step for a
linear classifier it won't improve performance over applying the linear
model directly on the raw data (PCA can truncate noise, but the regularizer
of the linear model should be able to deal with noisy features just as well
directly).

> I thought that performing a simple PCA and then keeping the transformed
> data could work as my model, and then I could use the Mahalanobis
> distance to find how close my new data point was to my model. However
> the score() function of ProbabilisticPCA() is not returning such a
> metric. Any ideas how I could do that?

If you want to find which component your data is closest to, you could use
the `pairwise_distances` function from the `sklearn.metrics` package with
your new data point (wrapped as a 2D array with a single row) and
pca_model.components_. However, I don't see how that would help you
classify your data: as I mentioned previously, PCA is an unsupervised model
and its components are unrelated to your genre classes.

PCA could be used as a preprocessing step for SVC with an RBF kernel
though. In that case you'd better use RandomizedPCA, which is much more
scalable than the vanilla implementation. However, I don't expect such a
pipeline to improve upon a linear model for text classification.
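Concretely, the "SVC with an RBF kernel on a normalized sub-sample" suggestion looks like this (a minimal sketch; `make_classification` stands in for your real feature matrix):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in for a large dataset; in practice X_big / y_big
# would be your full feature matrix and genre labels.
X_big, y_big = make_classification(
    n_samples=20000, n_features=30, n_informative=10, random_state=0
)

# Sub-sample ~5000 points at random, since kernel SVMs scale poorly
# with the number of training samples.
rng = np.random.RandomState(0)
idx = rng.choice(X_big.shape[0], size=5000, replace=False)
X_sub, y_sub = X_big[idx], y_big[idx]

# Normalizing the data is very important for the RBF kernel: features on
# different scales would otherwise dominate the distance computation.
X_sub = StandardScaler().fit_transform(X_sub)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_sub, y_sub)
print(clf.score(X_sub, y_sub))
```

In a real experiment you would of course evaluate on held-out data rather than on the training sub-sample as done here for brevity.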
IMHO, rather than picking statistical models at random, you'd better focus
on improving the way you extract your features (e.g. using bi-grams of
consecutive words and maybe window-based co-occurrences). You should also
review the literature to find out which kinds of features work best for
genre identification. I am not sure this is a problem where you can reach
high accuracy: even humans would have a hard time beating randomness on
such a task.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
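To make the bi-gram suggestion above concrete, here is a minimal sketch using `TfidfVectorizer` with `ngram_range=(1, 2)` feeding a linear model (the tiny corpus and genre labels are made up purely for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for genre-labelled documents (hypothetical data).
docs = [
    "the experiment was repeated under controlled conditions",
    "results indicate a statistically significant effect",
    "once upon a time a dragon lived in the hills",
    "the knight rode into the dark forest at dawn",
]
genres = ["academic", "academic", "fiction", "fiction"]

# ngram_range=(1, 2) adds bi-grams of consecutive words ("dark forest",
# "significant effect", ...) on top of the unigram features.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    SGDClassifier(random_state=0),
)
model.fit(docs, genres)
print(model.predict(["a dragon and a knight"]))
```

The same pipeline scales to a real corpus, and the vectorizer's vocabulary can get large with bi-grams, so pruning rare n-grams with `min_df` is worth considering on real data.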
