On 01/30/2012 01:19 PM, Olivier Grisel wrote:
> 2012/1/30 Dimitrios Pritsos <[email protected]>:
>> Hello,
>>
>> Still working on several approaches related to AGI (Automated Genre
>> Identification). I tried SGD with partial_fit() and it was quite
>> impressive that it can fit a huge amount of data without worrying too
>> much about fitting your data into RAM; plus, it works fine with
>> PyTables EArray, etc.
>>
>> Now I am testing OneClassSVM on the same problem, where it performs
>> very badly.
> OneClassSVM is not a binary classifier; it is an unsupervised model
> that can be used to perform novelty detection or density estimation.
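[Editor's note: the out-of-core SGD fitting mentioned above can be sketched as below. The minibatch loop and the synthetic data are illustrative only; in the original setup the batches would be slices read from a PyTables EArray on disk.]

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
clf = SGDClassifier(loss="hinge")
classes = np.array([0, 1])  # partial_fit needs the full label set up front

# Stream minibatches (stand-ins for chunks read from disk)
for _ in range(100):
    X_batch = rng.randn(50, 20)
    y_batch = (X_batch[:, 0] > 0).astype(int)  # toy labels
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.randn(10, 20)
pred = clf.predict(X_test)
```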
> If you want to do non-linear binary classification you should try SVC
> with an RBF kernel on normalized data (very important) on a subsample
> of your data (say, 5000 data points sampled randomly).

Yes, I know that; I just thought that the hyper-elliptical contour might work fine as a single-class classifier. I used EmpiricalCovariance(), and it seems to have similar performance to OneClassSVM (given an empirical threshold), which is still very low on precision measures but quite good on recall. So this does not seem to be a good avenue to follow. However, I am not sure that EmpiricalCovariance() does what I am trying to do with the PCA, the hyper-elliptical contour, and the Mahalanobis distance.

>> So I would like to test a few other things, like PCA as a classifier.
> First, PCA is not a classifier; it's an unsupervised data
> transformation (linear dimensionality reduction). You could use it as
> a preprocessing step for a non-linear classifier. If you use it as a
> preprocessing step for a linear classifier it won't improve the
> performance over directly applying the linear model to the raw data
> (PCA could truncate noise, but the regularizer of the linear model
> should be able to deal with noisy features directly as well).

Yes, I understand your reasoning. Using PCA as a preprocessing step, it seems, will only accelerate convergence when using SVC with a linear kernel, because my features number about 10000 in some cases and using rbf gives worse performance.

>> I thought that performing a simple PCA and then keeping the
>> transformed data could work as my model, and then I could use the
>> Mahalanobis distance to find how close my new data point was to my
>> model. However, the score() function of ProbabilisticPCA() does not
>> return such a metric. Any ideas how I could do that?
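[Editor's note: the EmpiricalCovariance-plus-empirical-threshold idea discussed above could look like this. The synthetic training set and the 95th-percentile cutoff are arbitrary illustrations, not values from the original experiments.]

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

rng = np.random.RandomState(0)
X_train = rng.randn(500, 5)  # samples of the single known class

cov = EmpiricalCovariance().fit(X_train)

# Squared Mahalanobis distance of the training points to the fitted center
d_train = cov.mahalanobis(X_train)
threshold = np.percentile(d_train, 95)  # empirical threshold, chosen arbitrarily

# A new point is accepted as in-class if it falls inside the contour
X_new = np.vstack([np.zeros((1, 5)), 10 * np.ones((1, 5))])
in_class = cov.mahalanobis(X_new) <= threshold
```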
> If you want to find which component of your model your data is closest
> to, you could use the `pairwise_distances` function from the
> `sklearn.metrics` package with your new data point (wrapped as a 2D
> array with one single row) and pca_model.components_. However, I don't
> see how that would help you classify your data: as I mentioned
> previously, PCA is an unsupervised model and its components are
> unrelated to your genre classes.

Thank you, I will try that.

> PCA could be used as a preprocessing step for SVC with an RBF kernel,
> though. In that case you'd better use RandomizedPCA, which is much
> more scalable than the vanilla implementation. However, I don't expect
> such a pipeline to improve upon a linear model for text classification.

I can use this tip for comparing the closed-set approaches to the AGI problem (which is the most studied approach until now in the literature and the conferences related to IR and text categorization) against the novelty detection ones. Thank you for that! However, using a binary formation of the vectors (representing the pages) will do about the same kind of "filtering"; I mean, I don't know if it will help to get more than the 95.55% accuracy I am getting now. Still, I am always curious and I will try it.

> IMHO, rather than picking statistical models at random, you'd better
> focus on improving the way you extract your features (e.g. using
> bi-grams of consecutive words and maybe window-based co-occurrences).
> You should also review the literature to find out which kinds of
> features are best for writer genre identification. I am not sure this
> is a problem where you can reach a high accuracy. Even humans would
> have a hard time beating randomness on such a task.

Yes, true, the problem seems quite difficult, and feature extraction seems to be the most important way to improve performance compared to any other approach.
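[Editor's note: the `pairwise_distances` tip above can be sketched as follows on synthetic data. RandomizedPCA was a separate class in 2012-era scikit-learn; in modern versions the randomized solver is an option of the plain PCA class, which is what this sketch uses.]

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.randn(200, 10)  # stand-in for the training data

# Randomized solver, as suggested for scalability
pca = PCA(n_components=3, svd_solver="randomized", random_state=0).fit(X)

# New data point wrapped as a 2D array with one single row
x_new = rng.randn(1, 10)
dist = pairwise_distances(x_new, pca.components_)
closest = dist.argmin()  # index of the nearest principal component
```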
I am already using character n-grams (3-grams, 4-grams, etc.), which, according to the literature, work great in author identification and plagiarism detection. The real problem of AGI, or webpage genre identification, seems to be best described by the novelty detection approach, or one-class classification, or cooperative classification, because most of the time we have a huge amount of unlabeled pages and very few class-representative samples (always comparing to the scale one meets on the real Web). In addition, a model that has been fitted using, say, a few hundred samples will meet a few million unlabeled pages that it is supposed to classify correctly as, say, News, E-shops, Blogs, Forums, etc.

Thanks for all the advice above; it is very helpful!

One more thing to note, related to cross-validation. When performing cross-validation on document-related problems, there is a dictionary to which all the pages' features are aligned for every sample. However, this dictionary should be extracted only from the training samples. In this way the test data are projected onto the vector space that was originally defined in the training phase. Consequently, for every fold in the k-fold cross-validation the dictionary is not the same and the features are slightly different. So the cross-validation module seems NOT to be appropriate for this class of problems. So, I thought it might be useful if an extension for this kind of problem could be added.

Best Regards,

Dimitrios
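[Editor's note: the vocabulary-leakage concern raised above can be addressed by putting the vectorizer inside a Pipeline, so the dictionary is re-extracted from the training split of every fold and never sees the test fold. The toy corpus, labels, and module paths below follow modern scikit-learn spelling and are purely illustrative.]

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy corpus; in the real problem these would be web pages
docs = ["news report on politics", "buy cheap shoes online",
        "latest sports news today", "discount shop sale offer",
        "breaking news headline", "order now free shipping",
        "news article about economy", "best price buy today"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]  # 0 = news, 1 = e-shop

# The vectorizer is re-fitted on the training split of each fold, so the
# vocabulary (dictionary) is extracted only from the training samples
pipe = Pipeline([("vec", TfidfVectorizer()),
                 ("clf", SGDClassifier(random_state=0))])
scores = cross_val_score(pipe, docs, labels, cv=2)
```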
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
