On 01/30/2012 01:19 PM, Olivier Grisel wrote:
> 2012/1/30 Dimitrios Pritsos<[email protected]>:
>> Hello,
>>
>> Still working on several approaches related to AGI (Automated Genre
>> Identification). I tried SGD with partial_fit() and it was quite
>> impressive that it can fit a huge amount of data without worrying too
>> much about fitting the data into RAM; plus it works fine with PyTables
>> EArray, etc.
>>
>> Now I am testing OneClassSVM on the same problem, and it performs very
>> badly.
> OneClassSVM is not a binary classifier, it's an unsupervised model
> that can be used to perform novelty detection or density estimation.

> If you want to do non-linear binary classification you should try SVC
> with an RBF kernel on normalized data (very important) on a sub-sample
> of your data (say, 5000 data points sampled randomly from your data).

Yes, I know that; I just thought that the hyper-elliptical contour might 
work fine as a single-class classifier.
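For reference, the suggested baseline (SVC with an RBF kernel on normalized data, on a sub-sample) might be sketched like this; the synthetic dataset and the C/gamma values are just placeholders, and the API shown is that of a recent scikit-learn release:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a ~5000-point random sub-sample of the real data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)   # normalize first: crucial for RBF
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(scaler.transform(X_train), y_train)
acc = clf.score(scaler.transform(X_test), y_test)
```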

I just used EmpiricalCovariance() and it seems to have similar 
performance to OneClassSVM (given an empirical threshold): still very 
low in precision but quite good in recall. So this does not seem a very 
good avenue to follow.

However, I am not sure that EmpiricalCovariance() does what I am trying 
to do with the PCA, the hyper-elliptical contour, and the Mahalanobis 
distance.
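For what it's worth, the thresholded-Mahalanobis idea can be expressed with EmpiricalCovariance directly, since it exposes a mahalanobis() method (squared distances). The synthetic data and the 97.5th-percentile cut-off below are arbitrary choices, not a recommendation:

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

rng = np.random.RandomState(0)
X_genre = rng.randn(200, 5)                       # samples of one genre
X_new = np.vstack([rng.randn(10, 5),              # 10 more in-class points
                   rng.randn(10, 5) + 6.0])       # 10 clear outliers

cov = EmpiricalCovariance().fit(X_genre)
d2_train = cov.mahalanobis(X_genre)               # squared Mahalanobis distances
threshold = np.percentile(d2_train, 97.5)         # empirical cut-off (a choice)

is_inlier = cov.mahalanobis(X_new) <= threshold   # one-class decision
```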


>> So I would like to test a few other things, like PCA as a classifier.
> First PCA is not a classifier, it's an unsupervised data
> transformation (linear dimensionality reduction). You could use it as
> a preprocessing step for a non-linear classifier. If you use it as a
> preprocessing step for a linear classifier it won't improve the
> performance against directly applying the linear model on the raw data
> (PCA could truncate noise but the regularizer of the linear model
> should be able to deal with noisy features as well directly).
>
Yes, I understand your reasoning; using PCA as a preprocessing step 
seems likely only to accelerate convergence when using SVC with a linear 
kernel, because I have about 10,000 features in some cases and using the 
RBF kernel gives worse performance.

>> I thought that performing a simple PCA and then keeping the transformed
>> data could work as my model, and then I could use the Mahalanobis
>> distance to find how close my new data point was to my model. However
>> the score() function of ProbabilisticPCA() does not return such a
>> metric. Any ideas how I could do that?
> If you want to find which component your data is closest to
> you could use the `pairwise_distances` function from the
> `sklearn.metrics` package with your new data point (wrapped
> as a 2D array with a single row) and the pca_model.components_ .
> However I don't see how that would help you classify your data since,
> as I mentioned previously, PCA is an unsupervised model and its
> components are unrelated to your genre classes.
Thank you, I will try that.
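The `pairwise_distances` tip above might look roughly like this (synthetic data; purely illustrative, since as Olivier notes the components carry no class information):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
pca = PCA(n_components=3).fit(X)

x_new = rng.randn(1, 10)                  # one new point, wrapped as a 2D row
d = pairwise_distances(x_new, pca.components_)   # shape (1, n_components)
nearest = int(d.argmin())                 # index of the closest component
```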
> PCA could be used as a preprocessing step for SVC with a RBF kernel
> though. In that case you'd better use RandomizedPCA, which is much more
> scalable than the vanilla implementation. However I don't expect such
> a pipeline to improve upon a linear model for text classification.
I can use this tip for comparing the closed-set approaches to the AGI 
problem (the most studied approach so far in the literature and in the 
conferences related to IR and text categorization) to the novelty 
detection ones. Thank you for that!
However, using a binary form of the vectors (representing the pages) 
will do about the same kind of "filtering"; I mean, I don't know if it 
will help to get more than the 95.55% accuracy I am getting now. Still, 
I am always curious and I will try it.
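A randomized-PCA-then-RBF-SVC pipeline might be sketched as below. Note an assumption: in later scikit-learn versions the standalone RandomizedPCA class was folded into PCA(svd_solver="randomized"), which is the form used here; the sizes and n_components are placeholders standing in for the ~10,000-feature case:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 100 features standing in for the ~10000-dimensional real case.
X, y = make_classification(n_samples=400, n_features=100,
                           n_informative=10, random_state=0)

clf = make_pipeline(StandardScaler(),
                    PCA(n_components=20, svd_solver="randomized",
                        random_state=0),          # randomized PCA step
                    SVC(kernel="rbf", gamma="scale"))
clf.fit(X, y)
train_acc = clf.score(X, y)
```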
> IMHO, rather than picking statistical model at random you'd better
> focus on improving the way you extract your features (e.g. using
> bi-grams of consecutive words and maybe window based co-occurrences).
> You should also review the literature to know which kind of features
> are best for writer genre identification. I am not sure this is a
> problem where you can reach high accuracy. Even humans would
> have a hard time beating randomness on such a task.
>
Yes, true, the problem seems quite difficult, and feature extraction 
seems to be the most promising way to improve performance compared to 
any other approach. I am already using character n-grams (3-grams, 
4-grams, etc.) that, according to the literature, work great in author 
identification and plagiarism detection.

The real problem of AGI, or web-page genre identification, seems to be 
best described by the novelty detection approach, or one-class 
classification, or cooperative classification, because most of the time 
we have a huge amount of unlabeled pages and very few class 
representative samples (always comparing with the scale one meets on 
the real Web). In addition, a model that has been fitted using, say, a 
few hundred samples will meet a few million unlabeled pages that it is 
supposed to classify correctly as, say, news, e-shops, blogs, forums, 
etc.
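That few-labeled/many-unlabeled setting is exactly where a one-class model like OneClassSVM is meant to fit; a minimal sketch, with synthetic data standing in for the labeled genre samples and the unlabeled stream, and nu chosen arbitrarily:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_labeled = rng.randn(300, 8)                    # the few labeled genre samples
X_stream = np.vstack([rng.randn(20, 8),          # unseen pages of the same genre
                      rng.randn(20, 8) + 5.0])   # pages of other genres

oc = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_labeled)
pred = oc.predict(X_stream)                      # +1 = same genre, -1 = novelty
```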

Thanks for all the above advice; it is very helpful!

One more thing to note, related to cross-validation. When performing 
cross-validation on document-related problems, there is a dictionary 
(vocabulary) to which every sample's features are aligned. However, this 
dictionary should be extracted only from the training samples; in this 
way the test data are projected onto the vector space that was 
originally defined in the training phase. Consequently, for every fold 
in the k-fold cross-validation the dictionary is not the same and the 
features are slightly different. So the cross-validation module does NOT 
seem appropriate for this class of problems, and I thought it might be 
useful if an extension for this kind of problem could be added.
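One way to get that per-fold dictionary behavior is to put the vectorizer inside a Pipeline, so the vocabulary is re-fitted on each training fold and the test fold is only transformed. A sketch with toy documents; module names follow a recent scikit-learn release, so they may differ from the version discussed here:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

docs = ["cheap shoes sale", "buy shoes online", "shoes discount offer",
        "election news today", "breaking news report", "political news update"] * 5
labels = [0, 0, 0, 1, 1, 1] * 5

# The vectorizer is re-fitted on the training fold inside each CV split,
# so each fold gets its own dictionary extracted from training data only.
pipe = make_pipeline(CountVectorizer(), SGDClassifier(random_state=0))
scores = cross_val_score(pipe, docs, labels, cv=3)
```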

Best Regards,

Dimitrios










_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
