On 01/30/2012 03:16 PM, Olivier Grisel wrote:
> 2012/1/30 Dimitrios Pritsos<[email protected]>:
>> So it seems the cross-validation module is NOT
>> appropriate for this class of problems. So, I thought that it might
>> be useful if an extension for this kind of problem could be added.
> I guess you are speaking about
> sklearn.cross_validation.cross_val_score or
> sklearn.grid_search.GridSearchCV.
>
> In that case you can pipeline the feature extraction and the
> classifier and do cross validation on both at the same time:
>
>    
> https://github.com/scikit-learn/scikit-learn/blob/master/examples/grid_search_text_feature_extraction.py
>
I studied the code, and it seems that CountVectorizer() will use a
term-frequency dictionary built from the whole collection for both the
training set and the cross-validation set. This will yield performance
estimates that favor the learner (SGD, SVM, or whichever is used), because
all the texts will be projected into a global vector space that is assumed
to be known in advance. However, this is not the case in either one-class
classification or multi-class classification problems. In the second case,
depending on the size of the data set, you might not notice a difference;
still, in a large-scale problem the size of the feature set might vary a
lot.
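
To make the concern concrete, here is a minimal sketch (not taken from the
linked example) that compares a vocabulary learned from the training
documents only with one learned from the whole collection. The toy
documents are hypothetical; CountVectorizer is used with its default
settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical data, for illustration only).
train_docs = ["the cat sat on the mat", "the dog sat on the log"]
held_out_docs = ["a bird flew over the fence"]

# Vocabulary learned from the training documents only.
train_only = CountVectorizer().fit(train_docs)

# Vocabulary learned from the whole collection (the situation described above).
whole_collection = CountVectorizer().fit(train_docs + held_out_docs)

train_vocab = set(train_only.vocabulary_)
global_vocab = set(whole_collection.vocabulary_)

# Terms that only the held-out document contributes to the global dictionary.
extra_terms = sorted(global_vocab - train_vocab)
print(len(train_vocab), len(global_vocab), extra_terms)

# Projecting the held-out document onto the training-only vocabulary
# keeps the training dimensionality and drops the unseen terms.
X_held_out = train_only.transform(held_out_docs)
print(X_held_out.shape)
```

The two dictionaries differ exactly by the terms that appear only in the
held-out document, which is the information a global vector space would
assume to be known in advance.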

I have built an htm2vector module, and it gives me a large difference in
the number of features extracted (say, character n-grams) depending on the
number of training samples (i.e. documents). I have already tested a
variety of HTML clean-up strategies and tools, and the result is the same:
a different amount of training data gives a different length of the 3-gram
dictionary (depending ONLY on the training set).
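
As a rough illustration of this effect (this is not the htm2vector module;
the helper names and the toy documents below are hypothetical), the
following sketch counts the distinct character 3-grams as more training
documents are added:

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams found in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dictionary_size(docs, n=3):
    """Count the distinct character n-grams across all of `docs`."""
    vocab = set()
    for doc in docs:
        vocab |= char_ngrams(doc, n)
    return len(vocab)


# Toy documents (hypothetical data, for illustration only).
docs = [
    "authorship attribution of web pages",
    "genre identification with character n-grams",
    "one-class classification of unseen documents",
    "cross-validation on a growing training set",
]

# Dictionary length when only the first k documents are used for training.
sizes = [dictionary_size(docs[:k]) for k in range(1, len(docs) + 1)]
print(sizes)
```

The dictionary can only grow or stay the same as documents are added, so
its length is a function of the training set alone.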

Therefore, even though in the grid search there might not be a great
difference when finding the optimal parameters, in the evaluation phase
the results will be estimated higher than they should be, and higher than
they will be in a real-world case, where the unknown document is projected
onto the dictionary that was defined while fitting the model, say in an
author identification problem.
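
A minimal sketch of the evaluation protocol I have in mind, assuming a
recent scikit-learn where the splitters live in sklearn.model_selection
(not the sklearn.cross_validation module mentioned above). The toy corpus,
the labels, and the choice of SGDClassifier are only placeholders. In each
fold the 3-gram dictionary is fitted on the training documents alone, and
the test documents are projected onto it:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

# Toy two-class corpus (hypothetical data, for illustration only).
docs = [
    "the quick brown fox jumps", "the quick red fox runs",
    "a quick brown fox sleeps", "the brown fox jumps high",
    "stock markets fell sharply", "stock prices rose sharply",
    "bond markets fell again", "stock markets rose again",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

scores, vocab_sizes = [], []
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(docs, labels):
    train_docs = [docs[i] for i in train_idx]
    test_docs = [docs[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]

    # The 3-gram dictionary is defined by the training fold only.
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
    X_train = vectorizer.fit_transform(train_docs)
    # Unseen documents are projected onto that fixed dictionary.
    X_test = vectorizer.transform(test_docs)

    clf = SGDClassifier(random_state=0).fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))
    vocab_sizes.append(len(vectorizer.vocabulary_))

print(vocab_sizes, scores)
```

Recording the dictionary length per fold makes it easy to see how much the
feature space depends on which documents end up in the training split.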

Best Regards,

Dimitrios

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
