On 01/30/2012 03:16 PM, Olivier Grisel wrote:
> 2012/1/30 Dimitrios Pritsos<[email protected]>:
>> So the cross-validation module seems NOT
>> to be appropriate for this class of problems. So, I thought that it might
>> be useful if an extension for this kind of problem could be added.
> I guess you are speaking about
> sklearn.cross_validation.cross_val_score or
> sklearn.grid_search.GridSearchCV.
>
> In that case you can pipeline the feature extraction and the
> classifier and do cross validation on both at the same time:
>
> https://github.com/scikit-learn/scikit-learn/blob/master/examples/grid_search_text_feature_extraction.py
>

I studied the code, and it seems that CountVectorizer() will use a term-frequency dictionary built from the whole collection for both the training set and the cross-validation set. This will produce performance estimates that favor the learner (whichever it is: SGD, SVM, etc.), because all the texts will be projected into a global vector space that is assumed to be known in advance. However, this is not the case in either one-class classification or multi-class classification problems. In the multi-class case, depending on the scale of the data set, you might not notice a difference; still, in a large-scale problem the size of the feature set might vary a lot.
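To make the concern concrete, here is a minimal sketch in plain Python. It is not scikit-learn's CountVectorizer and not the htm2vector module; the helper names char_ngrams, fit_dictionary and project, and the toy documents, are assumptions invented purely for illustration. It compares a character 3-gram dictionary fitted on the training documents only with one fitted on the whole collection, and shows which n-grams of a held-out document the training-only dictionary cannot represent.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the list of overlapping character n-grams in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_dictionary(docs, n=3):
    """Map every n-gram seen in docs to a column index (sorted for determinism)."""
    vocab = sorted({gram for doc in docs for gram in char_ngrams(doc, n)})
    return {gram: idx for idx, gram in enumerate(vocab)}


def project(doc, dictionary, n=3):
    """Count the n-grams of doc, silently dropping those not in dictionary."""
    counts = Counter(g for g in char_ngrams(doc, n) if g in dictionary)
    vector = [0] * len(dictionary)
    for gram, count in counts.items():
        vector[dictionary[gram]] = count
    return vector


# Toy collection (hypothetical data, for illustration only).
train_docs = ["the cat sat", "the hat"]
held_out_doc = "a cat nap"

# Dictionary as it would exist at prediction time: training documents only.
train_only = fit_dictionary(train_docs)
# Dictionary built from the whole collection, held-out document included.
global_dict = fit_dictionary(train_docs + [held_out_doc])

# N-grams of the held-out document that the training-only dictionary misses.
unseen = set(char_ngrams(held_out_doc)) - set(train_only)

print(len(train_only), len(global_dict), sorted(unseen))
```

With the global dictionary every n-gram of the held-out document gets a column, whereas with the training-only dictionary the unseen ones are dropped at projection time. That gap is the information an evaluation would be granted if the dictionary were fitted before splitting the data.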

I have built an htm2vector module, and in fact it gives me a great difference in the number of features extracted (say, character n-grams) depending on the number of training samples (i.e. documents). I have already tested a variety of HTML clean-up strategies and tools, and the same holds: a different number of training documents gives a different length of the 3-gram dictionary (depending ONLY on the training set). Therefore, even though in the grid search there might not be a great difference in finding the optimal parameters, in the evaluation phase the results will be estimated higher than they should be, and higher than they will be in a real-world case, where the unknown document is projected onto the dictionary that was defined while fitting the model, say in an author identification problem.

Best Regards,

Dimitrios

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
