Hi,

I have been using sklearn for a while now, but I only recently started to
figure out how to make sure I am using it correctly and that the results I
get are meaningful. The following are therefore fairly general questions
about machine learning applied to text content. Hopefully, someone with
more experience on this kind of problem can help me out.

From a high-level perspective, my problem is very simple: I have a lot of
sentences and I want to perform supervised binary classification on them
based on the content of the text (as well as a couple of other features, but
these are not really relevant here). I started from the obvious approach,
which is to create one feature per word that appears in my training corpus. I
then did the second obvious thing to avoid having too many features, which
is to keep only the features based on words with a high tfidf.
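
A minimal sketch of this setup, for concreteness. The toy corpus, the value
of k, and the choice of "mean tfidf over the corpus" as the per-word score
are all assumptions made for illustration; other definitions of "high tfidf"
are possible.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only.
sentences = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell sharply on monday",
    "the market rallied after the report",
]

# One column per word seen in the training corpus.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)  # sparse, shape (n_sentences, n_words)

# Assumption: score each word by its mean tfidf over all sentences.
scores = np.asarray(X.mean(axis=0)).ravel()

k = 5  # hypothetical number of words to keep
top = np.sort(np.argsort(scores)[::-1][:k])  # column indices of the k best words
X_reduced = X[:, top]

# Map the kept column indices back to words.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}
kept_words = [index_to_word[i] for i in top]
print(kept_words)
```

Note that this scoring ignores the class labels entirely, so it can discard
rare words that are nevertheless very discriminative.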

Intuitively, I see how, for a fixed number of words, I only need to increase
the size of the training corpus to avoid overfitting, but I wonder what
general guideline I should follow to choose the number of features to keep
based on the size of the training corpus.
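
One empirical way to approach this question, sketched under assumptions, is
to treat the number of kept features as a hyperparameter and pick it by
cross-validation. The sketch below swaps in chi-squared feature selection
(SelectKBest) in place of a tfidf threshold, and uses a made-up toy corpus,
made-up labels, and an arbitrary grid of k values; it illustrates the
mechanics only and is not a guideline in itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus and labels, made up for illustration only.
sentences = [
    "the cat sat on the mat", "the dog chased the cat",
    "my dog loves the park", "the kitten slept all day",
    "a puppy barked at the cat", "the cat ignored the dog",
    "stocks fell sharply on monday", "the market rallied after the report",
    "investors sold their shares", "the bank raised interest rates",
    "shares of the company rose", "the market closed lower today",
]
labels = [0] * 6 + [1] * 6

# Selection happens inside the pipeline, so it is refit on each training fold
# and never sees the held-out sentences.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2)),
    ("clf", LogisticRegression()),
])

# Hypothetical grid; on real data this would span a much wider range.
param_grid = {"select__k": [2, 5, "all"]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(sentences, labels)

print(search.best_params_, search.best_score_)
```

Plotting the cross-validated score against k for several training-set sizes
would show directly how the useful number of features grows with the corpus.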

The obvious next step for me is to use word ngrams to increase the
number of meaningful features (and, obviously, to use the tfidf of these word
ngrams to keep only a fraction of the total number of ngrams). Here again,
I wonder if someone could give me advice on a strategy to pick the number
of word ngram features to keep. Pointers to resources that discuss these
issues would be most welcome, since I seem unable to feed the right keywords
to Google to get meaningful results.
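
For reference, a small sketch of how word ngrams inflate the feature count,
using the ngram_range parameter of TfidfVectorizer on a made-up toy corpus.
The min_df pruning at the end is an assumption on my side: it is a
document-frequency filter, not the tfidf-based filter described above, and
is shown only as one cheap way to cut the ngram vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only.
sentences = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the cat sat on the sofa",
    "the market rallied after the report",
]

# Single words only.
unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(sentences)

# Single words plus pairs of adjacent words.
uni_bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(sentences)

n_uni = len(unigrams.vocabulary_)
n_uni_bi = len(uni_bigrams.vocabulary_)

# Assumption: drop every ngram that occurs in fewer than 2 sentences.
pruned = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit(sentences)
n_pruned = len(pruned.vocabulary_)

print(n_uni, n_uni_bi, n_pruned)
```

Most ngrams occur only once even in large corpora, so some form of pruning
is usually needed before any finer selection is applied.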

Thanks again for this impressive piece of code!
Mathieu
-- 
Mathieu Lacage <[email protected]>