On 10 August 2012 01:53, mathieu lacage <[email protected]> wrote:
> Hi,
>
> I have been using sklearn for a while now, but I only recently started to
> figure out how to make sure I am using it correctly and that the results I
> get are meaningful. The following are therefore fairly general questions
> about machine learning applied to text content. Hopefully, someone who has
> more experience than me with this kind of problem might be able to help me
> out...
>
> From a high-level perspective, my problem is very simple: I have a lot of
> sentences and I want to perform a supervised binary classification on them
> based on the content of the text (as well as a couple of other features, but
> these are not really relevant here). I started from the obvious approach,
> which is to create one feature per word that appears in my training corpus. I
> then did the second obvious thing to avoid having too many features, which
> is to keep only the features based on words with a high tf-idf.
>
> Intuitively, I see that for a fixed number of words, increasing the size
> of the training corpus is enough to avoid overfitting, but I wonder what
> general guideline I should follow to choose the number of features to
> keep given the size of the training corpus.
>
> The obvious next step for me is to use word n-grams to increase the
> number of meaningful features (and, obviously, use the tf-idf of these word
> n-grams to keep only a fraction of the total number of n-grams). Here again,
> I wonder if someone could give me advice on a strategy to pick the number
> of word n-gram features to keep. Pointers to resources that discuss these
> issues would be most welcome, since I seem unable to feed the right keywords
> to Google to get meaningful results.
>
> Thanks again for this impressive piece of code!
> Mathieu
> --
> Mathieu Lacage <[email protected]>
>
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
Hi Mathieu,
You may get better answers at MetaOptimize: http://metaoptimize.com/qa/
Many of the people on this mailing list are active there, along with a
number of others.
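
On the question itself, one common approach (not something established in
this thread) is to treat the number of kept features as a hyperparameter and
pick it by cross-validation rather than by a fixed rule. The sketch below is
only an illustration under several assumptions: it uses a recent scikit-learn
(with sklearn.model_selection), a tiny made-up corpus, a chi-squared score to
rank features instead of the raw tf-idf value described in the question, and
an arbitrary grid of candidate values for k.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus: 1 = positive sentence, 0 = negative sentence.
sentences = [
    "the movie was great and fun",
    "a wonderful and moving film",
    "great acting and a clever plot",
    "I loved this brilliant story",
    "fun characters and great music",
    "a brilliant and wonderful cast",
    "the movie was boring and slow",
    "a dull and tedious film",
    "terrible acting and a weak plot",
    "I hated this awful story",
    "slow pacing and terrible music",
    "an awful and boring cast",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    # Word unigrams and bigrams, weighted by tf-idf.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    # Keep the k features with the highest chi-squared score
    # (an assumption of this sketch, not the ranking used in the question).
    ("select", SelectKBest(chi2)),
    ("clf", LogisticRegression()),
])

# Candidate numbers of features to keep; these values are arbitrary.
param_grid = {"select__k": [5, 10, "all"]}

# 3-fold cross-validation; the vectorizer and the selector are refit
# on the training part of each fold, so the held-out part stays unseen.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(sentences, labels)

print(search.best_params_)
print(search.best_score_)
```

Putting the vectorizer and the selector inside the pipeline matters here:
selecting features on the whole corpus before splitting would leak
information from the held-out sentences into the feature choice.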
Thanks,
Robert
--
Public key at: http://pgp.mit.edu/ Search for this email address and select
the key from "2011-08-19" (key id: 54BA8735)