Re: Document size rules of thumb

Sandra Clover Thu, 08 Oct 2009 06:39:00 -0700

Thanks for the tip Robin - I was wondering what the difference between
the 2 was, but was unable to find anything on them. On this topic, is
there anything else I should be aware of between the 2 models?

Bayes algorithm: good for ??
CBayes algorithm: good for multiclass classification (categories > 2)
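
For anyone comparing the two, here is a minimal sketch of the difference as it is usually described for complement naive Bayes: the standard rule scores a category from that category's own word counts, while the complement rule scores it from the counts of all the other categories and picks the category whose complement fits the document worst. This is not Mahout's code. The function names, the uniform prior, the Laplace smoothing and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Collect per-category word counts from (label, tokens) pairs."""
    counts = defaultdict(Counter)
    for label, tokens in docs:
        counts[label].update(tokens)
    vocab = {word for c in counts.values() for word in c}
    return counts, vocab


def log_probs(counter, vocab, alpha=1.0):
    """Laplace-smoothed log P(word) estimated from one bag of counts."""
    total = sum(counter.values()) + alpha * len(vocab)
    return {word: math.log((counter[word] + alpha) / total) for word in vocab}


def classify_bayes(tokens, counts, vocab):
    """Standard rule: pick the category whose own counts fit the document best."""
    scores = {}
    for label, own in counts.items():
        lp = log_probs(own, vocab)
        scores[label] = sum(lp[w] for w in tokens if w in vocab)
    return max(scores, key=scores.get)


def classify_cbayes(tokens, counts, vocab):
    """Complement rule: pick the category whose complement fits the document worst."""
    everything = Counter()
    for c in counts.values():
        everything.update(c)
    scores = {}
    for label, own in counts.items():
        complement = everything - own  # word counts from every other category
        lp = log_probs(complement, vocab)
        scores[label] = -sum(lp[w] for w in tokens if w in vocab)
    return max(scores, key=scores.get)


# Made-up toy data: three categories that share one word ("energy").
docs = [
    ("physics", ["quantum", "particle", "energy", "field"]),
    ("biology", ["cell", "protein", "gene", "energy"]),
    ("chemistry", ["molecule", "reaction", "bond", "energy"]),
]
counts, vocab = train(docs)
query = ["gene", "protein"]
bayes_label = classify_bayes(query, counts, vocab)
cbayes_label = classify_cbayes(query, counts, vocab)
```

With many categories and only a few documents each, the complement estimate for a category draws on far more text than the category's own counts do, which is the usual explanation for why it tends to be more stable in the multiclass case.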


  ----- Original Message -----
  From: "Robin Anil"
  To: [email protected]
  Subject: Re: Document size rules of thumb
  Date: Thu, 8 Oct 2009 13:39:20 +0530


  One more tip: you will likely get better results with the cbayes
  algorithm than with the bayes algorithm for multiclass classification
  (categories > 2).

  On Thu, Oct 8, 2009 at 1:37 PM, Robin Anil wrote:

  >
  >
  > On Thu, Oct 8, 2009 at 1:33 PM, Sandra Clover wrote:
  >
  >> Hi Ted, Thanks for the response. To answer your questions:
  >> 1. I have 576 categories.
  >> 2. I started with 5 training documents per category. Went up to 10 but
  >> error levels remained the same. Am going to go up to 30 documents and
  >> am going to increase the length of the documents.
  >> How did you derive the 50 words of training data for some topics?
  >> Curious... S.
  >>
  >>
  > 30 documents is too few if words overlap across categories and you don't
  > have enough discriminative words for each category.
  >
  > Again, with 576 categories you need really good discriminative words in
  > each category to be able to cover all the unknown documents you wish to
  > classify.
  >
  > ----- Original Message -----
  >> From: "Ted Dunning"
  >> To: [email protected]
  >> Subject: Re: Document size rules of thumb
  >> Date: Wed, 7 Oct 2009 10:21:20 -0700
  >>
  >>
  >> Sandra,
  >>
  >> This is a classic case of over-fitting. I suspect training data
  >> inadequacy. One thing you don't say is how many categories you have and
  >> how many training documents per category you have. Your point (2) might
  >> indicate that you have as little as 50 words of training data for some
  >> topics. That would make it difficult for even the best classifiers to
  >> get a sharp result.
  >>
  >> I would recommend the following:
  >>
  >> a) get more training data (always a good thing even if often infeasible)
  >>
  >> b) try a few other algorithms. I would recommend trying Luduan (from my
  >> dissertation, pdf sent to you in a separate email), confidence weighted
  >> learning (see http://www.cs.jhu.edu/~mdredze/publications/, especially
  >> http://www.aclweb.org/anthology-new/D/D09/D09-1052.pdf) and vowpal
  >> (http://hunch.net/~vw/)
  >>
  >> c) post your data for others to try
  >>
  >> Hope this helps.
  >>
  >> On Wed, Oct 7, 2009 at 9:37 AM, Sandra Clover wrote:
  >>
  >> > 0. The setup is Mahout 0.1 & Hadoop 0.19.2 - I think I am using a
  >> > branch version. Currently trying to install the trunk version.
  >> >
  >> > 1. The data I am trying to classify is from scientific papers -
  >> > essentially the abstract title, text and keywords of the paper -
  >> > example below
  >> >
  >> > 2. No data source is under 300 characters
  >> >
  >> > 3. I am training using the Mahout naive Bayes and am getting low
  >> > incorrectly-classified rates, something like 1.67% - I'm quite happy
  >> > with that...
  >> >
  >> > 4. After I have trained the model, Robin, I use the Mahout naive Bayes
  >> > classify() method to classify new (unseen) data (with the
  >> > classification already known) - this is where I start to get
  >> > problems - I get very poor classification rates for new data,
  >> > something like 82% classified unsuccessfully.
  >> >
  >> >
  >> >
  >> > To summarise: I get very good results in training and very poor
  >> > results with new data.
  >> >
  >>
  >>
  >>
  >> --
  >> Ted Dunning, CTO
  >> DeepDyve
  >>
  >> --
  >> Be Yourself @ mail.com!
  >> Choose From 200+ Email Addresses
  >> Get a Free Account at www.mail.com!
  >>
  >>
  >
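
The gap described in this thread (about 1.67% error on the training data against about 82% on new data) is what a held-out split is meant to expose before the model is used on unseen documents. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the data is synthetic, and the "model" is a deliberately over-fitted stand-in that memorises its training texts rather than a real classifier.

```python
import random


def split_per_category(docs, test_fraction=0.2, seed=0):
    """Hold out a fraction of each category's documents for testing."""
    rng = random.Random(seed)
    by_label = {}
    for label, text in docs:
        by_label.setdefault(label, []).append(text)
    train, test = [], []
    for label, texts in by_label.items():
        texts = texts[:]  # copy so the caller's list is not reordered
        rng.shuffle(texts)
        n_test = max(1, int(len(texts) * test_fraction))
        test += [(label, t) for t in texts[:n_test]]
        train += [(label, t) for t in texts[n_test:]]
    return train, test


def error_rate(classify, docs):
    """Fraction of (label, text) pairs that classify(text) gets wrong."""
    wrong = sum(1 for label, text in docs if classify(text) != label)
    return wrong / len(docs)


# Synthetic data: 3 categories with 5 unique documents each.
docs = [(f"cat{i}", f"doc {i}-{j}") for i in range(3) for j in range(5)]
train, test = split_per_category(docs)

# Stand-in for an over-fitted model: it memorises the training texts and
# answers "unknown" for anything it has not seen before.
memory = {text: label for label, text in train}


def memorised(text):
    return memory.get(text, "unknown")


train_error = error_rate(memorised, train)
test_error = error_rate(memorised, test)
```

Any classifier exposed as a function from text to label can be passed to error_rate in place of the stand-in, so the same two numbers can be tracked while the number of documents per category is increased.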

