This is the class imbalance problem (i.e. you have many more training instances for one class than for the other).
In this case you could ensure that the training set is balanced (50:50); more interestingly, you can use a class prior which corrects for this. Or you could over-sample or even under-sample the training set, and so on (a minimal sketch of under-sampling follows after the quoted thread).

Miles

2009/7/22 Grant Ingersoll <[email protected]>

> <done_basking>Grant</done_basking>
>
> Here's an interesting piece:
>
> 09/07/22 18:23:02 INFO bayes.TestClassifier: Testing:wikipedia/subjects/prepared-test/history.txt
> 09/07/22 18:23:07 INFO bayes.TestClassifier: history 95.458984375 3910/4096.0
> 09/07/22 18:23:07 INFO bayes.TestClassifier: --------------
> 09/07/22 18:23:07 INFO bayes.TestClassifier: Testing:/wikipedia/subjects/prepared-test/science.txt
> 09/07/22 18:23:08 INFO bayes.TestClassifier: science 15.554072096128172 233/1498.0
> 09/07/22 18:23:08 INFO bayes.TestClassifier: =======================================================
>
> In other words, I'm really good at predicting History as a category and really bad at predicting Science.
>
> I think the following might help explain why:
>
> ls -l
> total 245360
> -rwxrwxrwx  1 grantingersoll  staff  89518235 Jul 22 17:53 history.txt*
> -rwxrwxrwx  1 grantingersoll  staff  36099183 Jul 22 17:53 science.txt*
>
> The number of history examples is almost double the number of science, based on my test set.
>
> There is obviously a teaching moment here. I know there is a lot out there about sample sizes, feature selection, etc.; can we boil some of these down into some cogent recommendations for our users?
>
> -Grant
>
> On Jul 22, 2009, at 5:23 PM, Grant Ingersoll wrote:
>
>> <basking>Grant</basking>
>>
>> On Jul 22, 2009, at 4:46 PM, Ted Dunning wrote:
>>
>>> Getting something to run is a big step. It is important to bask in the glow for a tiny moment.
>>>
>>> On Wed, Jul 22, 2009 at 1:05 PM, Grant Ingersoll <[email protected]> wrote:
>>>
>>>> Confusion Matrix
>>>> -------------------------------------------------------
>>>> a      b      <--Classified as
>>>> 3910   186    |  4096   a = history
>>>> 1265   233    |  1498   b = science
>>>> Default Category: unknown: 2
>>>> </snip>
>>>>
>>>> At least it's better than 50%, which is presumably a good thing ;-)  I have no clue what the state of the art is these days, but it doesn't seem _horrendous_ either.
>>>>
>>>> I'd love to see someone validate what I have done. Let me know if you need more details. I'd also like to know how I can improve it.
>>>
>>> --
>>> Ted Dunning, CTO
>>> DeepDyve
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
> http://www.lucidimagination.com/search
>

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
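[Editorial sketch, not from the thread.] To make the under-sampling suggestion above concrete, here is a minimal plain-Java sketch that cuts every class down to the size of the rarest one before training. The map-of-lists representation, the `UnderSampler` class, and the generic `D` document type are assumptions for illustration; this is not part of the Mahout API.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    public class UnderSampler {

      /**
       * Balance a training set by randomly under-sampling every class down to
       * the size of the smallest class.  Input is a map from class label
       * (e.g. "history", "science") to that class's training documents.
       */
      public static <D> Map<String, List<D>> underSample(Map<String, List<D>> byClass, long seed) {
        // Size of the rarest class; every other class is cut down to this.
        int minSize = Integer.MAX_VALUE;
        for (List<D> docs : byClass.values()) {
          minSize = Math.min(minSize, docs.size());
        }

        Random rnd = new Random(seed);
        Map<String, List<D>> balanced = new HashMap<String, List<D>>();
        for (Map.Entry<String, List<D>> e : byClass.entrySet()) {
          List<D> docs = new ArrayList<D>(e.getValue());
          Collections.shuffle(docs, rnd);  // take a random subset, not the first N on disk
          balanced.put(e.getKey(), new ArrayList<D>(docs.subList(0, minSize)));
        }
        return balanced;
      }
    }

With a roughly 2:1 history:science split like the one in the thread, this would discard about half of the history documents; over-sampling is the mirror image (duplicate or resample the science documents) and keeps all of the history data at the cost of repeated examples.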

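[Also editorial, not from the thread.] For readers unsure how the per-class percentages in the TestClassifier log relate to the confusion matrix, the following small sketch recomputes them from the matrix quoted above. The numbers and class names come from Grant's output; the `ConfusionStats` helper itself is hypothetical.

    public class ConfusionStats {

      public static void main(String[] args) {
        // Confusion matrix from the thread:
        //                    classified as: a (history)   b (science)
        // actual history (a):                    3910           186
        // actual science (b):                    1265           233
        long[][] m = { {3910, 186},
                       {1265, 233} };
        String[] labels = {"history", "science"};

        for (int i = 0; i < labels.length; i++) {
          long rowTotal = 0;  // everything that is actually class i
          long colTotal = 0;  // everything that was classified as class i
          for (int j = 0; j < labels.length; j++) {
            rowTotal += m[i][j];
            colTotal += m[j][i];
          }
          double recall = 100.0 * m[i][i] / rowTotal;     // the percentage reported in the log
          double precision = 100.0 * m[i][i] / colTotal;
          System.out.printf("%-8s recall %.2f%%  precision %.2f%%%n", labels[i], recall, precision);
        }
        // Prints roughly: history recall 95.46%, science recall 15.55% -- the same figures
        // as the log, showing how a ~74% overall accuracy can hide a weak minority class.
      }
    }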