Aha, yes.
AI::Categorizer lets you customize tokenization however you want, by subclassing the Document class and overriding the tokenize() method. You could do something like this:
  {
    package My::Documents;
    @ISA = qw(AI::Categorizer::Document::Text);
    sub tokenize {
      return [split ' ', $_[1]];
    }
  }

  my $c = new AI::Categorizer(
    document_class => 'My::Documents',
  );
...
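To see what a whitespace split produces on the strings from the quoted message below, here is a small standalone sketch that does not need AI::Categorizer installed. The function names whitespace_tokens and whitespace_tokens_lc are made up for illustration, and the lowercasing variant is an assumption on my part: split ' ' keeps the original case, so add lc only if you want features like 'a/t' rather than 'A/T'.

```perl
use strict;
use warnings;

# Whitespace-only tokenizer: the same split used in the tokenize()
# override above, pulled out into a plain function (hypothetical name).
sub whitespace_tokens {
    my ($text) = @_;
    return [ split ' ', $text ];
}

# Assumed variant, not part of the original suggestion:
# lowercase each token so 'A/T' and 'a/t' become the same feature.
sub whitespace_tokens_lc {
    my ($text) = @_;
    return [ map { lc $_ } split ' ', $text ];
}

# Print the lowercased tokens for each sample string, separated by '|'.
for my $s ('WARRIOR 14-160 14-160', 'BMW 318i', 'A/T') {
    print join('|', @{ whitespace_tokens_lc($s) }), "\n";
}
```

If the lowercased form is what you want, the body of tokenize() in the subclass would presumably become the same map/split expression applied to $_[1].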
-Ken
On Feb 5, 2005, at 11:23 AM, Jason Armstrong wrote:
Thanks for all the good feedback, I'll certainly be following up on it.
I did find one reason why I wasn't getting good matches ... when I looked more carefully at the perl data structure, I found that the 'features' hash only contained alphabetic characters. So, for example, in the string 'WARRIOR 14-160 14-160', only the warrior part was being used. Also, with 'BMW 318i' and 'BMW 525i', the numbers were being ignored, and with something like 'A/T', two separate features 'a' and 't' were there.
So my further question is how to get NaiveBayes to use white space separated words as features ('318i', 'a/t') and not just the individual alphabetic characters. Is it a simple option when calling new AI::Categorizer?
-- Jason Armstrong