Aha, yes.

AI::Categorizer lets you customize tokenization however you like: subclass the Document class and override its tokenize() method. You could do something like this:

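{
  package My::Documents;
  use AI::Categorizer::Document::Text;
  our @ISA = qw(AI::Categorizer::Document::Text);
  # Called as a method, so $_[0] is the object and $_[1] is the raw text.
  # Splitting on whitespace keeps tokens like '318i' and 'A/T' whole.
  sub tokenize {
    return [split ' ', $_[1]];
  }
}
my $c = AI::Categorizer->new(
              document_class => 'My::Documents',
);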

...
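
To see what the whitespace split does to the strings from the original question, here is a small standalone sketch that runs without AI::Categorizer installed. The helper names whitespace_tokens() and alpha_tokens() are made up for illustration, and alpha_tokens() is only an assumed stand-in for an alphabetic-only tokenizer, not the module's actual default code.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; neither is part of AI::Categorizer.

# Same split as the tokenize() override above: break on runs of whitespace.
sub whitespace_tokens {
    my ($text) = @_;
    return [ split ' ', $text ];
}

# Assumed stand-in for an alphabetic-only tokenizer: keeps only runs of
# letters and lower-cases them. This is NOT the module's real default.
sub alpha_tokens {
    my ($text) = @_;
    return [ map { lc $_ } $text =~ /([A-Za-z]+)/g ];
}

# Print both tokenizations side by side for the strings from the question.
for my $text ('WARRIOR 14-160 14-160', 'BMW 318i', 'A/T') {
    printf "%-24s whitespace: [%s]  alpha-only: [%s]\n",
        "'$text'",
        join(', ', @{ whitespace_tokens($text) }),
        join(', ', @{ alpha_tokens($text) });
}
```

Note that split ' ' preserves case, so 'A/T' stays 'A/T'. If lower-cased features such as 'a/t' are wanted, one option is to split lc($text) instead.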

 -Ken

On Feb 5, 2005, at 11:23 AM, Jason Armstrong wrote:

Thanks for all the good feedback, I'll certainly be following up on it.

I did find one reason why I wasn't getting good matches ... when I
looked more carefully at the Perl data structure, I found that the
'features' hash only contained alphabetic characters. So, for example,
in the string 'WARRIOR 14-160 14-160', only the warrior part was being
used. Also, with 'BMW 318i' and 'BMW 525i', the numbers were being
ignored, and with something like 'A/T', two separate features 'a' and
't' were there.

So my further question is how to get NaiveBayes to use whitespace-separated
words as features ('318i', 'a/t') and not just the individual
alphabetic characters. Is it a simple option when calling
new AI::Categorizer?

--
Jason Armstrong



