Another module you may want to use in conjunction with AI::Categorizer is String::Approx. It uses the Levenshtein edit distance to decide whether one string approximately matches another. In both your VOLVO and WARRIOR cases, it should match correctly.
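
To give a feel for what edit-distance matching does, here is a minimal pure-Perl sketch. It is not String::Approx's code or API: the helper names approx_distance and best_match are made up for illustration, and the three-entry catalog just reuses the descriptions from your message. It computes the fewest single-character edits needed to turn the input into some substring of each catalog entry, then picks the closest entry.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: smallest number of single-character edits needed to
# turn $pattern into some substring of $text (0 means it occurs verbatim).
sub approx_distance {
    my ($pattern, $text) = @_;
    my @p = split //, uc $pattern;
    my @t = split //, uc $text;

    # Row 0 is all zeros: a match may start anywhere in $text at no cost.
    my @prev = (0) x (@t + 1);
    for my $i (1 .. @p) {
        my @cur = ($i);    # $i pattern characters against nothing costs $i
        for my $j (1 .. @t) {
            my $cost = $p[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,             # pattern character left unmatched
                $cur[$j - 1] + 1,          # extra character in the text
                $prev[$j - 1] + $cost,     # match or substitution
            );
        }
        @prev = @cur;
    }
    return min(@prev);     # a match may also end anywhere in $text
}

# Hypothetical helper: the catalog entry closest to $input, or undef if
# even the closest one needs more than $max_edits edits.
sub best_match {
    my ($input, $max_edits, @catalog) = @_;
    my ($best, $best_d);
    for my $entry (@catalog) {
        my $d = approx_distance($input, $entry);
        ($best, $best_d) = ($entry, $d) if !defined $best_d || $d < $best_d;
    }
    return defined $best_d && $best_d <= $max_edits ? $best : undef;
}

my @catalog = ('VOLVO FH 12', 'WARRIOR 14-160 14-160', 'PORSCHE 911 CARRERA');
print best_match('WARIOR', 1, @catalog) // 'no match', "\n";
# prints "WARRIOR 14-160 14-160"
```

With about 8200 descriptions a plain loop like this is probably fast enough for interactive use, but String::Approx does the matching in C, so I would reach for it first.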

Good luck!

On Feb 4, 2005, at 2:18 AM, Jason Armstrong wrote:

Perhaps someone on this list has some good advice for me. I am working
on a project that imports vehicle descriptions. Very often, the data
capturers give invalid information, or mistyped data. I am looking for a
way to intelligently reformat the data, and add the mistyped entry for
future use.


I should add that I have very little experience in AI or machine
learning. I don't mind spending a day or two reading up, but my focus is
more on implementing something practical, either in perl or in C.


I've been looking at AI::Categorizer. I have a list of all valid vehicle
descriptions (about 8200). I create for each of these a knowledge set,
with the content the same as the category:


Briefly:

my $c = AI::Categorizer->new(
  knowledge_set => AI::Categorizer::KnowledgeSet->new(name => 'Vehicles'),
  learner_class => 'AI::Categorizer::Learner::NaiveBayes',
);

my $l = $c->learner;

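# @vehicle_descriptions stands in for the list of valid descriptions
my %docs;
my $i = 0;
foreach my $content (@vehicle_descriptions) {
  $docs{$i}->{content}    = $content;
  $docs{$i}->{categories} = [$content];
  $i++;
}

# Add one document per description to the knowledge set
foreach (keys %docs) {
  $c->knowledge_set->make_document(name => $_, %{ $docs{$_} });
}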

$l->train;


Sometimes it works well:

input: VOLVO
output: VOLVO FH 12

Sometimes not (there is one category called 'WARRIOR 14-160 14-160'):

input: WARRIOR
output: PORSCHE 911 CARRERA

In fact, the 'PORSCHE 911 CARRERA' category gets returned most often
(are you sure this is artificial intelligence ?-))

Even when I add the content directly:

$content = 'WARRIOR';
$category = 'WARRIOR 14-160 14-160';
$c->knowledge_set->make_document(
  name => $i++, categories => [$category], content => $content);


I still get the above result.

I also have some problems with saving the training set. After the above
example, I call $l->save_state($directory) and then exit the program.
When I start it up again:

if (-d $directory) {
  $l->restore_state($directory);
}

But then when I try to do anything:

Can't call method "predict" on an undefined value at
/usr/local/share/perl/5.8.4/AI/Categorizer/Learner/NaiveBayes.pm line 28



Two other things:

1. SVM takes forever, and then crashes after consuming all the memory.
2. Does everything need to be loaded into memory, or is there a way to
   access the data via a database, for example?

I'm ideally looking for something similar to DSpam, which can rate a
description and suggest the best category that it belongs in.

Are there any suggestions?

Thank you in advance.

--
Jason Armstrong


