Another module you may want to use in conjunction is String::Approx. It
uses the Levenshtein edit distance to determine whether a string
approximately matches another or not. In both your volvo and warrior
cases, it would match correctly.
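To give a feel for why edit distance copes with both cases, here is a minimal pure-Perl sketch of approximate substring matching. It is not String::Approx's implementation or API (the module does this work for you); the function names approx_substring_distance and best_match and the three-entry list are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Smallest number of single-character edits (insert, delete, substitute)
# needed to turn $pattern into some substring of $text. Case is ignored.
sub approx_substring_distance {
    my ($pattern, $text) = @_;
    my @p = split //, uc $pattern;
    my @t = split //, uc $text;

    # Row 0 is all zeros: a match may start at any position in $text.
    my @prev = (0) x (@t + 1);
    for my $i (1 .. @p) {
        my @cur = ($i);    # $i pattern characters against empty text
        for my $j (1 .. @t) {
            my $cost = $p[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # drop a pattern character
                $cur[$j - 1] + 1,         # skip a text character
                $prev[$j - 1] + $cost,    # match or substitute
            );
        }
        @prev = @cur;
    }
    return min(@prev);    # a match may end at any position in $text
}

# Return the candidate closest to $input, together with its distance.
# On a tie the earliest candidate wins.
sub best_match {
    my ($input, @valid) = @_;
    my ($best, $best_dist);
    for my $candidate (@valid) {
        my $d = approx_substring_distance($input, $candidate);
        ($best, $best_dist) = ($candidate, $d)
            if !defined $best_dist || $d < $best_dist;
    }
    return ($best, $best_dist);
}

# Hypothetical sample data; the real list would hold all valid descriptions.
my @valid = ('VOLVO FH 12', 'WARRIOR 14-160 14-160', 'PORSCHE 911 CARRERA');
for my $input ('VOLVO', 'WARRIOR', 'WARIOR') {
    my ($match, $dist) = best_match($input, @valid);
    print "$input -> $match (distance $dist)\n";
}
```

The distance is measured against the best-fitting substring, not the whole description, so a short input such as WARRIOR is not penalised for the missing "14-160 14-160". A mistyped WARIOR still lands on the WARRIOR entry at distance 1.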
Good luck!
On Feb 4, 2005, at 2:18 AM, Jason Armstrong wrote:
Perhaps someone on this list has some good advice for me. I am working
on a project that imports vehicle descriptions. Very often, the data
capturers give invalid information, or mistyped data. I am looking for
a way to intelligently reformat the data, and add the mistyped entry for
future use.
I should add that I have very little experience in AI or machine
learning. I don't mind spending a day or two reading up, but my focus
is more on implementing something practical, either in perl or in C.
I've been looking at AI::Categorizer. I have a list of all valid
vehicle descriptions (about 8200). I create for each of these a
knowledge set, with the content the same as the category:
Briefly:
my $c = AI::Categorizer->new(
    knowledge_set => AI::Categorizer::KnowledgeSet->new(name => 'Vehicles'),
    learner_class => 'AI::Categorizer::Learner::NaiveBayes',
);
my $l = $c->learner;
my %docs;
foreach (vehicle descriptions) {
    $docs{$i}->{content} = $content;
    $docs{$i++}->{category} = [$content];
}
foreach (keys %docs) {
    $c->knowledge_set->make_document(name => $_, %{$docs{$_}});
}
$l->train;
Sometimes it works well:
input: VOLVO
output: VOLVO FH 12
Sometimes not (there is one category called 'WARRIOR 14-160 14-160'):
input: WARRIOR
output: PORSCHE 911 CARRERA
In fact, the 'PORSCHE 911 CARRERA' category gets returned most often
(are you sure this is artificial intelligence ?-))
Even when I add the content directly:
$content = 'WARRIOR';
$category = 'WARRIOR 14-160 14-160';
$c->knowledge_set->make_document(
    name => $i++, categories => [$category], content => $content);
I still get the above result.
I also have some problems with saving the training set. After the above
example, I do: $l->save_state(directory), and then exit the program.
When I start it up again:
if (-d directory) {
    $l->restore_state(directory);
}
But then when I try to do anything:
Can't call method "predict" on an undefined value at
/usr/local/share/perl/5.8.4/AI/Categorizer/Learner/NaiveBayes.pm line 28
Two other things:
1. SVM takes forever, and then crashes after consuming all the memory.
2. Does everything need to be loaded into memory, or is there a way to
access the data via a database, for example?
I'm ideally looking for something similar to DSpam, which can rate a
description and suggest the best category that it belongs in.
Are there any suggestions?
Thank you in advance.
--
Jason Armstrong