Re: Advice

Samy Kamkar Fri, 04 Feb 2005 09:21:18 -0800

Another module you may want to use alongside AI::Categorizer is String::Approx. It uses the Levenshtein edit distance to decide whether one string approximately matches another. It should match correctly in both your VOLVO and WARRIOR cases.
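
To make the idea concrete, here is a rough pure-Perl sketch of approximate matching by edit distance. It is not String::Approx's implementation or API. The sub names, the three sample descriptions, and the one-edit threshold are all made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Smallest number of single-character edits (insert, delete, substitute)
# needed to turn $pattern into some substring of $text (case-insensitive).
sub approx_substring_distance {
    my ($pattern, $text) = @_;
    my @p = split //, uc $pattern;
    my @t = split //, uc $text;

    # $prev[$j] is the cost of matching the pattern prefix seen so far against
    # a substring of $text ending at position $j. The first row is all zeros
    # because a match is allowed to start anywhere in $text.
    my @prev = (0) x (@t + 1);
    for my $i (1 .. @p) {
        my @cur = ($i);
        for my $j (1 .. @t) {
            my $subst = $prev[$j - 1] + ($p[$i - 1] eq $t[$j - 1] ? 0 : 1);
            $cur[$j] = min($subst, $prev[$j] + 1, $cur[$j - 1] + 1);
        }
        @prev = @cur;
    }
    return min(@prev);
}

# Hypothetical sample data standing in for the real list of valid descriptions.
my @valid = ('VOLVO FH 12', 'WARRIOR 14-160 14-160', 'PORSCHE 911 CARRERA');

# Keep every candidate that contains the input within $max_edits edits.
sub closest_matches {
    my ($input, $max_edits, @candidates) = @_;
    return grep { approx_substring_distance($input, $_) <= $max_edits } @candidates;
}

print join(', ', closest_matches('WARIOR', 1, @valid)), "\n";
# prints: WARRIOR 14-160 14-160
```

Because the match may start anywhere in the candidate, a short input such as VOLVO or WARRIOR scores zero against the longer description that contains it, which a whole-string edit distance would not do.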

Good luck!

On Feb 4, 2005, at 2:18 AM, Jason Armstrong wrote:

Perhaps someone on this list has some good advice for me. I am working on a project that imports vehicle descriptions. Very often, the data capturers supply invalid or mistyped data. I am looking for a way to intelligently correct the data, and to record each mistyped entry for future use.

I should add that I have very little experience in AI or machine learning. I don't mind spending a day or two reading up, but my focus is more on implementing something practical, either in Perl or in C.

I've been looking at AI::Categorizer. I have a list of all valid vehicle descriptions (about 8200). For each of these I create a document in a knowledge set, with the content the same as the category.
Briefly:
my $c = AI::Categorizer->new(
    knowledge_set => AI::Categorizer::KnowledgeSet->new(name => 'Vehicles'),
    learner_class => 'AI::Categorizer::Learner::NaiveBayes',
);
my $l = $c->learner;
my %docs;
my $i = 0;
# @descriptions holds all valid vehicle descriptions
foreach my $content (@descriptions) {
  $docs{$i}->{content} = $content;
  $docs{$i++}->{categories} = [$content];
}
foreach (keys %docs) {
  $c->knowledge_set->make_document(name => $_, %{$docs{$_}});
}
$l->train;
Sometimes it works well:
input: VOLVO
output: VOLVO FH 12
Sometimes not (there is one category called 'WARRIOR 14-160 14-160'):
input: WARRIOR
output: PORSCHE 911 CARRERA
In fact, the 'PORSCHE 911 CARRERA' category gets returned most often
(are you sure this is artificial intelligence ?-))
Even when I add the content directly:
$content = 'WARRIOR';
$category = 'WARRIOR 14-160 14-160';
$c->knowledge_set->make_document(
  name       => $i++,
  categories => [$category],
  content    => $content,
);
I still get the above result.
I also have some problems with saving the training set. After the above
example, I do: $l->save_state(directory), and then exit the program.
When I start it up again:
if (-d directory) {
  $l->restore_state(directory);
}
But then when I try to do anything:
Can't call method "predict" on an undefined value at /usr/local/share/perl/5.8.4/AI/Categorizer/Learner/NaiveBayes.pm line 28
Two other things:
1. SVM takes forever, and then crashes after consuming all the memory.
2. Does everything need to be loaded into memory, or is there a way to
   access the data via a database, for example?
I'm ideally looking for something similar to DSpam, which can rate a
description and suggest the best category that it belongs in.
Are there any suggestions?
Thank you in advance.
--
Jason Armstrong
