Hi Jason,
Most likely, the reason this isn't working is that the training data isn't adequate for the task. You're feeding it a bunch of examples where the input string exactly matches the output category, with only one training example per category, but then asking it at run time to switch gears and deal with noisy data. Essentially, it doesn't have enough information to extrapolate a profile for each category.
What this means is that in order to use AI::Categorizer in the obvious way for this project, you're going to have to get your hands on some training data that has the same statistical properties as what you'll see at run time. That means noisy data, with all the mistypings and invalid information, and each noisy string mapped to its correction (the "category").
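Just to make the shape of that data concrete (these strings are invented, and I'm reusing your $c and make_document() from your code below), each noisy input becomes a little document whose category is its corrected description:

    # Made-up examples: noisy input => corrected description (the category)
    my %training = (
        'VLOVO FH12'         => 'VOLVO FH 12',
        'PORSHE 911 CARRERA' => 'PORSCHE 911 CARRERA',
        'WARRIOR 14160'      => 'WARRIOR 14-160 14-160',
    );
    while (my ($noisy, $correct) = each %training) {
        $c->knowledge_set->make_document(
            name       => $noisy,
            content    => $noisy,
            categories => [$correct],
        );
    }

Enough pairs like that, and the learner actually has something to generalize from.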
If you're coming into this project to try to automate a process that's been happening by hand for a while, perhaps you can get your hands on the mistyped/erroneous strings and their corrections, and use that set as training data. If not, you may have to spend some time (or hire someone) hand-categorizing your input.
If you hang around AI stuff long enough, you'll realize that this issue is often the *main* obstacle to doing machine learning, and you'll understand why people often call their training set "gold data". =)
If coming up with a good set of training data isn't an option for this project, you might try a different approach altogether. For instance, recast the problem as a search-engine problem, where your "documents" are your 8200 description strings, your "words" are all the character n-grams (substrings of length n) from those strings, and your "queries" are the noisy strings you're trying to clean up. Sometimes that works pretty well.
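Off the top of my head (untested, trigrams just as an example, and @valid_descriptions standing in for your 8200 strings), the indexing and lookup might look roughly like this:

    # Index each valid description by its character trigrams.
    sub ngrams {
        my ($str, $n) = @_;
        $str = uc $str;
        return map { substr($str, $_, $n) } 0 .. length($str) - $n;
    }

    my %index;   # trigram => { description => count }
    for my $desc (@valid_descriptions) {
        $index{$_}{$desc}++ for ngrams($desc, 3);
    }

    # Score a noisy query by how many trigrams it shares with each description.
    sub best_match {
        my ($noisy) = @_;
        my %score;
        for my $gram (ngrams($noisy, 3)) {
            next unless $index{$gram};
            $score{$_} += $index{$gram}{$_} for keys %{ $index{$gram} };
        }
        my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
        return $best;   # undef if nothing overlapped at all
    }

You'd probably want to normalize for length and play with n, but the basic idea is just shared-substring counting.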
Or you could try the Levenshtein edit distance that Samy suggested. Or you could try something else that you invent. =)
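For the Levenshtein route, by the way, Text::Levenshtein from CPAN does the distance calculation for you; a rough, untested sketch (with @valid_descriptions again standing in for your list of 8200, and brute-forcing every candidate per lookup, so it won't be fast):

    use Text::Levenshtein qw(distance);

    # Return the candidate with the smallest edit distance to the noisy input.
    sub closest {
        my ($noisy, @candidates) = @_;
        my ($best, $best_dist);
        for my $cand (@candidates) {
            my $d = distance(uc $noisy, uc $cand);
            ($best, $best_dist) = ($cand, $d)
                if !defined $best_dist || $d < $best_dist;
        }
        return $best;
    }

    my $guess = closest('WARIOR 14-160', @valid_descriptions);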
-Ken
On Feb 4, 2005, at 4:18 AM, Jason Armstrong wrote:
Perhaps someone on this list has some good advice for me. I am working
on a project that imports vehicle descriptions. Very often the data
capturers enter invalid or mistyped data. I am looking for a way to
intelligently reformat the data and add the mistyped entry for future
use.
I should add that I have very little experience in AI or machine
learning. I don't mind spending a day or two reading up, but my focus is
more on implementing something practical, either in perl or in C.
I've been looking at AI::Categorizer. I have a list of all valid vehicle
descriptions (about 8200). For each of these I create a document in a
knowledge set, with the content the same as the category:
Briefly:
my $c = AI::Categorizer->new(
    knowledge_set => AI::Categorizer::KnowledgeSet->new( name => 'Vehicles' ),
    learner_class => 'AI::Categorizer::Learner::NaiveBayes',
);
my $l = $c->learner;
my %docs;
my $i = 0;
foreach my $content (@vehicle_descriptions) { $docs{$i}{content} = $content; $docs{$i++}{categories} = [$content]; }
foreach (keys %docs) { $c->knowledge_set->make_document(name => $_, %{ $docs{$_} }); }
$l->train;
Sometimes it works well:
input: VOLVO output: VOLVO FH 12
Sometimes not (there is one category called 'WARRIOR 14-160 14-160'):
input: WARRIOR output: PORSCHE 911 CARRERA
In fact, the 'PORSCHE 911 CARRERA' category gets returned most often (are you sure this is artificial intelligence ?-))
Even when I add the content directly:
my $content  = 'WARRIOR';
my $category = 'WARRIOR 14-160 14-160';
$c->knowledge_set->make_document(
    name => $i++, categories => [$category], content => $content,
);
I still get the above result.
I also have some problems with saving the trained learner. After the above example, I do $l->save_state(directory) and then exit the program. When I start it up again:
if (-d directory) { $l->restore_state(directory); }
But then when I try to do anything:
Can't call method "predict" on an undefined value at
/usr/local/share/perl/5.8.4/AI/Categorizer/Learner/NaiveBayes.pm line 28
Two other things:
1. SVM takes forever, and then crashes after consuming all the memory.
2. Does everything need to be loaded into memory, or is there a way to access the data via a database, for example?
I'm ideally looking for something similar to DSpam, which can rate a description and suggest the best category that it belongs in.
Are there any suggestions?
Thank you in advance.
-- Jason Armstrong