Perhaps someone on this list has some good advice for me. I am working on a project that imports vehicle descriptions. Very often, the data capturers give invalid information, or mistyped data. I am looking for a way to intelligently reformat the data, and add the mistyped entry for future use.
I should add that I have very little experience in AI or machine learning. I don't mind spending a day or two reading up, but my focus is more on implementing something practical, either in perl or in C.

I've been looking at AI::Categorizer. I have a list of all valid vehicle descriptions (about 8200). I create for each of these a knowledge set, with the content the same as the category:

Briefly:

    my $c = new AI::Categorizer(
        knowledge_set => AI::Categorizer::KnowledgeSet->new (
            name => 'Vehicles',
        ),
        learner_class => 'AI::Categorizer::Learner::NaiveBayes');

    my $l = $c->learner;

    my %docs;

    foreach (vehicle descriptions) {
        $docs{$i}->{content} = $content;
        $docs{$i++}->{category} = [$content];
    }

    foreach (keys %docs) {
        $c->knowledge_set->make_document(name => $_, %{$docs->{$_}});
    }

    $l->train;

Sometimes it works well:

    input:  VOLVO
    output: VOLVO FH 12

Sometimes not (there is one category called 'WARRIOR 14-160 14-160'):

    input:  WARRIOR
    output: PORSCHE 911 CARRERA

In fact, the 'PORSCHE 911 CARRERA' category gets returned most often (are you sure this is artificial intelligence ?-))

Even when I add the content directly:

    $content = 'WARRIOR';
    $category = 'WARRIOR 14-160 14-160';
    $c->knowledge_set->make_document(
        name => $i++,
        categories => [$category],
        content => $content);

I still get the above result.

I also have some problems with saving the training set. After the above example, I do:

    $l->save_state(directory),

and then exit the program. When I start it up again:

    if (-d directory) {
        $l->restore_state(directory);
    }

But then when I try to do anything:

    Can't call method "predict" on an undefined value at
    /usr/local/share/perl/5.8.4/AI/Categorizer/Learner/NaiveBayes.pm line 28

Two other things:

1. SVM takes forever, and then crashes after consuming all the memory.
2. Does everything need to be loaded into memory, or is there a way to access the data via a database, for example.
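As a point of comparison only, the lookup described above (take a possibly mistyped description and suggest the closest valid one) can be sketched without a trained classifier. The following is a minimal sketch that does not use AI::Categorizer: it swaps in plain token-wise Levenshtein similarity instead of Naive Bayes. The function names (edit_distance, token_similarity, score, best_match) are hypothetical, the three valid descriptions are the ones quoted in this message, and the misspelled inputs are invented for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Levenshtein edit distance between two strings (row-by-row dynamic programming).
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Similarity in [0, 1] between two non-empty tokens; 1 means identical.
sub token_similarity {
    my ($x, $y) = @_;
    return 1 - edit_distance($x, $y) / max(length $x, length $y);
}

# Average, over the input tokens, of the best similarity to any candidate token.
sub score {
    my ($input, $candidate) = @_;
    my @in   = split ' ', uc $input;
    my @cand = split ' ', uc $candidate;
    return 0 unless @in && @cand;
    my $total = 0;
    for my $tok (@in) {
        $total += max(map { token_similarity($tok, $_) } @cand);
    }
    return $total / scalar @in;
}

# Linear scan over all valid descriptions; returns (best description, its score).
sub best_match {
    my ($input, @valid) = @_;
    my ($best, $best_score) = (undef, -1);
    for my $desc (@valid) {
        my $s = score($input, $desc);
        ($best, $best_score) = ($desc, $s) if $s > $best_score;
    }
    return ($best, $best_score);
}

# Sample data: valid descriptions from this message, inputs made up for illustration.
my @valid = ('VOLVO FH 12', 'WARRIOR 14-160 14-160', 'PORSCHE 911 CARRERA');
for my $input ('WARRIOR', 'VOVLO') {
    my ($desc, $score) = best_match($input, @valid);
    printf "%-10s => %s (%.2f)\n", $input, $desc, $score;
}
```

With this scoring, an input whose tokens all appear verbatim in a valid description scores 1, so 'WARRIOR' maps to 'WARRIOR 14-160 14-160'. The scan is linear in the number of valid descriptions, which is probably acceptable for about 8200 entries but has not been measured here.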
I'm ideally looking for something similar to DSpam, which can rate a description and suggest the best category that it belongs in. Are there any suggestions?

Thank you in advance.

--
Jason Armstrong