I'm looking into implementing a more advanced similarity algorithm for comparing text than the one I have now. I'm going to use various wordstat files that categorize words. A lot of the words in these files have already been stemmed, i.e. reduced to their common "stem", for instance INFINIT*.
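To make the matching step concrete, here is a rough sketch in Python (not PIL) of how an entry like INFINIT* might be compared with input words. It assumes that a trailing * acts as a prefix wildcard and that matching is case-insensitive; the function names and the sample category name are made up for illustration and are not taken from any wordstat file.

```python
def matches_entry(word, entry):
    """Return True if `word` matches a dictionary entry.

    Assumption: a trailing '*' marks a prefix wildcard, so 'INFINIT*'
    matches any word starting with 'infinit'. Entries without a '*'
    must match the whole word.
    """
    word = word.lower()
    entry = entry.lower()
    if entry.endswith("*"):
        return word.startswith(entry[:-1])
    return word == entry


def categorize(words, categories):
    """Count how many input words fall into each category.

    `categories` maps a (hypothetical) category name to its list of entries.
    """
    counts = {name: 0 for name in categories}
    for word in words:
        for name, entries in categories.items():
            if any(matches_entry(word, e) for e in entries):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = {"BOUNDLESS": ["INFINIT*"]}  # hypothetical category
    print(categorize(["infinity", "finite", "Infinite"], sample))
```

Entries that are plain stems without a wildcard would still need the input words to be stemmed first, which is where a Porter stemmer comes in.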
So I need to stem the input texts that are to be evaluated against the wordstat files, and I found a really good resource: http://tartarus.org/~martin/PorterStemmer/

If you compare, for instance, http://tartarus.org/~martin/PorterStemmer/ruby.txt with the Common Lisp version, http://tartarus.org/~martin/PorterStemmer/commonlisp.txt, you can see that a language like Ruby, which takes some of its influences from Perl, handles the task elegantly in very few lines, whereas the CL version is much bulkier.

At first I thought it would be fairly effortless to code this up in PIL, but now I'm not so sure. What say you guys? Right now I'm inclined to simply use the C version.

/Henrik
