Greetings.
A few months ago, I mentioned that I had written a PHP extension that provided English-language suffix stemming using a "Porter stemmer", an algorithm devised by Dr. Martin Porter for stripping the suffix (or suffixes) from an English word. After a mention in the PHP weekly summary, I received a few emails expressing interest, so I decided to go back to the drawing board and redo the extension with a lot more functionality. Namely, I've added a bunch of languages.

After reading about Dr. Porter's latest stemming adventure, Snowball, I decided that the logical course would be to build an extension around it. Snowball (see http://snowball.sourceforge.net) is essentially a string-handling language: you describe a stemming algorithm in a Snowball file, and the Snowball compiler spits out the equivalent C code. Although I have no idea how to write a stemmer in Snowball, I do know C and PHP, so I've taken the stemmers available at Dr. Porter's site and written a PHP extension around them. The stemmers, and Snowball itself, are all covered under a BSD-like license, so I figured they'd be a good fit with PHP.

Anyway, here's the basic synopsis of the extension. Stemmers can be called in one of two ways:

    string stem(string wordToStem [, int language])
    string stem_LANGUAGE(string wordToStem)

I'm not sure which is preferred, so for the time being, either can be used.

In the first form, language is one of:

    STEM_PORTER      -- the original Porter stemmer
    STEM_ENGLISH     -- an improved English stemmer
    STEM_FRENCH
    STEM_SPANISH
    STEM_DUTCH
    STEM_DANISH
    STEM_GERMAN
    STEM_ITALIAN
    STEM_NORWEGIAN
    STEM_PORTUGUESE
    STEM_RUSSIAN
    STEM_SWEDISH

By default, STEM_PORTER is used.

In the second form, LANGUAGE is one of the aforementioned languages without the STEM_ prefix, for example stem_french() or stem_spanish(). I'll likely add aliases like "STEM_FRANCAIS", "STEM_ESPANOL", etc.

On success, stem() or stem_LANGUAGE() returns the word without its suffix. On error, an E_NOTICE is raised and the function returns false.
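As a rough sketch of how calling code might cope with the false return value: the helper below is not part of the extension, stem_or_original() is a hypothetical name, and the function_exists() guard is only there so the snippet still runs when the extension is not loaded. It also lower-cases the input first, since suffix matching is case-sensitive (see the note further down in this message).

```php
<?php
// Hypothetical helper, not part of the extension: stem a word and fall
// back to the lower-cased input if stemming is unavailable or fails.
function stem_or_original($word, $language = null)
{
    // Lower-case first, because suffix matching is case-sensitive.
    // strtolower() is only reliable for ASCII letters; accented
    // characters may need different handling.
    $word = strtolower($word);

    // Without the extension loaded there is nothing to call.
    if (!function_exists('stem')) {
        return $word;
    }

    // Leaving out the language argument selects the default stemmer
    // (STEM_PORTER, according to the synopsis above).
    $result = ($language === null) ? stem($word) : stem($word, $language);

    // stem() returns false on error; the strict comparison avoids
    // mistaking a legitimate "0" or "" result for a failure.
    return ($result === false) ? $word : $result;
}

echo stem_or_original("Assassinations"), "\n";
```

The strict === comparison is the main point here: a loose check such as if (!$result) would also treat an empty or "0" stem as an error.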
Each stemmer uses standard Latin encodings, except for the Russian stemmer, which uses the Cyrillic KOI8-R encoding.

A quick demonstration (minus the Russian stemmer, since the encoding might screw up this post):

    <?php
    echo "Original porter (default): assassinations -> " . stem("assassinations") . "\n";
    echo "English: devestating -> " . stem("devestating", STEM_ENGLISH) . "\n";
    echo "French: majestueusement -> " . stem("majestueusement", STEM_FRENCH) . "\n";
    echo "Spanish: chicharrones -> " . stem("chicharrones", STEM_SPANISH) . "\n";
    echo "Dutch: lichamelijkheden -> " . stem("lichamelijkheden", STEM_DUTCH) . "\n";
    echo "German: aufeinanderschlügen -> " . stem("aufeinanderschlügen", STEM_GERMAN) . "\n";
    echo "Italian: pronunciamento -> " . stem("pronunciamento", STEM_ITALIAN) . "\n";
    echo "Norwegian: havnemyndighetene -> " . stem("havnemyndighetene", STEM_NORWEGIAN) . "\n";
    echo "Portuguese: quilométricas -> " . stem("quilométricas", STEM_PORTUGUESE) . "\n";
    echo "Swedish: klostergården -> " . stem("klostergården", STEM_SWEDISH) . "\n";
    ?>

Output:

    Original porter (default): assassinations -> assassin
    English: devestating -> devest
    French: majestueusement -> majestu
    Spanish: chicharrones -> chicharron
    Dutch: lichamelijkheden -> licham
    German: aufeinanderschlügen -> aufeinanderschlüg
    Italian: pronunciamento -> pronunc
    Norwegian: havnemyndighetene -> havnemyndighet
    Portuguese: quilométricas -> quilométr
    Swedish: klostergården -> klostergård

Stemming is case-sensitive at the moment, but only where the suffix is concerned, so words should be passed to the functions in lower case. (For example, "AsSaSSinations" stems to "AsSaSSin", while "assassinaTions" stems to "assassinaTion".)

If there's any interest in the extension, either for personal use or as an addition to PHP itself (or if you have any general comments, questions or suggestions), let me know.

J

--
PHP Development Mailing List <http://www.php.net/>
To unsubscribe, visit: http://www.php.net/unsub.php