Greetings.

A few months ago, I mentioned that I had written a PHP extension that 
provided English language suffix stemming using a "Porter stemmer", an 
algorithm devised by Dr. Martin Porter for stripping the suffix (or 
suffixes) off of an English word. 

After a mention in the PHP weekly summary, I received a few emails 
expressing some interest. I decided to go back to the drawing board and 
redo the extension, adding a lot of functionality. Namely, I've added a 
bunch of languages.

After reading about Dr. Porter's latest stemming adventure, Snowball, I 
decided that the logical course would be to build an extension around 
Snowball. Snowball (see http://snowball.sourceforge.net) is essentially a 
string processing language: its compiler reads a file describing a stemming 
algorithm and then spits out the equivalent C code. 

Although I have no idea how to write a stemmer in Snowball, I do know C and 
PHP, so I've taken the stemmers available at Dr. Porter's site and written a 
PHP extension around them. The stemmers, and Snowball itself, are all 
covered under a BSD-like license, so I figured they'd be a good fit with 
PHP.

Anyways, here's the basic synopsis of the extension:

Stemmers can be called in one of two manners. 

string stem(string wordToStem [, int language])
string stem_LANGUAGE(string wordToStem)

I'm not sure which is preferred, so for the time being, either can be used. 
In the first form, language is one of:

STEM_PORTER -- the original Porter stemmer
STEM_ENGLISH -- an improved English stemmer
STEM_FRENCH
STEM_SPANISH
STEM_DUTCH
STEM_DANISH
STEM_GERMAN
STEM_ITALIAN
STEM_NORWEGIAN
STEM_PORTUGUESE
STEM_RUSSIAN
STEM_SWEDISH

By default, STEM_PORTER is used.

In the second form, LANGUAGE is one of the aforementioned languages, 
without the STEM_ prefix, e.g. stem_french(), stem_spanish(), etc. 

I'll likely add aliases like "STEM_FRANCAIS", "STEM_ESPANOL", etc.

On success, stem() or stem_LANGUAGE() returns the word without its suffix. 
On error, E_NOTICE will be raised, and the function will return false.
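
Since false is the error value, callers will probably want a strict 
comparison when checking the result. Here is a rough sketch of that check. 
The wrapper stem_or_original() and the stand-in $toy_stemmer are made up for 
illustration and are not part of the extension; with the extension loaded, 
the string 'stem' could presumably be passed as the callable instead.

```php
<?php
// Hypothetical wrapper, not part of the extension: returns the original
// word when the supplied stemmer callable signals an error with false.
function stem_or_original($word, $stemmer)
{
    $result = call_user_func($stemmer, $word);
    // Compare with === so that a falsy but valid stem such as "0"
    // is not mistaken for the error value.
    return $result === false ? $word : $result;
}

// Stand-in stemmer so the sketch runs without the extension: it fails on
// empty input and otherwise strips a single trailing "s".
$toy_stemmer = function ($word) {
    if ($word === '') {
        return false;
    }
    return substr($word, -1) === 's' ? substr($word, 0, -1) : $word;
};

echo stem_or_original("stemmers", $toy_stemmer) . "\n"; // toy stemmer strips the "s"
echo stem_or_original("", $toy_stemmer) . "\n";         // error path: input comes back as-is
```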

Each stemmer uses standard Latin encodings, except for the Russian stemmer, 
which uses the Cyrillic KOI8-R encoding.
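
If the input arrives in some other encoding, it would need converting 
first. The sketch below assumes the input is UTF-8 and that "standard 
Latin" means ISO-8859-1; both are assumptions on my part. It relies on 
PHP's iconv() function, and stem_utf8() and the $identity stand-in are 
hypothetical names, not part of the extension.

```php
<?php
// Hypothetical helper, not part of the extension: converts UTF-8 input to
// the single-byte encoding the stemmer expects, stems it, and converts the
// result back to UTF-8. ISO-8859-1 as the default is an assumption.
function stem_utf8($word, $stemmer, $encoding = 'ISO-8859-1')
{
    $bytes = iconv('UTF-8', $encoding, $word);
    if ($bytes === false) {
        return false; // the word cannot be represented in the target encoding
    }
    $stem = call_user_func($stemmer, $bytes);
    return $stem === false ? false : iconv($encoding, 'UTF-8', $stem);
}

// Stand-in for a real stemmer so the sketch runs without the extension:
// it returns its input unchanged.
$identity = function ($word) {
    return $word;
};

// Round trip through ISO-8859-1 and back to UTF-8.
echo stem_utf8("quilométricas", $identity) . "\n";
```

For the Russian stemmer, the same helper could be called with 'KOI8-R' as 
the third argument, provided the local iconv build supports that encoding.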

A quick demonstration (minus the Russian stemmer, 'cause the encoding might 
screw up this post):

<?php

echo "Original porter (default): assassinations -> " . stem("assassinations") . "\n";
echo "English: devestating -> " . stem("devestating", STEM_ENGLISH) . "\n";
echo "French: majestueusement -> " . stem("majestueusement", STEM_FRENCH) . "\n";
echo "Spanish: chicharrones -> " . stem("chicharrones", STEM_SPANISH) . "\n";
echo "Dutch: lichamelijkheden -> " . stem("lichamelijkheden", STEM_DUTCH) . "\n";
echo "German: aufeinanderschlügen -> " . stem("aufeinanderschlügen", STEM_GERMAN) . "\n";
echo "Italian: pronunciamento -> " . stem("pronunciamento", STEM_ITALIAN) . "\n";
echo "Norwegian: havnemyndighetene -> " . stem("havnemyndighetene", STEM_NORWEGIAN) . "\n";
echo "Portuguese: quilométricas -> " . stem("quilométricas", STEM_PORTUGUESE) . "\n";
echo "Swedish: klostergården -> " . stem("klostergården", STEM_SWEDISH) . "\n";

?>

Output:

Original porter (default): assassinations -> assassin
English: devestating -> devest
French: majestueusement -> majestu
Spanish: chicharrones -> chicharron
Dutch: lichamelijkheden -> licham
German: aufeinanderschlügen -> aufeinanderschlüg
Italian: pronunciamento -> pronunc
Norwegian: havnemyndighetene -> havnemyndighet
Portuguese: quilométricas -> quilométr
Swedish: klostergården -> klostergård


Stemming is case-sensitive at the moment, but only within the suffix, so 
words should be sent to the functions in lower case. (For example, 
"AsSaSSinations" stems to "AsSaSSin", while "assassinaTions" stems to 
"assassinaTion".)
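
Until that changes, lower-casing can be done on the caller's side. A 
minimal sketch follows; stem_lowercased() and the stand-in $toy_stemmer are 
hypothetical and not part of the extension. With the extension loaded, the 
string 'stem' could presumably be passed as the callable.

```php
<?php
// Hypothetical helper, not part of the extension: lower-cases the word
// before passing it to whichever stemmer callable is supplied.
function stem_lowercased($word, $stemmer)
{
    // strtolower() works on bytes, so it is only dependable for ASCII
    // letters; accented characters need an encoding-aware step instead.
    return call_user_func($stemmer, strtolower($word));
}

// Stand-in stemmer so the sketch runs without the extension: it strips a
// single trailing lower-case "s" and nothing else.
$toy_stemmer = function ($word) {
    return substr($word, -1) === 's' ? substr($word, 0, -1) : $word;
};

// The upper-case "S" would be missed by the toy stemmer without lower-casing.
echo stem_lowercased("AssassinationS", $toy_stemmer) . "\n";
```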

If there's any interest in the extension for either personal use or as an 
addition to PHP itself (or any general comments, questions or suggestions) 
let me know.

J



-- 
PHP Development Mailing List <http://www.php.net/>