Well, this definitely looks cool from a language point of view.

I would go for new_stem or something like it, and expect the language to be
determined by a variable.
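
Roughly what I have in mind is sketched here. It assumes the stem() function
and the STEM_* constants described in the quoted message; stem_in() is a
made-up helper name for illustration, not part of the extension.

```php
<?php
// Sketch only. stem() and the STEM_* constants come from the extension
// described in the quoted message; stem_in() is a hypothetical helper.

function stem_in($word, $language)
{
    // Return false if the stem extension is not loaded.
    if (!function_exists('stem')) {
        return false;
    }
    // The language arrives as an ordinary argument rather than being
    // baked into the function name.
    return stem($word, $language);
}

if (function_exists('stem')) {
    // The language is just a value, so it could come from configuration
    // or user input at runtime.
    $language = STEM_FRENCH;
    echo stem_in('majestueusement', $language), "\n";
}
```

The point is only that one function plus a language value covers every
case, so adding a language would not mean adding a new function.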

I hope this allows for more work on various language features... perhaps
you'd want to spend time looking at what else is available.

One final note: you may wish to put this in the PEAR PECL library, since
a) it's a pretty exclusive extension, and b) that's where it should go. :)

James

> -----Original Message-----
> From: J Smith [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, February 12, 2002 1:03 AM
> To: [EMAIL PROTECTED]
> Subject: [PHP-DEV] New extension: stem
>
>
>
> Greetings.
>
> A few months ago, I mentioned that I had written a PHP extension that
> provided English language suffix stemming using a "Porter stemmer", an
> algorithm devised by Dr. Martin Porter for stripping the suffix (or
> suffixes) off of an English word.
>
> After a mention in the PHP weekly summary, I received a few emails
> expressing some interest. I decided to go back to the drawing board and
> redo the extension, adding a lot of functionality. Namely, I've added a
> bunch of languages.
>
> After reading about Dr. Porter's latest stemming adventure, Snowball, I
> decided that the logical course would be to build an extension around
> Snowball. Snowball (see http://snowball.sourceforge.net) is essentially a
> string parsing language that parses a file describing a stemming algorithm
> and then spits out the equivalent C code.
>
> Although I have no idea how to write a stemmer in Snowball, I do know C and
> PHP, so I've taken the stemmers available at Dr. Porter's site and written a
> PHP extension around them. The stemmers, and Snowball itself, are all
> covered under a BSD-like license, so I figured they'd be a good fit with
> PHP.
>
> Anyways, here's the basic synopsis of the extension:
>
> Stemmers can be called in one of two manners.
>
> string stem(string wordToStem [, int language])
> string stem_LANGUAGE(string wordToStem)
>
> I'm not sure which is preferred, so for the time being, either can be used.
> In the first form, language is one of:
>
> STEM_PORTER -- the original Porter stemmer
> STEM_ENGLISH -- an improved English stemmer
> STEM_FRENCH
> STEM_SPANISH
> STEM_DUTCH
> STEM_DANISH
> STEM_GERMAN
> STEM_ITALIAN
> STEM_NORWEGIAN
> STEM_PORTUGUESE
> STEM_RUSSIAN
> STEM_SWEDISH
>
> By default, STEM_PORTER is used.
>
> In the second form, LANGUAGE is one of the aforementioned languages,
> without the STEM_ prefix, i.e. stem_french(), stem_spanish(), etc.
>
> I'll likely add aliases like "STEM_FRANCAIS", "STEM_ESPANOL", etc.
>
> On success, stem() or stem_LANGUAGE() returns the word without its suffix.
> On error, E_NOTICE will be raised, and the function will return false.
>
> Each stemmer uses standard Latin encodings, except for the Russian stemmer,
> which uses the Cyrillic KOI8-R encoding.
>
> A quick demonstration (minus the Russian stemmer, 'cause the encoding might
> screw up this post):
>
> <?php
>
> echo "Original porter (default): assassinations -> " .
>     stem("assassinations") . "\n";
> echo "English: devestating -> " .
>     stem("devestating", STEM_ENGLISH) . "\n";
> echo "French: majestueusement -> " .
>     stem("majestueusement", STEM_FRENCH) . "\n";
> echo "Spanish: chicharrones -> " .
>     stem("chicharrones", STEM_SPANISH) . "\n";
> echo "Dutch: lichamelijkheden -> " .
>     stem("lichamelijkheden", STEM_DUTCH) . "\n";
> echo "German: aufeinanderschlügen -> " .
>     stem("aufeinanderschlügen", STEM_GERMAN) . "\n";
> echo "Italian: pronunciamento -> " .
>     stem("pronunciamento", STEM_ITALIAN) . "\n";
> echo "Norwegian: havnemyndighetene -> " .
>     stem("havnemyndighetene", STEM_NORWEGIAN) . "\n";
> echo "Portuguese: quilométricas -> " .
>     stem("quilométricas", STEM_PORTUGUESE) . "\n";
> echo "Swedish: klostergården -> " .
>     stem("klostergården", STEM_SWEDISH) . "\n";
>
> ?>
>
> Output:
>
> Original porter (default): assassinations -> assassin
> English: devestating -> devest
> French: majestueusement -> majestu
> Spanish: chicharrones -> chicharron
> Dutch: lichamelijkheden -> licham
> German: aufeinanderschlügen -> aufeinanderschlüg
> Italian: pronunciamento -> pronunc
> Norwegian: havnemyndighetene -> havnemyndighet
> Portuguese: quilométricas -> quilométr
> Swedish: klostergården -> klostergård
>
>
> Stemming is case-sensitive at the moment, but only on the suffix. Words
> should be sent to the functions in lower case. (I.e., "AsSaSSinations"
> stems to "AsSaSSin", while "assassinaTions" stems to "assassinaTion".)
>
> If there's any interest in the extension for either personal use or as an
> addition to PHP itself (or any general comments, questions or suggestions),
> let me know.
>
> J
>
>
>
> --
> PHP Development Mailing List <http://www.php.net/>
> To unsubscribe, visit: http://www.php.net/unsub.php
>
>

