Thanks, David. I reported this same issue to Kasun about three months ago.

Bill Burns
Verbum Communications, Inc.
+1.208.336.6081
[email protected]
http://www.verbumcomm.com


-----Original Message-----
From: David Cramer [mailto:[email protected]] 
Sent: Tuesday, January 10, 2012 9:54 PM
To: Bort, Paul
Cc: [email protected]
Subject: Re: [docbook-apps] WebHelp, English stemmer, problems with specific words

Hi Paul,
Funny you should mention that. I've also been working on the client-side 
stemmer recently to address the same issue you mention and some others. The 
problem was that the client-side stemmer stemmed every word ending in vowel+y 
(relay, array, key, say, day) to -i (relai, arrai, kei, sai, dai), while the 
build-time indexer did not. I'm mostly done, but I think it still overstems 
words like arsenal.

http://docbook.svn.sourceforge.net/viewvc/docbook/trunk/xsl/webhelp/template/content/search/stemmers/en_stemmer.js?r1=9067&r2=9178

Basically, nothing from the section "Exceptional forms in general" was 
implemented and step 1c was incorrectly implemented:
http://snowball.tartarus.org/algorithms/english/stemmer.html
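
For readers who don't want to dig through the Snowball page, here is a minimal
sketch of step 1c as that page describes it: a final y (or Y) becomes i only
when the letter before it is a non-vowel that is not the first letter of the
word. The function name step1c is made up for illustration, and this is not
the code that was committed to en_stemmer.js.

```javascript
// Sketch of Snowball English step 1c (illustrative, not the en_stemmer.js code).
// Rule: replace a final y or Y with i when the preceding letter is a
// non-vowel that is not the first letter of the word.
function step1c(word) {
    var vowels = "aeiouy";
    var last = word.length - 1;
    if (last < 2) {
        // The letter before the final one would be the first letter: no change.
        return word;
    }
    var lastCh = word.charAt(last);
    var endsInY = lastCh === "y" || lastCh === "Y";
    // An uppercase Y marks a y treated as a consonant, so it is not in `vowels`.
    var prevIsVowel = vowels.indexOf(word.charAt(last - 1)) !== -1;
    if (endsInY && !prevIsVowel) {
        return word.substr(0, last) + "i";
    }
    return word;
}
```

With this rule, "cry" becomes "cri" and "happy" becomes "happi", while "relay",
"say" and "by" are left alone.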

Regarding nucleus etc., I've also committed a fix from a colleague that makes 
the search always check the index for the full unstemmed word as well, to catch 
those Latinate terms that are handled correctly by the indexer but not by the 
client-side stemmer:

http://docbook.svn.sourceforge.net/viewvc/docbook/trunk/xsl/webhelp/template/content/search/nwSearchFnt.js?r1=9105&r2=9172
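
The idea of checking the unstemmed word too can be sketched roughly as follows.
This is not the code in nwSearchFnt.js: the index shape (a plain object mapping
a term to an array of document ids) and the names lookupTerm and stem are
assumptions made purely for illustration.

```javascript
// Sketch only: look a query word up under both its stem and its raw form.
// Assumed index shape (hypothetical): { term: [docId, docId, ...] }.
function lookupTerm(index, word, stem) {
    var seen = Object.create(null);   // doc ids already collected
    var hits = [];
    var candidates = [stem(word), word];
    for (var i = 0; i < candidates.length; i++) {
        var term = candidates[i];
        if (!Object.prototype.hasOwnProperty.call(index, term)) {
            continue;                 // this form is not in the index
        }
        var docs = index[term];
        for (var j = 0; j < docs.length; j++) {
            if (!seen[docs[j]]) {
                seen[docs[j]] = true;
                hits.push(docs[j]);
            }
        }
    }
    return hits;
}
```

If the client-side stemmer turns "nucleus" into something the indexer never
produced, the second candidate (the raw word) still finds the entry.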

He's also working on always searching the index for things that look like 
filenames (e.g. build.xml, which it currently tokenizes to 'build' and 'xml').
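
A rough sketch of what that filename handling could look like is below. It is
only a guess at the approach, not the colleague's code: the pattern
FILENAME_LIKE, the function tokenizeQuery, and the rule "name plus a short
alphanumeric extension" are all assumptions.

```javascript
// Sketch only: keep filename-like tokens whole in addition to splitting them.
// Assumption: "looks like a filename" means name.ext with a 1-5 character
// alphanumeric extension (so build.xml matches, "done." does not).
var FILENAME_LIKE = /^[\w-]+(\.[\w-]+)*\.[A-Za-z0-9]{1,5}$/;

function tokenizeQuery(query) {
    var tokens = [];
    var parts = query.split(/\s+/);
    for (var i = 0; i < parts.length; i++) {
        var part = parts[i].toLowerCase();
        if (part === "") {
            continue;
        }
        if (FILENAME_LIKE.test(part)) {
            tokens.push(part);               // whole token, e.g. "build.xml"
        }
        // Also emit the pieces, as the tokenizer does today.
        var words = part.split(/[^a-z0-9]+/);
        for (var k = 0; k < words.length; k++) {
            if (words[k] !== "") {
                tokens.push(words[k]);
            }
        }
    }
    return tokens;
}
```

For the query "edit build.xml" this yields "edit", "build.xml", "build" and
"xml", so the whole filename can be looked up alongside its parts.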

Here's a demo of the current state of things:

http://www.thingbag.net/docbook/docs/content/ch05s01.html

You can grab the en_stemmer.js and use it now. The nwSearchFnt.js file also has 
changes related to adding search weighting to the results, so you'd need to 
take changes from it more carefully.

We should have a release of the xsls out before too long though.

Thanks,
David

On 01/10/2012 07:33 PM, Bort, Paul wrote:
> Hi,
> 
> I found the conversation about problems with the stemmer used with 
> English at 
> http://lists.oasis-open.org/archives/docbook-apps/201103/msg00040.html
> very informative in tracking down the problem I'm having with the
> stemmer, which is similar. In my case, the word that isn't being 
> stemmed correctly is "relay". (It comes out as "relai".) This does 
> break searches: searching for "relay" in a document that should have 
> six matches returns an error "Your search returned no results for 
> relai".
> 
> The solution that I've implemented locally, and offer below for your 
> consideration, is a list of words to be stemmed manually. I've tried 
> to follow your coding style but I'm not a serious JavaScript hacker so 
> I may have stepped on some toes inadvertently.
> 
> Regards,
> Paul Bort
> Systems Engineer
> TMW Systems, Inc.
> [email protected]
> 
> ----------------------------------
> 
> --- en_stemmer.js
> +++ en_stemmer.js
> @@ -54,6 +54,14 @@
>          meq1 = "^(" + C + ")?" + V + C + "(" + V + ")?$",  // [C]VC[V] is m=1
>          mgr1 = "^(" + C + ")?" + V + C + V + C,       // [C]VCVC... is m>1
>          s_v = "^(" + C + ")?" + v;                   // vowel in stem
> +
> +    var exceptionWords = {
> +            "relay":"relay",
> +            "relaying":"relay",
> +            "relays":"relay",
> +            "nucleus":"nucleus",
> +            "zeus":"zeus"
> +        };
>  
>      return function (w) {
>          var     stem,
> @@ -67,6 +75,8 @@
>  
>          if (w.length < 3) { return w; }
>  
> +        if (w in exceptionWords) { return exceptionWords[w]; }
> +
>          firstch = w.substr(0,1);
>          if (firstch == "y") {
>              w = firstch.toUpperCase() + w.substr(1);
> 
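
One caveat with an exception list like the one in the patch: the `in` operator
also matches inherited properties, so a query such as "constructor" or
"toString" would be treated as an exception word. A minimal sketch of a guarded
lookup, written as a wrapper around any stemmer function, is below. The name
withExceptions is made up for illustration and the word list is copied from
the patch.

```javascript
// Sketch only: exception list checked with hasOwnProperty, so inherited
// properties such as "constructor" are not mistaken for exception words.
var exceptionWords = {
    "relay": "relay",
    "relaying": "relay",
    "relays": "relay",
    "nucleus": "nucleus",
    "zeus": "zeus"
};

// Wrap an existing stemmer function (hypothetical `stem`) with the lookup.
function withExceptions(stem) {
    return function (w) {
        if (Object.prototype.hasOwnProperty.call(exceptionWords, w)) {
            return exceptionWords[w];
        }
        return stem(w);
    };
}
```

The wrapper leaves the stemmer itself untouched, which keeps the exception
list easy to drop once the stemmer handles these words on its own.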


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


