cts:stem will show the alternative stems, but basic stemming will only use the
first stem given.
Stemmed search matching depends on matching stem to stem. In basic stemming,
that means matching on the first stem; in advanced stemming that means matching
on any of the stems. So, consider your words here:
cts:stem($word,"fr") =>
mourir = mourir
meurt = mourir
mourant = mourir, mourant
mourrait = mourir
Since "mourir" is the first stem for all these, they will all match each other
under basic stemming.
disparu = disparu, disparaître
disparues = disparu, disparaître
disparaître = disparaître
Since the first stem "disparu" does not match the first stem "disparaître",
"disparaître" will not match "disparu" under basic stemming although it would
under advanced stemming.
marche = marche, marcher
marcher = marcher
Since the first stem "marche" does not match the first stem "marcher",
"marche" will not match "marcher" under basic stemming although it would under
advanced stemming.
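To make the rule concrete, here is a small sketch in Python. It is only a toy model of the comparison described above, not MarkLogic code: the stem lists are copied from the cts:stem output in this message, and the names STEMS, matches_basic and matches_advanced are made up for illustration.

```python
# Toy model of stem-to-stem matching. The stem lists are copied from the
# cts:stem output quoted in this message; nothing here calls MarkLogic.
STEMS = {
    "mourir": ["mourir"],
    "meurt": ["mourir"],
    "mourant": ["mourir", "mourant"],
    "mourrait": ["mourir"],
    "disparu": ["disparu", "disparaître"],
    "disparues": ["disparu", "disparaître"],
    "disparaître": ["disparaître"],
    "marche": ["marche", "marcher"],
    "marcher": ["marcher"],
}


def matches_basic(a, b):
    """Basic stemming: compare only the first stem of each word."""
    return STEMS[a][0] == STEMS[b][0]


def matches_advanced(a, b):
    """Advanced stemming: the words match if they share any stem."""
    return bool(set(STEMS[a]) & set(STEMS[b]))


pairs = [("mourir", "mourant"), ("disparu", "disparaître"), ("marche", "marcher")]
for a, b in pairs:
    print(a, b, "basic:", matches_basic(a, b), "advanced:", matches_advanced(a, b))
```

In this model only the "mourir" pair matches under the basic rule, while all three pairs match under the advanced rule, which is the behaviour you are seeing.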
With respect to "baux" -> "bau": "bau" is actually a word in French, with the
plural "baux", although perhaps an obscure one. But even so, in general the
stemming is a combination of dictionary information and algorithms, and you
will occasionally turn up cases where you get something that isn't actually a
word as the stem. But that doesn't really matter: what matters is whether the
stems match. If a particular word is being stemmed in a way that causes trouble
for your application, you can always add it to your custom dictionary to force
a different result.
In general I would say that if basic stemming is not giving you what you want
in terms of search recall, use advanced stemming. The need is partly dependent
on the characteristics of the language, and partly on the needs of your
application. I think French in particular is a language with a lot of words
that have the same surface forms but different underlying stems, and where the
shorter stem (which is generally the first) may not be the high probability
choice, so advanced stemming could make a big difference for some applications.
//Mary
On 04/14/2016 12:58 PM, Gontla Praveen wrote:
Hi Mary,
While testing with only basic stemming enabled, I found some more cases.
For example, a search for the term "mourir" with basic stemming enabled returns
meurt, mourant, mourrait, mourir
let $text:= <text xml:lang="fr">marcher avec la bau rupture de baux septembre
1997, bail marche cette disparues situation bau fait disparaître la
justification. Les services fournis disparu par la demanderesse l'ont été dans
l'attente d'une rémunération,</text>
return
  cts:highlight($text,
    cts:query(
      <cts:word-query>
        <cts:text xml:lang="fr">mourir</cts:text>
        <cts:option>case-insensitive</cts:option>
        <cts:option>diacritic-insensitive</cts:option>
        <cts:option>punctuation-insensitive</cts:option>
      </cts:word-query>),
    <b>{$cts:text}</b>)
Why doesn't the same happen for the terms disparu or marche?
Why is advanced stemming required for these terms? Is it anything specific to
the French language?
Also, when I checked the stems with cts:stem("baux","fr"), I got bau, bail,
where bau doesn't have any meaning in French.
Since only basic stemming is enabled at my DB level, I am seeing documents
that contain baux or bau but not bail.
Can you tell me why there is this difference in behaviour for French stems?
Thanks,
Praveen.