Thanks, Bridger--that's very helpful! I'm not sure what MarkLogic is using exactly, but it seems fairly sophisticated (there's even an advanced option for multiple stemming: e.g., "further" has "far," "farther," "further" as stems).
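
To make the distinction concrete, here is a toy XQuery sketch (my own illustration, not how either product works internally) that contrasts naive suffix stripping with a dictionary lookup that can return several stems. The `$lemmas` map and both `local:` functions are hypothetical names, and the suffix rule is far simpler than the actual Porter algorithm.

```
(: Toy illustration only: neither function reflects BaseX or MarkLogic internals. :)

(: Hypothetical lemma dictionary; the entries come from the examples in this thread. :)
declare variable $lemmas := map {
  "mice":    "mouse",
  "further": ("far", "farther", "further")
};

(: Naive suffix stripping: removes a trailing "es" or "s" and nothing else. :)
declare function local:strip-suffix($word as xs:string) as xs:string {
  replace($word, "(es|s)$", "")
};

(: Dictionary lookup: returns every listed stem, or the word itself if unlisted. :)
declare function local:lookup($word as xs:string) as xs:string+ {
  if (map:contains($lemmas, $word)) then $lemmas($word) else $word
};

for $word in ("cats", "mouse", "mice", "further")
return
  <word text="{ $word }"
        stripped="{ local:strip-suffix($word) }"
        lemmas="{ local:lookup($word) }"/>
```

This should run as-is in an XQuery 3.1 processor, BaseX included. The suffix rule reduces "cats" to "cat" but leaves "mouse" and "mice" as two distinct forms, whereas the lookup maps "mice" to "mouse" and returns all three stems for "further."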
All best,
Tim

--
Tim A. Thompson (he, him)
Librarian for Applied Metadata Research
Yale University Library


On Wed, Apr 13, 2022 at 12:13 PM Bridger Dyson-Smith <[email protected]> wrote:

> Hi Tim -
>
> On Wed, Apr 13, 2022 at 11:40 AM Tim Thompson <[email protected]> wrote:
>
>> I'm currently involved in a project that's using MarkLogic, and I noticed
>> that its implementation of English-language stemming differs from that of
>> BaseX: e.g., "mouse" and "mice" both stem to "mouse."
>>
>> In BaseX, those words are stemmed separately. Is this a known limitation
>> of the internal English syntax parser?
>>
>
> It's my (admittedly, *VERY*) limited understanding that the BaseX
> stemmer, at least for English, is limited to the Porter Stemmer[1]. The
> Porter Stemmer just stems, and doesn't handle stemming from plurals to
> singulars in the case of apophonic plurals.
>
> It'd be interesting to learn what stemmer(s) MarkLogic uses.
>
> And, while I'm not that familiar with it (and it would probably entail
> significant work to implement), the `ft:thesaurus()` function provides
> similar functionality:
> ```
> ft:thesaurus(
>   <thesaurus>
>     <entry>
>       <term>mice</term>
>       <synonym>
>         <term>mouse</term>
>         <relationship>NT</relationship>
>       </synonym>
>       <synonym>
>         <term>rodent</term>
>         <relationship>BTG</relationship>
>       </synonym>
>     </entry>
>   </thesaurus>,
>   'mice'
> )
> ```
>
>> Example:
>>
>> db:create("stem-test",
>>   <data>
>>     <x>mouse</x>
>>     <y>mice</y>
>>   </data>
>>   , "data", map {"ftindex": true(), "stemming": true(), "language": "en"}
>> )
>> ,
>> update:output(
>>   ft:search("stem-test", "mice")
>> )
>>
>> Thanks,
>> Tim
>>
>
> Best,
> Bridger
>
> [1]
> https://github.com/BaseXdb/basex/blob/da1e55d0214e44c1532f121c282021db50a9aa51/basex-core/src/main/java/org/basex/util/ft/EnglishStemmer.java
>
>> --
>> Tim A. Thompson (he, him)
>> Librarian for Applied Metadata Research
>> Yale University Library

