Thanks, Bridger--that's very helpful! I'm not sure what MarkLogic is using
exactly, but it seems fairly sophisticated (there's even an advanced option
for multiple stemming: e.g., "further" has "far," "farther," "further" as
stems).
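
For comparison, here is a rough sketch of why a suffix-stripping stemmer like
the one Bridger describes below leaves "mice" alone. It covers only the
plural rules (step 1a) of the published Porter algorithm, written in Python
for brevity. The function name is made up for illustration, and this is not
the BaseX EnglishStemmer code.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a (plural suffixes) of the Porter algorithm.

    Hypothetical helper for illustration; not taken from BaseX or MarkLogic.
    """
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS, e.g. caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # IES -> I, e.g. ponies -> poni
    if word.endswith("ss"):
        return word       # SS -> SS, e.g. caress stays as it is
    if word.endswith("s"):
        return word[:-1]  # S -> (nothing), e.g. cats -> cat
    return word           # no plural suffix: mice and mouse are unchanged


for w in ("caresses", "ponies", "cats", "mice", "mouse"):
    print(w, "->", porter_step_1a(w))
```

Every rule here works on the end of the word, so a plural formed by an
internal vowel change never matches, and "mice" and "mouse" stay distinct.
Mapping one to the other would presumably need a dictionary-based approach.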

All best,
Tim


-- 
Tim A. Thompson (he, him)
Librarian for Applied Metadata Research
Yale University Library



On Wed, Apr 13, 2022 at 12:13 PM Bridger Dyson-Smith <[email protected]>
wrote:

> Hi Tim -
>
> On Wed, Apr 13, 2022 at 11:40 AM Tim Thompson <[email protected]> wrote:
>
>> I'm currently involved in a project that's using MarkLogic, and I noticed
>> that its implementation of English-language stemming differs from that of
>> BaseX: e.g., "mouse" and "mice" both stem to "mouse."
>>
>> In BaseX, those words are stemmed separately. Is this a known limitation
>> of the internal English syntax parser?
>>
> It's my (admittedly, *VERY*) limited understanding that the BaseX
> stemmer, at least for English, is limited to the Porter Stemmer[1]. The
> Porter Stemmer only strips suffixes, so it doesn't reduce apophonic
> (vowel-changing) plurals such as "mice" to their singulars.
>
> It'd be interesting to learn what stemmer(s) MarkLogic uses.
>
> And, while I'm not that familiar with it (and it would probably entail
> significant work to implement), the `ft:thesaurus()` function provides
> similar functionality:
> ```
> ft:thesaurus(
>   <thesaurus>
>     <entry>
>       <term>mice</term>
>       <synonym>
>         <term>mouse</term>
>         <relationship>NT</relationship>
>       </synonym>
>       <synonym>
>         <term>rodent</term>
>         <relationship>BTG</relationship>
>       </synonym>
>     </entry>
>   </thesaurus>,
>   'mice'
> )
> ```
>
>
>> Example:
>>
>> db:create("stem-test",
>>   <data>
>>     <x>mouse</x>
>>     <y>mice</y>
>>   </data>
>>   , "data", map {"ftindex": true(), "stemming": true(), "language": "en"}
>> )
>> ,
>> update:output(
>>   ft:search("stem-test", "mice")
>> )
>>
>>
>> Thanks,
>> Tim
>>
>>
>>
> Best,
> Bridger
>
> [1]
> https://github.com/BaseXdb/basex/blob/da1e55d0214e44c1532f121c282021db50a9aa51/basex-core/src/main/java/org/basex/util/ft/EnglishStemmer.java
>
>
>> --
>> Tim A. Thompson (he, him)
>> Librarian for Applied Metadata Research
>> Yale University Library
>>
>>
