David,
These results are not surprising to me.
Old English is simply not a supported language, and it
shares little with (modern) English.
From the server's point of view, handling Old English is
no different from handling Klingon. Stemming requires
language-specific knowledge, so it doesn't work with
Old English, but wildcard matching is independent of the
language, so it works, I think.
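
For anyone who wants to reproduce the three cases side by side, here is a
minimal sketch. It assumes the queries are checked against an in-memory
copy of David's sample with cts:contains rather than against a database,
so the results may differ from an indexed search depending on database
settings. The variable names are only illustrative.

```xquery
xquery version "1.0-ml";

(: In-memory copy of David's sample; nothing is read from a database. :)
let $sample :=
  <foreign xml:lang="ang">Gif hwa ȝefeohte on Cyninȝes huse...</foreign>

(: The three queries from David's message, in the same order. :)
let $queries := (
  cts:word-query('Cyninȝ',   ('lang=ang', 'stemmed')),
  cts:word-query('Cyninȝes', ('lang=ang', 'exact')),
  cts:word-query('cyninȝ*',  ('lang=en', 'wildcarded', 'case-insensitive'))
)

(: Report, per case, whether the sample matches the query. :)
for $q at $i in $queries
return fn:concat('case ', $i, ': ', cts:contains($sample, $q))
```

Each line of output reports whether the sample matched the corresponding
query, which makes it easy to compare against the results David describes.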
--------
Basis Technology Corporation, San Francisco
T. "Kuro" Kurosaka
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> David Sewell
> Sent: Wednesday, March 18, 2009 8:39 AM
> To: General XQZone Discussion
> Cc: Markus Flatscher
> Subject: [MarkLogic Dev General] MarkLogic language modules
> and pre-modern forms (attn: Mary Holstege)
>
> [This is a followup to my question the other day about using
> 2-alpha vs. 3-alpha codes for @xml:lang when using language options in
> MarkLogic Server. As Mary Holstege is the language guru at ML
> I hope she'll weigh in on the following, but I thought I'd
> share it with the list as it has to do with some interesting
> capabilities of ML Server's natural language handling.]
>
> We've started marking up some historic forms of languages,
> stuff like middle French (xml:lang="frm") and Old English
> (xml:lang="ang"). Based on the test queries I've just run, it
> appears that the following behavior is the case for MarkLogic
> Server's natural language handling.
>
> Assume as underlying data the following snippet of Old English:
>
> <foreign xml:lang="ang">Gif hwa ȝefeohte on Cyninȝes huse...</foreign>
>
> then
>
> 1. Stemmed searches on pre-modern language forms return null results.
>
> cts:word-query(
> 'Cyninȝ',
> ('lang=ang', 'stemmed')
> )
>
> ==> empty (possessive "Cyninȝes" not recognized as a stemmed form)
>
> 2. Non-stemmed searches on pre-modern forms return desired results:
>
> cts:word-query(
> 'Cyninȝes',
> ('lang=ang', 'exact')
> )
>
> ==> matches the data
>
> 3. Non-stemmed search on modern language includes pre-modern
> forms, e.g.
>
> cts:word-query(
> 'cyninȝ*',
> ('lang=en', 'wildcarded', 'case-insensitive')
> )
>
> ==> matches the data (ignores case, wildcards)
>
> Result #3 was a pleasant surprise as it means that people can
> do a wildcard search and still retrieve results from any
> historical period of a language.
>
> For English, Old English (ang) and Middle English (enm) are
> supported; for French, Old French (fro) and Middle French (frm).
>
> DS
> --
> David Sewell, Editorial and Technical Manager
> ROTUNDA, The University of Virginia Press
> PO Box 801079, Charlottesville, VA 22904-4318 USA
> Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
> Email: [email protected] Tel: +1 434 924 9973
> Web: http://rotunda.upress.virginia.edu/