Hmm, to follow up on my own query: it seems that in case #3 below, a
non-stemmed search matches no matter what language is specified in the
query. So the match is not an artefact of grouping different historical
forms of a language together. Anyway, it's still good to know that
granular language tagging can co-exist with various search scenarios.
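
For anyone who wants to check this themselves, here is a minimal sketch of
the kind of test I mean. It assumes cts:contains against an in-memory node
behaves like a search against stored content, which I have not verified;
the list of language codes is just an example.

```xquery
xquery version "1.0-ml";

(: Same Old English snippet as in the quoted message below. :)
let $data :=
  <foreign xml:lang="ang">Gif hwa ȝefeohte on Cyninȝes huse...</foreign>

(: Run the same unstemmed query under several query languages.
   "fr" is included only as an unrelated language for comparison. :)
for $lang in ("ang", "en", "fr")
return
  fn:concat(
    $lang, ": ",
    xs:string(
      cts:contains(
        $data,
        cts:word-query("Cyninȝes", (fn:concat("lang=", $lang), "exact"))
      )
    )
  )
```

If the observation above holds, every line of output should report a
match, regardless of the language code.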

David

On Wed, 18 Mar 2009, David Sewell wrote:

> [This is a follow-up to my question the other day about using 2-alpha vs.
> 3-alpha codes for @xml:lang when using language options in MarkLogic
> Server. As Mary Holstege is the language guru at ML I hope she'll weigh
> in on the following, but I thought I'd share it with the list as it has
> to do with some interesting capabilities of ML Server's natural language
> handling.]
>
> We've started marking up some historic forms of languages, such as
> Middle French (xml:lang="frm") and Old English (xml:lang="ang"). Based
> on the test queries I've just run, MarkLogic Server's natural language
> handling appears to behave as follows.
>
> Assume as underlying data the following snippet of Old English:
>
> <foreign xml:lang="ang">Gif hwa ȝefeohte on Cyninȝes huse...</foreign>
>
> then
>
> 1. Stemmed searches on pre-modern language forms return no results:
>
>   cts:word-query(
>     'Cyninȝ',
>     ('lang=ang', 'stemmed')
>   )
>
>   ==> empty (possessive "Cyninȝes" not recognized as a stemmed form)
>
> 2. Non-stemmed searches on pre-modern forms return desired results:
>
>   cts:word-query(
>     'Cyninȝes',
>     ('lang=ang', 'exact')
>   )
>
>   ==> matches the data
>
> 3. Non-stemmed search on modern language includes pre-modern forms, e.g.
>
>   cts:word-query(
>     'cyninȝ*',
>     ('lang=en', 'wildcarded', 'case-insensitive')
>   )
>
>   ==> matches the data (case-insensitive, with the wildcard expanded)
>
> Result #3 was a pleasant surprise as it means that people can do a
> wildcard search and still retrieve results from any historical period of a
> language.
>
> For English, Old English (ang) and Middle English (enm) are supported;
> for French, Old French (fro) and Middle French (frm).
>
> DS
>

-- 
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 801079, Charlottesville, VA 22904-4318 USA
Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
Email: [email protected]   Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
