[This is a follow-up to my question the other day about using 2-alpha vs.
3-alpha codes for @xml:lang when using language options in MarkLogic
Server. As Mary Holstege is the language guru at ML I hope she'll weigh
in on the following, but I thought I'd share it with the list as it has
to do with some interesting capabilities of ML Server's natural language
handling.]
We've started marking up some historical forms of languages, such as
Middle French (xml:lang="frm") and Old English (xml:lang="ang"). Based
on the test queries I've just run, MarkLogic Server's natural language
handling appears to behave as follows.
Assume as underlying data the following snippet of Old English:
<foreign xml:lang="ang">Gif hwa ȝefeohte on Cyninȝes huse...</foreign>
then
1. Stemmed searches on pre-modern language forms return no results:
   cts:word-query(
     'Cyninȝ',
     ('lang=ang', 'stemmed')
   )
   ==> empty (possessive "Cyninȝes" not recognized as a stemmed form)
2. Non-stemmed searches on pre-modern forms return desired results:
   cts:word-query(
     'Cyninȝes',
     ('lang=ang', 'exact')
   )
==> matches the data
3. A non-stemmed search using the modern language code also matches
   pre-modern forms, e.g.
   cts:word-query(
     'cyninȝ*',
     ('lang=en', 'wildcarded', 'case-insensitive')
   )
   ==> matches the data (case-insensitive, with the wildcard expanded)
Result #3 was a pleasant surprise as it means that people can do a
wildcard search and still retrieve results from any historical period of a
language.
For English, Old English (ang) and Middle English (enm) are supported;
for French, Old French (fro) and Middle French (frm).
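
For anyone who wants to reproduce the three tests, here is a minimal
sketch of how the queries above could be run in one go. It assumes the
Old English snippet has already been loaded into the current database;
searching over fn:doc() and the formatting of the report line are my own
illustrative choices, not part of the original tests. Counts will depend
on your data and index settings.

```xquery
xquery version "1.0-ml";

(: Illustrative harness (assumption): run each of the three word queries
   from this post against every document in the current database and
   report how many fragments match. :)
let $queries := (
  cts:word-query('Cyninȝ',   ('lang=ang', 'stemmed')),
  cts:word-query('Cyninȝes', ('lang=ang', 'exact')),
  cts:word-query('cyninȝ*',  ('lang=en', 'wildcarded', 'case-insensitive'))
)
for $q at $i in $queries
return
  fn:concat('query ', $i, ': ',
            fn:count(cts:search(fn:doc(), $q)), ' match(es)')
```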
DS
--
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 801079, Charlottesville, VA 22904-4318 USA
Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
Email: [email protected] Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/