RE: [MarkLogic Dev General] MarkLogic language modules and pre-modern forms (attn: Mary Holstege)

Teruhiko Kurosaka Tue, 24 Mar 2009 10:17:38 -0700

David,
These results are not surprising to me.
Old English is simply an unsupported language, and it
shares little with (modern) English.
From the server's point of view, handling Old English is
no different from handling Klingon. Stemming requires
language-specific knowledge, so it doesn't work with
Old English, but wildcard matching is independent of the
language, so it works, I think.
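
To make the distinction concrete, here is a toy sketch in Python. It is not how MarkLogic implements either feature. The suffix table and the function names are made up for illustration, and a real stemmer relies on much richer per-language data. It only shows why a stemmer with no rules for "ang" cannot relate "Cyninȝ" to "Cyninȝes", while a wildcard pattern needs no language knowledge at all.

```python
from fnmatch import fnmatchcase

# Hypothetical per-language suffix rules (illustration only).
# There is deliberately no entry for "ang" (Old English).
SUFFIX_RULES = {
    "en": ("'s", "es", "s"),
}


def stem(word, lang):
    """Strip the first matching suffix if rules exist for lang; else return word unchanged."""
    for suffix in SUFFIX_RULES.get(lang, ()):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stemmed_match(query, token, lang):
    """Two words match if they reduce to the same stem under the language's rules."""
    return stem(query.lower(), lang) == stem(token.lower(), lang)


def wildcard_match(pattern, token):
    """Case-insensitive glob match; no language knowledge is involved."""
    return fnmatchcase(token.lower(), pattern.lower())


tokens = "Gif hwa ȝefeohte on Cyninȝes huse".split()

print(stemmed_match("king", "kings", "en"))          # rules exist for "en"
print(stemmed_match("Cyninȝ", "Cyninȝes", "ang"))    # no rules for "ang"
print([t for t in tokens if wildcard_match("cyninȝ*", t)])
```

With no rules for a language, stemming degrades to exact comparison, which is consistent with David's results 1 and 2; the wildcard path never consults the language, which is consistent with result 3.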


--------
Basis Technology Corporation, San Francisco
T. "Kuro" Kurosaka 
  

> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of 
> David Sewell
> Sent: Wednesday, March 18, 2009 8:39 AM
> To: General XQZone Discussion
> Cc: Markus Flatscher
> Subject: [MarkLogic Dev General] MarkLogic language modules 
> and pre-modern forms (attn: Mary Holstege)
> 
> [This is a followup to my question the other day about using 
> 2-alpha vs.
> 3-alpha codes for @xml:lang when using language options in 
> MarkLogic Server. As Mary Holstege is the language guru at ML 
> I hope she'll weigh in on the following, but I thought I'd 
> share it with the list as it has to do with some interesting 
> capabilities of ML Server's natural language handling.]
> 
> We've started marking up some historic forms of languages, 
> such as Middle French (xml:lang="frm") and Old English 
> (xml:lang="ang"). Based on the test queries I've just run, 
> MarkLogic Server's natural language handling appears to 
> behave as follows.
> 
> Assume as underlying data the following snippet of Old English:
> 
> <foreign xml:lang="ang">Gif hwa ȝefeohte on Cyninȝes huse...</foreign>
> 
> then
> 
> 1. Stemmed searches on pre-modern language forms return null results.
> 
>   cts:word-query(
>     'Cyninȝ',
>     ('lang=ang', 'stemmed')
>   )
> 
>   ==> empty (possessive "Cyninȝes" not recognized as a stemmed form)
> 
> 2. Non-stemmed searches on pre-modern forms return desired results:
> 
>   cts:word-query(
>     'Cyninȝes',
>     ('lang=ang', 'exact')
>   )
> 
>   ==> matches the data
> 
> 3. A non-stemmed search using the modern language code also 
> matches pre-modern forms, e.g.
> 
>   cts:word-query(
>     'cyninȝ*',
>     ('lang=en', 'wildcarded', 'case-insensitive')
>   )
> 
>   ==> matches the data (ignores case, wildcards)
> 
> Result #3 was a pleasant surprise as it means that people can 
> do a wildcard search and still retrieve results from any 
> historical period of a language.
> 
> For English, Old English (ang) and Middle English (enm) are 
> supported; for French, Old French (fro) and Middle French (frm).
> 
> DS
> --
> David Sewell, Editorial and Technical Manager ROTUNDA, The 
> University of Virginia Press PO Box 801079, Charlottesville, 
> VA 22904-4318 USA
> Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
> Email: [email protected]   Tel: +1 434 924 9973
> Web: http://rotunda.upress.virginia.edu/

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
