Re: [MarkLogic Dev General] Word Boundaries in Chinese?

Mary Holstege Wed, 07 May 2008 15:47:18 -0700

On Wed, 07 May 2008 14:31:16 -0700, Marc Moskowitz <[EMAIL PROTECTED]> wrote:

I'm seeing some odd behavior when searching for text in Chinese. It seems that the server is making decisions about word boundaries based on some internal criteria.
This XQuery:
let $q := '意',
    $doc := (
      <yo>好意思</yo>,
      <yo>意料</yo>,
      <yo>好意</yo>,
      <yo>词不达達意</yo>)
for $d in $doc
let $h := cts:highlight($d, $q, <hey>{$cts:text}</hey>)
return (count($h//hey), $h)

produces this result:

0
<yo>好意思</yo>
1
<yo><hey>意</hey>料</yo>
0
<yo>好意</yo>
1
<yo>词不达達<hey>意</hey></yo>
Is there some way of affecting where these boundaries are placed? Or of turning this functionality fully on or off?
-Marc


The reason you're seeing this is that the rules of Chinese tokenization
say that 意 is part of a longer token/word in the first and third <yo>
elements, so a word query for 意 alone does not match there.
It is analogous to searching for "black" and expecting a hit on "blackbird".
If you use a license that has no Chinese support, then the
non-language-aware tokenization kicks in and every character is
treated as a distinct word.
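
As a rough, language-neutral illustration of the point above (not MarkLogic's tokenizer), the Python sketch below counts hits by comparing the query against whole tokens. The token lists in LANGUAGE_AWARE_TOKENS are assumptions chosen only to mirror the results reported in the question, and the names per_character_tokens and count_word_hits are hypothetical helpers.

```python
# Illustrative only: these token lists are assumptions picked to mirror the
# results in the question. They are NOT output from MarkLogic's tokenizer.
# The one detail taken from the thread is whether 意 stands alone or sits
# inside a longer token; how the rest of each string splits is a guess.
LANGUAGE_AWARE_TOKENS = {
    "好意思": ["好意思"],            # 意 is inside a longer token
    "意料": ["意", "料"],            # 意 is its own token
    "好意": ["好意"],                # 意 is inside a longer token
    "词不达達意": ["词不达達", "意"],  # 意 is its own token
}


def per_character_tokens(text):
    """Fallback tokenization: every character is treated as a distinct word."""
    return list(text)


def count_word_hits(tokens, query):
    """Count tokens equal to the query; a substring of a longer token is no hit."""
    return sum(1 for token in tokens if token == query)


query = "意"
for text, tokens in LANGUAGE_AWARE_TOKENS.items():
    aware_hits = count_word_hits(tokens, query)
    fallback_hits = count_word_hits(per_character_tokens(text), query)
    print(text, "language-aware:", aware_hits, "per-character:", fallback_hits)
```

With whole-token matching, the query only hits where 意 is a token by itself; with per-character tokens, every occurrence becomes a hit.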

//Mary
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
