On Wed, 07 May 2008 14:31:16 -0700, Marc Moskowitz
<[EMAIL PROTECTED]> wrote:
I'm seeing some odd behavior when searching for text in Chinese. It
seems that the server is making decisions about word boundaries based on
some internal criteria.
This XQuery:
let $q := '意',
$doc := (
<yo>好意思</yo>,
<yo>意料</yo>,
<yo>好意</yo>,
<yo>词不达達意</yo>)
for $d in $doc
let $h := cts:highlight($d, $q, <hey>{$cts:text}</hey>)
return (count($h//hey), $h)
produces this result:
0
<yo>好意思</yo>
1
<yo><hey>意</hey>料</yo>
0
<yo>好意</yo>
1
<yo>词不达達<hey>意</hey></yo>
Is there some way of affecting where these boundaries are placed? Or of
turning this functionality fully on or off?
-Marc
You're seeing this because the rules of Chinese tokenization treat 意
as part of a longer token (word) in the first and third <yo> elements,
so a search for 意 on its own does not match there.
It is analogous to searching for "black" and expecting a hit on
"blackbird".
If you use a license that has no Chinese support, the
non-language-aware tokenization kicks in instead and every character
is treated as a distinct word.
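
One way to check where the token boundaries fall is to ask the server
for the tokens directly. The following is only a minimal sketch: it
assumes that cts:tokenize is available in your server version and that
"zh" is the language code that selects Chinese tokenization. The
<tokens> and <word> element names are made up for illustration.

```xquery
(: Sketch: list the word tokens produced for each sample string.
   Assumption: cts:tokenize accepts a language code as its second
   argument, and "zh" selects Chinese. :)
for $s in ("好意思", "意料", "好意", "词不达達意")
return
  <tokens text="{$s}">{
    for $t in cts:tokenize($s, "zh")
    (: keep only word tokens; skip punctuation and whitespace :)
    where $t instance of cts:word
    return <word>{string($t)}</word>
  }</tokens>
```

If 意 does not come back as a <word> of its own for a given string,
a word search for 意 would not be expected to match that string.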
//Mary