Segmenting CJK text lexically would be hard.  See:
http://www2.computer.org/portal/web/csdl/doi/10.1109/ETCS.2009.240

However, a simple word segmentation can help: if a UTF-8 character
falls into the CJK code range, the indexer treats that character as a
word of its own, so a CJK sentence is split character by character.
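
To make the idea concrete, here is a minimal Python sketch of such a
tokenizer.  It is not Invenio code and does not use any bibindex API:
the function name segment() and the exact code point ranges are my own
assumptions (CJK Unified Ideographs plus Extension A, Hiragana,
Katakana and Hangul syllables).  Non-CJK text is still split on
whitespace only.

```python
# -*- coding: utf-8 -*-
import re

# Assumed CJK code point ranges (hypothetical choice, adjust as needed):
#   U+4E00-U+9FFF  CJK Unified Ideographs
#   U+3400-U+4DBF  CJK Unified Ideographs Extension A
#   U+3040-U+30FF  Hiragana and Katakana
#   U+AC00-U+D7AF  Hangul syllables
_CJK = u"\u4e00-\u9fff\u3400-\u4dbf\u3040-\u30ff\uac00-\ud7af"

# A token is either one CJK character, or a run of characters that are
# neither whitespace nor CJK.
_TOKEN_RE = re.compile(u"[%s]|[^\\s%s]+" % (_CJK, _CJK))


def segment(text):
    """Split text on whitespace, and emit every CJK character as its own token."""
    return _TOKEN_RE.findall(text)


if __name__ == "__main__":
    for token in segment(u"Invenio 索引中文 text"):
        print(token)
```

Punctuation, including CJK punctuation outside the ranges above, stays
attached to the neighbouring non-CJK run, so a real indexer would
probably want to strip it in a separate step.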



-----Original Message-----
From: Tibor Simko [mailto:[email protected]] 
Sent: Tuesday, June 09, 2009 8:27 PM
To: KU Kam-ming
Cc: project-cdsware-users (CDSware users list.)
Subject: Re: invenio indexes CJK ?

On Sat, 06 Jun 2009, KU Kam-ming wrote:
> Could Invenio index Chinese (multi-byte) characters?  It seems that
> bibindex will segment a string by space, however, this is not
> applicable for CJK.

There is no problem in having multi-byte UTF-8 characters in Invenio;
but indeed, as you said, Invenio's default word breaking procedures are
not CJK friendly.  The phrase search or the regexp search would be the
only usable matching options left in such a setup.

That said, it is possible to customize the word breaking procedures in
Invenio's workflow.  Can you suggest a nicely working, CJK-savvy
library that would split phrases into words?  Preferably in Python or C?

Best regards
-- 
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>
