Segmenting CJK text lexically would be hard. See: http://www2.computer.org/portal/web/csdl/doi/10.1109/ETCS.2009.240
However, a simple word segmentation can help: if a UTF-8 character falls into the CJK code range, your indexer segments the sentence there, character by character.

-----Original Message-----
From: Tibor Simko [mailto:[email protected]]
Sent: Tuesday, June 09, 2009 8:27 PM
To: KU Kam-ming
Cc: project-cdsware-users (CDSware users list.)
Subject: Re: invenio indexes CJK ?

On Sat, 06 Jun 2009, KU Kam-ming wrote:
> Could Invenio index Chinese (multi-byte) characters? It seems that
> bibindex will segment a string by space, however, this is not
> applicable for CJK.

There is no problem in having multi-byte UTF-8 characters in Invenio; but indeed, as you said, Invenio's default word breaking procedures are not CJK friendly. The phrase search or the regexp search would be the only usable matching options left in such a setup.

That said, it is possible to customize the word breaking procedures in Invenio's workflow. Can you suggest us some nicely working CJK savvy library that would split phrases into words? Preferably in Python or C?

Best regards
--
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>
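
For illustration, the simple segmentation suggested at the top of this message could look roughly like the Python sketch below. It is only a sketch under stated assumptions, not Invenio code: the names is_cjk and simple_cjk_tokenize are hypothetical, the list of code point ranges is a small, non-exhaustive selection, and treating every CJK character as its own token is an assumption about what "simple" segmentation means here.

```python
# Assumed, non-exhaustive selection of CJK code point ranges.
CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(ch):
    """Return True if the character falls into one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def simple_cjk_tokenize(text):
    """Split on whitespace, but emit every CJK character as its own token."""
    tokens = []
    buf = []  # characters of the non-CJK word currently being collected

    def flush():
        if buf:
            tokens.append("".join(buf))
            del buf[:]

    for ch in text:
        if is_cjk(ch):
            flush()            # end any pending non-CJK word
            tokens.append(ch)  # each CJK character becomes one token
        elif ch.isspace():
            flush()            # whitespace ends a non-CJK word
        else:
            buf.append(ch)
    flush()
    return tokens


if __name__ == "__main__":
    print(simple_cjk_tokenize("Invenio 索引 test"))
```

Indexing single characters keeps phrase search usable, because a multi-character word can still be matched as a sequence of adjacent tokens. It does not find real word boundaries, and punctuation is left attached to neighbouring non-CJK text, so a dictionary-based segmenter would still give better results.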
