#285: BibIndex: add support for CJK languages
-------------------------+--------------------------------------------------
 Reporter:  simko        |       Owner:     
     Type:  enhancement  |      Status:  new
 Priority:  minor        |   Milestone:     
Component:  BibIndex     |     Version:     
 Keywords:               |  
-------------------------+--------------------------------------------------
 BibIndex's phrase segmenters (get_words_from_phrase() and friends)
 should be made more CJK-friendly when a new config variable, named
 something like CFG_BIBINDEX_CJK_SUPPORT, is set to 1.

 (Later on, this behaviour could be configured per index, or even per
 MARC field, in case there are records containing many languages.
 It would not hurt to do CJK recognition for all the fields all the
 time, by default, but it would slow down the indexer a bit due to
 the CJK Unicode zone check for every character. So we have an
 interest in having some CFG variable for this anyway.)

 What has to be done: the usual get_words_from_xxx() functions return
 blocks that can usually be treated as words, but for CJK languages we
 need to break them down further. When we see an input string ABC where
 A, B, and C are characters from the CJK zone, we should index A, B,
 and C separately, as if they were standalone words. Then, on the
 retrieval side, we should break the user query in the same way and
 combine the resulting terms with the boolean `and' to find the
 matching records. This should improve typical CJK search accuracy a
 lot.
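
 The splitting described above could be sketched roughly as follows.
 This is only an illustration, not BibIndex code: the names is_cjk(),
 split_cjk(), get_words_cjk() and matches() are hypothetical,
 phrase.split() stands in for the real get_words_from_phrase(), and
 the list of Unicode ranges is an assumed, non-exhaustive notion of
 the "CJK zone".

```python
# Hypothetical flag, named after the suggestion in this ticket.
CFG_BIBINDEX_CJK_SUPPORT = 1

# Assumed "CJK zone": a few common Unicode blocks, not an exhaustive list.
_CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)


def is_cjk(char):
    """Return True if the character falls in one of the assumed CJK ranges."""
    code = ord(char)
    return any(lo <= code <= hi for lo, hi in _CJK_RANGES)


def split_cjk(word):
    """Split one word block: each CJK character becomes its own term,
    while runs of non-CJK characters are kept together."""
    terms = []
    run = []
    for char in word:
        if is_cjk(char):
            if run:
                terms.append("".join(run))
                run = []
            terms.append(char)
        else:
            run.append(char)
    if run:
        terms.append("".join(run))
    return terms


def get_words_cjk(phrase):
    """Stand-in for get_words_from_phrase() with optional CJK splitting."""
    words = phrase.split()  # placeholder for the real segmenter
    if not CFG_BIBINDEX_CJK_SUPPORT:
        return words
    result = []
    for word in words:
        result.extend(split_cjk(word))
    return result


def matches(query, record_terms):
    """Boolean `and': every term of the split query must be indexed."""
    return all(term in record_terms for term in get_words_cjk(query))
```

 The same get_words_cjk() is applied to both the record text and the
 user query, so that both sides produce identical terms.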

 (Later on, we may need to pay closer attention to `word' positions for
 this to work really well.)
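
 One possible way to use positions, sketched under the assumption
 that the index stores, for each term, the list of offsets at which
 it occurs. The helpers build_positions() and phrase_match() are
 hypothetical and only illustrate the idea of requiring the query
 characters to appear consecutively rather than anywhere in the
 record.

```python
def build_positions(terms):
    """Map each term to the list of positions at which it occurs."""
    positions = {}
    for pos, term in enumerate(terms):
        positions.setdefault(term, []).append(pos)
    return positions


def phrase_match(query_terms, positions):
    """Return True if the query terms occur at consecutive positions."""
    if not query_terms:
        return False
    for start in positions.get(query_terms[0], []):
        if all(start + offset in positions.get(term, [])
               for offset, term in enumerate(query_terms)):
            return True
    return False
```

 With a plain boolean `and', a query for two characters would also
 match records where they appear in the opposite order or far apart;
 a positional check of this kind would rule those out.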

 Note that this seems to be what mnoGoSearch's CJK phrase segmenter
 does. [[http://www.mnogosearch.org/doc33/msearch-cjk.html]]

-- 
Ticket URL: <http://invenio-software.org/ticket/285>
Invenio <http://invenio-software.org>
