[task #10089] BibIndex: add support for CJK languages

noreply [Tibor Simko] Tue, 14 Sep 2010 16:38:36 +0200

This is an automated notification sent by LCG Savannah.
It relates to:
                task #10089, project CDS Invenio


==============================================================================
 LATEST MODIFICATIONS of task #10089:
==============================================================================

Update of task #10089 (project cdsware):

                  Status:                    None => Duplicate              
        Percent Complete:                      0% => 100%                   
             Open/Closed:                    Open => Closed                 

    _______________________________________________________

Follow-up Comment #1:

Moved to http://invenio-software.org/ticket/285

==============================================================================
 OVERVIEW of task #10089:
==============================================================================

URL:
  <http://savannah.cern.ch/task/?10089>

                 Summary: BibIndex: add support for CJK languages
                 Project: CDS Invenio
            Submitted by: simko
            Submitted on: 2009-06-12 14:21
         Should Start On: 2009-06-12 00:00
   Should be Finished on: 2009-06-12 00:00
                Category: BibIndex
                Priority: 5 - Normal
                  Status: Duplicate
                 Privacy: Public
        Percent Complete: 100%
             Assigned to: None
             Open/Closed: Closed
         Discussion Lock: Any
                  Effort: 0.00

    _______________________________________________________


BibIndex's phrase segmenters (get_words_from_phrase() and friends)
should be made more CJK friendly when a new config variable, named
something like CFG_BIBINDEX_CJK_SUPPORT, is set to 1.

(Later on, this behaviour could be configured per index, or even per
MARC field, in case there are records containing many languages.
Well, it does not hurt to do CJK recognition for all the fields all
the time, by default -- but it would slow down the indexer a bit due
to the CJK Unicode zone check for every character.  So we have an
interest in having some CFG variable for this anyway.)
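
A minimal sketch of how such a switch might gate the per-character
check.  Only the name CFG_BIBINDEX_CJK_SUPPORT comes from this task;
is_cjk(), needs_cjk_segmentation() and the particular Unicode blocks
listed are assumptions made for illustration, not existing BibIndex
code.

```python
# Hypothetical stand-in for the proposed config variable.
CFG_BIBINDEX_CJK_SUPPORT = 1


def is_cjk(char):
    """Return True if char lies in one of a few common CJK Unicode blocks.

    The block list is an assumption and deliberately incomplete.
    """
    code = ord(char)
    return (0x4E00 <= code <= 0x9FFF       # CJK Unified Ideographs
            or 0x3400 <= code <= 0x4DBF    # CJK Unified Ideographs Extension A
            or 0x3040 <= code <= 0x30FF    # Hiragana and Katakana
            or 0xAC00 <= code <= 0xD7AF)   # Hangul Syllables


def needs_cjk_segmentation(phrase, cjk_support=CFG_BIBINDEX_CJK_SUPPORT):
    """Run the per-character zone check only when the switch is on."""
    if not cjk_support:
        # Switch off: the indexer pays no per-character cost at all.
        return False
    return any(is_cjk(char) for char in phrase)
```

With the switch set to 0 the function returns immediately, which is
the cheap path the paragraph above is concerned with.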

What has to be done: the usual get_words_from_xxx() functions return
blocks that can usually be treated as words, but for CJK languages we
need to break them down further.  When we see an input string ABC
where A, B, and C are characters from the CJK zone, then we should
index A, B, and C separately, as if they were standalone words.  Then,
on the retrieval side, we should break the user query in the same way,
and use the boolean `and' to find the matching records.  This will
improve the typical CJK search accuracy a lot.
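
The steps above could be sketched roughly as follows.  This is an
illustration under stated assumptions, not the BibIndex code: the
helper names (is_cjk, split_cjk, build_index, search), the in-memory
dict index and the whitespace split standing in for
get_words_from_xxx() are all hypothetical, and only one Unicode block
is checked.

```python
def is_cjk(char):
    """Assumed zone check: CJK Unified Ideographs only, for brevity."""
    return 0x4E00 <= ord(char) <= 0x9FFF


def split_cjk(word):
    """Break one word block down further.

    Each CJK character becomes a standalone token; runs of non-CJK
    characters are kept together as before.
    """
    tokens, run = [], []
    for char in word:
        if is_cjk(char):
            if run:
                tokens.append("".join(run))
                run = []
            tokens.append(char)
        else:
            run.append(char)
    if run:
        tokens.append("".join(run))
    return tokens


def tokenize(text):
    """Whitespace split (stand-in for get_words_from_xxx) plus CJK split."""
    return [token for word in text.split() for token in split_cjk(word)]


def build_index(records):
    """Map every token to the set of record IDs that contain it."""
    index = {}
    for recid, text in records.items():
        for token in tokenize(text):
            index.setdefault(token, set()).add(recid)
    return index


def search(index, query):
    """Break the query the same way and combine the tokens with AND."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits


records = {1: "粒子物理 physics", 2: "物理学", 3: "particle physics"}
index = build_index(records)
print(search(index, "物理"))  # records 1 and 2
```

Because the same split is applied at indexing and at query time, the
query 物理 matches both record 1 and record 2 even though neither
contains it as a whitespace-delimited word.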

(Later on, we may need to pay closer attention to `word' positions for
this to work really well.)
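
To illustrate why positions may matter: with plain boolean `and', a
query AB also matches a record that contains only BA.  The sketch
below is hypothetical (none of these names exist in BibIndex) and
assumes the text is already pure CJK, so that every character is one
token.

```python
def build_positional_index(records):
    """Map each character to {recid: [positions where it occurs]}."""
    index = {}
    for recid, text in records.items():
        for pos, char in enumerate(text):
            index.setdefault(char, {}).setdefault(recid, []).append(pos)
    return index


def and_search(index, query):
    """Records containing every query character, in any order."""
    sets = [set(index.get(char, {})) for char in query]
    return set.intersection(*sets) if sets else set()


def adjacent_search(index, query):
    """Records where the query characters occur consecutively, in order."""
    hits = set()
    for recid in and_search(index, query):
        for start in index[query[0]][recid]:
            if all(start + offset in index[char][recid]
                   for offset, char in enumerate(query)):
                hits.add(recid)
                break
    return hits


records = {1: "物理", 2: "理物"}
index = build_positional_index(records)
print(and_search(index, "物理"))       # both records match
print(adjacent_search(index, "物理"))  # only record 1 matches
```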

Note that this seems to be what mnoGoSearch's CJK phrase segmenter
does.  <http://www.mnogosearch.org/doc33/msearch-cjk.html>


    _______________________________________________________

Follow-up Comments:


-------------------------------------------------------
Date: 2010-09-14 14:38              By: Tibor Simko <simko>
Moved to http://invenio-software.org/ticket/285





    _______________________________________________________

Carbon-Copy List:

CC Address                          | Comment
------------------------------------+-----------------------------
1576                                | -SUB-




==============================================================================

This item URL is:
  <http://savannah.cern.ch/task/?10089>

_______________________________________________
  Message sent via/by LCG Savannah
  http://savannah.cern.ch/
