This is an automated notification sent by LCG Savannah.
It relates to:
task #10089, project CDS Invenio
==============================================================================
LATEST MODIFICATIONS of task #10089:
==============================================================================
Update of task #10089 (project cdsware):
Status: None => Duplicate
Percent Complete: 0% => 100%
Open/Closed: Open => Closed
_______________________________________________________
Follow-up Comment #1:
Moved to http://invenio-software.org/ticket/285
==============================================================================
OVERVIEW of task #10089:
==============================================================================
URL: <http://savannah.cern.ch/task/?10089>
Summary: BibIndex: add support for CJK languages
Project: CDS Invenio
Submitted by: simko
Submitted on: 2009-06-12 14:21
Should Start On: 2009-06-12 00:00
Should be Finished on: 2009-06-12 00:00
Category: BibIndex
Priority: 5 - Normal
Status: Duplicate
Privacy: Public
Percent Complete: 100%
Assigned to: None
Open/Closed: Closed
Discussion Lock: Any
Effort: 0.00
_______________________________________________________
BibIndex's phrase segmenters (get_words_from_phrase() and friends)
should be made more CJK-friendly when a new config variable, named
something like CFG_BIBINDEX_CJK_SUPPORT, is set to 1.
(Later on, this behaviour could be configured per index, or even per
MARC field, in case there are records containing many languages.
It would not hurt to do CJK recognition for all the fields all the
time, by default, but it would slow down the indexer a bit due to the
CJK Unicode zone check for every character. So we have an interest
in having some CFG variable for this anyway.)
What has to be done: the usual get_words_from_xxx() functions return
blocks that can usually be treated as words, but for CJK languages we
need to break them down further. When we see an input string ABC where
A, B, and C are characters from the CJK zone, then we should index A,
B, and C separately, as if they were standalone words. Then, on the
retrieval side, we should break the user query in the same way, and
use the boolean `and' to find the matching records. This should
improve the typical CJK search accuracy a lot.
(Later on, we may need to pay closer attention to `word' positions for
this to work really well.)
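A minimal sketch of the proposed segmentation, for illustration only.
It is not BibIndex code: the helper names (is_cjk, split_cjk,
get_words_from_phrase_sketch, build_query), the whitespace-based word
splitting, the list of Unicode ranges taken as the "CJK zone", and the
textual "A and B" query form are all assumptions made for this example.

```python
# Hypothetical flag mirroring the proposed CFG_BIBINDEX_CJK_SUPPORT variable.
CFG_BIBINDEX_CJK_SUPPORT = 1

# Assumed "CJK zone": a few common Unicode blocks; a real list may be longer.
_CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)


def is_cjk(char):
    """Return True if the character falls in one of the assumed CJK ranges."""
    code = ord(char)
    return any(low <= code <= high for low, high in _CJK_RANGES)


def split_cjk(word):
    """Emit each CJK character as its own token; keep non-CJK runs whole."""
    tokens, run = [], []
    for char in word:
        if is_cjk(char):
            if run:
                tokens.append("".join(run))
                run = []
            tokens.append(char)
        else:
            run.append(char)
    if run:
        tokens.append("".join(run))
    return tokens


def get_words_from_phrase_sketch(phrase):
    """Stand-in for a phrase segmenter: split on whitespace, then on CJK."""
    words = phrase.split()
    if not CFG_BIBINDEX_CJK_SUPPORT:
        return words
    return [token for word in words for token in split_cjk(word)]


def build_query(user_query):
    """Segment the query like the indexed text and join tokens with `and'."""
    return " and ".join(get_words_from_phrase_sketch(user_query))
```

With the flag set, get_words_from_phrase_sketch("中文 search") yields
the tokens 中, 文 and search, and build_query("中文") yields the query
string "中 and 文". The per-character is_cjk() test is the extra cost
mentioned above, which is why the flag short-circuits it when CJK
support is switched off.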
Note that this seems to be what mnoGoSearch's CJK phrase segmenter
does. <http://www.mnogosearch.org/doc33/msearch-cjk.html>
_______________________________________________________
Follow-up Comments:
-------------------------------------------------------
Date: 2010-09-14 14:38 By: Tibor Simko <simko>
Moved to http://invenio-software.org/ticket/285
_______________________________________________________
Carbon-Copy List:
CC Address | Comment
------------------------------------+-----------------------------
1576 | -SUB-