[task #5303] Synchronizing Indexing, Word Ranking and Searching with respect to Stemming

noreply [Samuele Kaplun] Thu, 24 Jan 2008 18:23:14 +0100

This is an automated notification sent by LCG Savannah.
It relates to:
                task #5303, project CDS Invenio


==============================================================================
 LATEST MODIFICATIONS of task #5303:
==============================================================================

Update of task #5303 (project cdsware):

        Percent Complete:                     80% => 100%                   
             Open/Closed:                    Open => Closed                 


==============================================================================
 OVERVIEW of task #5303:
==============================================================================

URL:
  <http://savannah.cern.ch/task/?5303>

                 Summary: Synchronizing Indexing, Word Ranking and Searching
with respect to Stemming
                 Project: CDS Invenio
            Submitted by: skaplun
            Submitted on: 2007-07-24 15:41
         Should Start On: 2007-07-24 00:00
   Should be Finished on: 2007-07-24 00:00
                Category: BibIndex
                Priority: 5 - Normal
                  Status: Done
                 Privacy: Public
        Percent Complete: 100%
             Assigned to: skaplun
             Open/Closed: Closed
         Discussion Lock: Any
                  Effort: 100.00

    _______________________________________________________


To improve searching and the user experience of CDS Invenio, stemming is a key
feature. Stemming works well when it is applied to real dictionary words in
the correct language.
Right now the situation is:
* CDS Invenio has a default language installation.
* The word ranker knows which language is expected in some fields (configured
in etc/bibrank/wrd.cfg), either English or French.
I've built a list of genuine text fields that could be stemmed.
Everything else should not be stemmed.
Right now stemming is applied to everything or nothing, which is bad for
fields not in the default language, for records not in the default language,
and for fields that should not be stemmed (reportnumber, year, author,
collection...).
I propose to:
* move the per-field language information out of wrd.cfg into a clean
location usable for indexing, ranking and searching;
* apply stemming only to the correct fields with the correct language (both
in indexing and in ranking, based on the information extracted from wrd.cfg);
* add a field to specify the global language of a record (if it is different
from the default language);
* index and rank records not in the default language both with and without
stemming (this produces a bit of overhead, but makes those few records
retrievable);
* For searching, apply stemming in the default language when the query words
belong to an index made only of stemmed fields (abstract, title); do not apply
stemming when the words belong to an index made only of unstemmed fields
(author, year); enrich the raw query with the stemmed words when the index
on which they are queried contains both stemmed and unstemmed fields (global).
* Add a way to specify the language of a query and then search only on
records with that language (maybe using just the newly added field...)
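
The search-side rule proposed above could be sketched roughly as follows. This
is only an illustration: toy_stem, INDEX_KIND and expand_query are hypothetical
names, not CDS Invenio code, and the toy stemmer merely strips a few English
suffixes in place of a real language-aware stemmer.

```python
# Hypothetical sketch; none of these names come from CDS Invenio.

def toy_stem(word):
    """Very crude English suffix stripper, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Assumed classification of indexes by the kind of fields they contain.
INDEX_KIND = {
    "abstract": "stemmed",
    "title": "stemmed",
    "author": "unstemmed",
    "year": "unstemmed",
    "global": "mixed",
}

def expand_query(words, index):
    """Return the terms to look up in `index` for the given query words."""
    kind = INDEX_KIND[index]
    words = [w.lower() for w in words]
    if kind == "stemmed":
        return [toy_stem(w) for w in words]
    if kind == "unstemmed":
        return words
    # Mixed index: keep the raw words and add any stems that differ from them.
    terms = list(words)
    for w in words:
        stem = toy_stem(w)
        if stem not in terms:
            terms.append(stem)
    return terms
```

With this sketch, a query on a mixed index keeps the words as typed and only
adds stems, so exact matches on unstemmed fields are not lost.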

    _______________________________________________________

Follow-up Comments:


-------------------------------------------------------
Date: 2008-01-07 14:10              By: Samuele Kaplun <skaplun>
A new CFG_BIBINDEX_DISABLE_STEMMING_FOR_INDEXES config keyword has now been
added to solve the issue of stemming being applied where it makes little sense,
on critical indexes (like year, author, reportnumber...). Moreover, the mask is
applied at index level, not at tag level, which solves the issue of Comment
#1.
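
As a rough illustration of an index-level mask, the sketch below decides per
index whether tokens get stemmed. Only the keyword name
CFG_BIBINDEX_DISABLE_STEMMING_FOR_INDEXES comes from this task; its value
format (a comma-separated string here), the helper names and the toy stemmer
are all assumptions.

```python
# Assumed value format: comma-separated index names.
# Only the keyword name itself is taken from the task.
CFG_BIBINDEX_DISABLE_STEMMING_FOR_INDEXES = "year,author,reportnumber"

def disabled_indexes(raw_value):
    """Parse the assumed comma-separated setting into a set of index names."""
    return {name.strip() for name in raw_value.split(",") if name.strip()}

def toy_stem(word):
    """Placeholder stemmer: drops a trailing 's' from words longer than 3 letters."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def terms_for_index(index_name, tokens, disabled):
    """Return the terms to store in `index_name`, whatever tag the tokens came from."""
    tokens = [t.lower() for t in tokens]
    if index_name in disabled:
        return tokens  # stemming switched off for the whole index
    return [toy_stem(t) for t in tokens]
```

Because the decision depends only on the index name, every token that lands in
a given index is treated the same way, regardless of the tag it came from.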

It is now understood (by Sam) that the difference in stemming between
indexing and word ranking is a feature and not a bug, hence they won't be
synchronized.

Comment #2 still needs to be implemented.

-------------------------------------------------------
Date: 2007-08-03 14:44              By: Tibor Simko <simko>
Comment #1: Stemming on/off should be configurable on index-level, not on
tag-level.  Otherwise the same word (ellis) would be treated differently
within the same index depending on whether it comes from the author name
field (Ellis, John) or from the abstract field (the Ellis hypothesis).  They
would be stemmed differently, and so not findable when users search for
'ellis'.

Comment #2: We can have default stemming language defined per index.  E.g.
index called "title in French" will use French stemming, index called "title
in English" will use English stemming, and index called "year" will use no
stemming.  When users type their queries, we can apply stemming according to
the index in which they search.
(And for the global index we define some suitable default, such as no
stemming or a stemming in the dominant language of the document corpus.)
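
A per-index default stemming language, as suggested in Comment #2, might look
something like the sketch below. The index names follow the examples in the
comment; the mapping, the function names and the toy suffix strippers are
hypothetical stand-ins for real language-specific stemmers.

```python
# Hypothetical per-index configuration; None means "no stemming".
INDEX_STEMMING_LANGUAGE = {
    "title in English": "en",
    "title in French": "fr",
    "year": None,
    "global": None,  # assumed default; could be the dominant corpus language
}

def _strip_suffixes(word, suffixes):
    """Remove the first matching suffix, keeping at least three letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy stand-ins for real language-specific stemmers.
STEMMERS = {
    "en": lambda w: _strip_suffixes(w, ("ing", "es", "s")),
    "fr": lambda w: _strip_suffixes(w, ("ements", "ement", "es", "s")),
}

def stem_query_word(word, index_name):
    """Stem a query word according to the index the user searches in."""
    language = INDEX_STEMMING_LANGUAGE.get(index_name)
    word = word.lower()
    if language is None:
        return word  # unknown or unstemmed index: leave the word alone
    return STEMMERS[language](word)
```

The same lookup would have to be used at indexing time for a given index,
otherwise stored terms and query terms would not match.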





    _______________________________________________________

Carbon-Copy List:

CC Address                          | Comment
------------------------------------+-----------------------------
1576                                | -COM-
2195                                | -SUB-




==============================================================================

This item URL is:
  <http://savannah.cern.ch/task/?5303>

_______________________________________________
  Message sent via/by LCG Savannah
  http://savannah.cern.ch/
