Hi Vanessa,

You’re correct. DSpace uses Apache Lucene to index and analyse full text, and
one of the things a Lucene analyser (note: all docs and code use the American
spelling, analyzer) does is perform “stemming” on indexed tokens: common
suffixes like “ing”, “ed”, “es”, “ly” are chopped off the tokens, and off your
search terms as well.
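
To make that concrete, here is a rough sketch in Python of just two steps of
the Porter stemming algorithm (steps 1a and 1c). I’m assuming a Porter-style
stemmer is what’s in play here; the function names are mine, not Lucene’s, and
the real algorithm has several more steps (including the ones that strip “ed”
and “ing”).

```python
def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: 'y' is a consonant at the start of a word or
    after a vowel, and a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def contains_vowel(stem: str) -> bool:
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step_1a(word: str) -> str:
    """Plural endings: sses -> ss, ies -> i, ss -> ss, s -> (nothing)."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def step_1c(word: str) -> str:
    """A final 'y' becomes 'i' when the rest of the word contains a vowel."""
    if word.endswith("y") and contains_vowel(word[:-1]):
        return word[:-1] + "i"
    return word


def toy_stem(term: str) -> str:
    """Illustration only: lowercase, then apply steps 1a and 1c."""
    return step_1c(step_1a(term.lower()))


if __name__ == "__main__":
    for term in ["aly", "ali", "alis", "alys", "alies"]:
        print(term, "->", toy_stem(term))
```

All five of your search terms come out as “ali” under these rules, which would
explain why each one returned the same 168 records. Under the same rules fiber
and fibre stay distinct, since a spelling variant isn’t a suffix difference;
conflating those would need something other than a stemmer.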

It’s possible to write your own Lucene analyser, or extend an existing one, and
replace the default DSAnalyzer that DSpace uses with your version. I’ve done
this in my installations, not for stemming, but to properly tokenise macronised
vowels (ā ē ī ō ū) that are used in New Zealand but aren’t in the supported ISO
character sets.

This page might help explain the concepts better than I can:
http://wiki.apache.org/lucene-java/ConceptsAndDefinitions?highlight=(stemmer)

This is a quote from the DSpace system docs:

The Lucene analyzer used in searching and indexing can be configured by setting
the search.analyzer configuration item in dspace.cfg to the class of the
desired analyzer. If this item is not present/commented out, the default Lucene
analyzer org.dspace.search.DSAnalyzer is used.

As well as those analyzers included in the Lucene distribution (see 
lucene.jar), a Chinese analyzer from the Lucene sandbox is included in 
lucene-sandbox.jar. This analyzer is yet to be included in the core Lucene 
distribution but can be configured by setting search.analyzer = 
org.apache.lucene.analysis.cn.ChineseAnalyzer in dspace.cfg.
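
In practice that’s a one-line change in dspace.cfg. A sketch of the relevant
fragment follows; the first two class names come from the docs quoted above,
while org.example.search.MyAnalyzer is a made-up placeholder for a custom
class, not something that ships with DSpace:

```properties
# dspace.cfg (fragment)

# Default: leave the item absent or commented out and DSpace falls back to
# org.dspace.search.DSAnalyzer
# search.analyzer = org.dspace.search.DSAnalyzer

# Chinese analyzer shipped in lucene-sandbox.jar
# search.analyzer = org.apache.lucene.analysis.cn.ChineseAnalyzer

# A custom analyzer (hypothetical class name, for illustration only)
search.analyzer = org.example.search.MyAnalyzer
```

A custom class has to be on DSpace’s classpath, and a full reindex is normally
needed afterwards, since the tokens already in the index were produced by the
old analyzer.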

Cheers,

Kim

From: dspace-general-boun...@mit.edu [mailto:dspace-general-boun...@mit.edu] On 
Behalf Of Vanessa Barrett
Sent: Friday, 18 September 2009 4:30 p.m.
To: dspace-general@mit.edu
Subject: [Dspace-general] Indexing - automatic mapping of plurals and alternate
endings?

Can anyone confirm my understanding of how DSpace performs keyword
indexing/searching? I suspect that it is doing automatic mapping of singular
and plural forms of words.

This is how I came to that understanding.

I was searching for an item authored by Alys (alternate spelling of Alice)
Clark.

I retrieved three items, none of which had the word alys in the metadata or
bitstream. When I searched for alys on its own I got 168 hits, and a cursory
glance at the results list showed that they all had an author with some part of
their name being ali.

I then tried searching for each of the following forms:

aly, ali, alis, alys, alies

Each of these, as a single search term, retrieved exactly the same number of
records (168). Results included items with the following strings in the
abstract:

- ALIS (Advanced Landmine Imaging System), which is a novel landmine detection 
sensor system

- Current ventilatory practices for the management of ALI favor low tidal 
volumes

- Current Trends in Periodontal Diagnosis & Disease Recognition in Malaysia / 
T.B. Taiyeb Ali

- Radiology in the acute abdomen / P.G. Devitt, A. Aly, M. Thomas

My conclusion is that DSpace is doing some process of mapping plural to
singular forms of words, including allowing for alternate endings. If it is
doing this, it is very clever but just a little annoying, as Alys is not the
plural of Ali.

Also, if it is clever enough to do this, why can’t it map fiber to fibre and
color to colour? That would have much greater benefits when searching a
database that includes North American and European data.

Cheers, 

Vanessa Barrett
Digital Services Librarian
The University of Adelaide, AUSTRALIA 5005
Ph    : +61 8 8303 4625
e-mail: vanessa.barr...@adelaide.edu.au

CRICOS Provider Number 00123M
