I doubt that it is "randomly" eating those last characters.

 

The stemmer is removing the endings of words to get back to a common
stem.  So, if your document has the word "dogs", it will generate the
term "dog".  Similarly, "dogged" and "dogging" will probably stem back
to "dog" as well.
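
To make the idea concrete, here is a minimal sketch of suffix stripping. It is a toy, not the Porter algorithm that PorterStemFilter implements; the function name toy_stem and its three-suffix rule list are assumptions made up for illustration.

```python
def toy_stem(word):
    """Strip a few common English suffixes. A toy, NOT the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "dogg" -> "dog".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeioulsz":
        word = word[:-1]
    return word


for w in ("dog", "dogs", "dogged", "dogging"):
    print(w, "->", toy_stem(w))  # every line ends in "dog"
```

All four inputs collapse to the single term "dog", which is why the indexed words look as if their endings were eaten.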

 

Sometimes this results in some pretty funny-looking words in the index
(the Porter stemmer turns "happy" into "happi", for example), but no one
sees the Lucene index (unless you're using the SRW interface, which does
provide the ability to browse the Lucene indexes).

 

The trick is that any search terms you use will also be stemmed, so that
you end up searching against the same term as in the index.  This should
be happening automatically, since you have stemming turned on.
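
A rough sketch of why this works, assuming the same stemmer is applied at index time and at query time. It reuses a toy suffix stripper rather than the real Porter algorithm, and the names toy_stem, build_index and search are hypothetical, not part of Lucene or DSpace.

```python
def toy_stem(word):
    """Strip a few common English suffixes. A toy, NOT the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeioulsz":
        word = word[:-1]
    return word


def build_index(docs):
    """Map each stemmed term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def search(index, query):
    """Stem the query term the same way before looking it up."""
    return index.get(toy_stem(query), set())


docs = {1: "the dogs barked", 2: "a dogged pursuit", 3: "cats sleep"}
index = build_index(docs)

print("dogs" in index)            # False: only the stem "dog" is stored
print(search(index, "dogging"))   # {1, 2}: the query is stemmed to "dog" too
```

The raw word "dogs" never appears in the index, but a search for any variant still finds both documents because the query goes through the same stemming step.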

 

We have now reached the limits of my Lucene knowledge.

 

Ralph

 

________________________________

From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Afonso
Comba de Araujo Neto
Sent: Friday, January 26, 2007 7:33 AM
To: [email protected]
Subject: [Dspace-tech] Lucene issue

 

Hi,

 

I'm studying DSpace and I just confirmed that the
org.apache.lucene.analysis.PorterStemFilter filter of Lucene (which is
present in the latest source of DSAnalyzer.java) is randomly eating the
last character of some of the indexed words. If I remove it, everything
indexes fine. 

 

What's the purpose of this filter? The Lucene documentation doesn't
explain much.

 

Regards,

Afonso
