I'm sure it's not random; the problem is that I couldn't figure out the pattern. It might be a language issue, as I'm indexing documents written in Portuguese.
The real problem is that the same word sometimes gets cut and sometimes doesn't. For example, the word "artificial" (which is spelled exactly the same in English) gets indexed, with no discernible pattern, as both "artifici" and "artificial", and that is reflected in search: it returns some documents for "artifici" and others for "artificial", even though no document contains the word "artifici". There are no hits for "artificia", for example (I was just messing around with it).

As I said, it could be a language problem, but as of now I have no fix other than removing this filter. Maybe it's possible to configure the stemming? Suggestions are totally welcome. Is anyone else experiencing this?

Regards,
Afonso Araujo Neto

Quoting "LeVan,Ralph" <[EMAIL PROTECTED]>:

> I doubt that it is "randomly" eating those last characters.
>
> The stemmer is removing the endings of words to get back to a common
> stem. So, if your document has the word "dogs", it will generate the
> term "dog". Similarly, "dogged" and "dogging" will probably stem back
> to dog as well.
>
> Sometimes this results in some pretty funny looking words in the index,
> but no one sees the Lucene index (unless you're using the SRW interface,
> which does provide the ability to browse the Lucene indexes).
>
> The trick is that any search terms you use will also be stemmed, so that
> you end up searching against the same term as in the index. This should
> be happening automatically, since you have stemming turned on.
>
> We have now reached the limits of my Lucene knowledge.
>
> Ralph
>
> ________________________________
>
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Afonso
> Comba de Araujo Neto
> Sent: Friday, January 26, 2007 7:33 AM
> To: [email protected]
> Subject: [Dspace-tech] Lucene issue
>
> Hi,
>
> I'm studying Dspace and I just confirmed that the
> org.apache.lucene.analysis.PorterStemFilter filter of Lucene (which is
> present in the latest source of DSAnalyzer.java) is randomly eating the
> last character of some of the indexed words. If I remove it, everything
> indexes fine.
>
> What's the purpose of this filter? The Lucene documentation isn't much
> explanatory.
>
> Regards,
>
> Afonso
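
The point in the quoted reply (a stemmed index only works if the search term goes through the same stemming) can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: it does not use Lucene, the `stem` method is a toy suffix rule rather than the Porter algorithm, and the names `StemDemo`, `stem`, `index` and `search` are hypothetical. It shows how a term stored as "artifici" is found when the query is stemmed too, and missed when it is not, which would be one way (not confirmed for DSpace) to end up with results split between "artifici" and "artificial".

```java
import java.util.*;

class StemDemo {
    // Toy stand-in for a stemmer (assumption: NOT the Porter algorithm).
    // It lowercases the word and strips a trailing "al" or "s".
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() > 4 && w.endsWith("al")) return w.substring(0, w.length() - 2);
        if (w.length() > 3 && w.endsWith("s")) return w.substring(0, w.length() - 1);
        return w;
    }

    // Build a tiny inverted index: term -> ids of the documents containing it.
    static Map<String, Set<Integer>> index(List<String> docs, boolean stemmed) {
        Map<String, Set<Integer>> idx = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String tok : docs.get(id).split("\\s+")) {
                if (tok.isEmpty()) continue;
                String term = stemmed ? stem(tok) : tok.toLowerCase(Locale.ROOT);
                idx.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return idx;
    }

    // Look up a single-word query, optionally stemming it first.
    static Set<Integer> search(Map<String, Set<Integer>> idx, String query, boolean stemQuery) {
        String term = stemQuery ? stem(query) : query.toLowerCase(Locale.ROOT);
        return idx.getOrDefault(term, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("inteligencia artificial", "dogs");
        Map<String, Set<Integer>> idx = index(docs, true);

        System.out.println(idx.keySet());                     // [artifici, dog, inteligencia]
        System.out.println(search(idx, "artificial", true));  // [0]  query stemmed like the index
        System.out.println(search(idx, "artificial", false)); // []   query left unstemmed
    }
}
```

If some documents or fields were indexed without the stemming filter and others with it, the same lookup would behave differently per document, so comparing the analyzer used at index time with the one used at query time is probably the first thing to check.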

