Re: [Dspace-tech] Lucene issue

Afonso Comba de Araujo Neto Sat, 27 Jan 2007 05:30:28 -0800


I'm sure it's not random; the problem is that I can't figure out
the pattern. It might be a language issue, as I'm indexing documents
written in Portuguese.


The real problem is that the same word is sometimes cut and sometimes
not. For example, the word "artificial" (which is spelled identically
in English) gets indexed, with no discernible pattern, as both
"artifici" and "artificial", and that is reflected in search. A search
in fact returns some documents for "artifici" and others for
"artificial", even though no document contains the word "artifici".
There are no hits for "artificia", for example (I was just messing
around with it).

As I said, it could be a language problem. But for now I have no
fix other than removing this filter. Maybe it's possible to
configure the stemming?
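
To make the symptom concrete, here is a minimal sketch of how an index
can end up holding both "artifici" and "artificial". It assumes, purely
for illustration, that some documents pass through a stemmer and others
do not; the thread does not establish that this is the actual cause.
The toyStem function is a crude stand-in, not Lucene's PorterStemFilter,
and every class and method name here is hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class StemMismatchSketch {

    // Crude stand-in for a stemmer: strips a trailing "al" from longer words.
    // NOT the Porter algorithm; just enough to turn "artificial" into "artifici".
    static String toyStem(String word) {
        String w = word.toLowerCase();
        return (w.length() > 5 && w.endsWith("al")) ? w.substring(0, w.length() - 2) : w;
    }

    // Builds a term -> document-id index. Only documents whose id is in
    // stemmedIds have their tokens stemmed (the assumed inconsistency).
    static Map<String, Set<Integer>> index(Map<Integer, String> docs, Set<Integer> stemmedIds) {
        Map<String, Set<Integer>> idx = new TreeMap<>();
        for (Map.Entry<Integer, String> e : docs.entrySet()) {
            for (String tok : e.getValue().toLowerCase().split("\\s+")) {
                String term = stemmedIds.contains(e.getKey()) ? toyStem(tok) : tok;
                idx.computeIfAbsent(term, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return idx;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new TreeMap<>();
        docs.put(1, "inteligencia artificial");
        docs.put(2, "vida artificial");

        // Document 1 is stemmed at index time, document 2 is not.
        Set<Integer> stemmedIds = new TreeSet<>();
        stemmedIds.add(1);

        Map<String, Set<Integer>> idx = index(docs, stemmedIds);
        System.out.println("artifici   -> " + idx.get("artifici"));
        System.out.println("artificial -> " + idx.get("artificial"));
        System.out.println("artificia  -> " + idx.get("artificia"));
    }
}
```

If every document and every query went through the same stemmer, both
documents would sit under the single term "artifici" and a search for
"artificial" would find them both.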

Suggestions are totally welcome. Is anyone else experiencing this?


Regards,
Afonso Araujo Neto



Quoting "LeVan,Ralph" <[EMAIL PROTECTED]>:

> I doubt that it is "randomly" eating those last characters.
>
>
>
> The stemmer is removing the endings of words to get back to a common
> stem.  So, if your document has the word "dogs", it will generate the
> term "dog".  Similarly, "dogged" and "dogging" will probably stem back
> to dog as well.
>
>
>
> Sometimes this results in some pretty funny looking words in the index,
> but no one sees the Lucene index (unless you're using the SRW interface,
> which does provide the ability to browse the Lucene indexes).
>
>
>
> The trick is that any search terms you use will also be stemmed, so that
> you end up searching against the same term as in the index.  This should
> be happening automatically, since you have stemming turned on.
>
>
>
> We have now reached the limits of my Lucene knowledge.
>
>
>
> Ralph
>
>
>
> ________________________________
>
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Afonso
> Comba de Araujo Neto
> Sent: Friday, January 26, 2007 7:33 AM
> To: [email protected]
> Subject: [Dspace-tech] Lucene issue
>
>
>
> Hi,
>
>
>
> I'm studying Dspace and I just confirmed that the
> org.apache.lucene.analysis.PorterStemFilter filter of Lucene (which is
> present in the latest source of DSAnalyzer.java) is randomly eating the
> last character of some of the indexed words. If I remove it, everything
> indexes fine.
>
>
>
> What's the purpose of this filter? The Lucene documentation isn't very
> explanatory.
>
>
>
> Regards,
>
> Afonso
>
>


