I'm sure the problem is with the filter, because I compared the words indexed 
with and without it (they come from PDF files, and I also inspected 
their text counterparts). I looked into it and there is a stemming filter for 
Brazilian Portuguese (which was not added to DSpace), but it doesn't work very 
well. 

What intrigues me the most is that the same words get indexed in more than one 
way. I think it may not even be a language-related issue. Anyway, I would 
suggest adding a way to turn off the stemming filter for people 
using DSpace with languages it doesn't support. Most people are not even aware 
that this is happening (I found it almost by accident).

Regards,
Afonso Araujo Neto



-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Dan Scott
Sent: Saturday, January 27, 2007 14:54
To: [email protected]
Subject: Re: [Dspace-tech] Lucene issue

Ages ago when I was working with Lucene in the Eclipse help system,
Lucene's stemming algorithm only supported English and German. It
hadn't been taught to understand any other languages, and therefore
was prone to mangling other languages. I'm not sure where it stands
right now.

The other, more likely possibility is that the text that Lucene is
working against might have been mangled when it was converted from the
source format. For example, consider hypothetical HTML source:

<span class="left">left</span><span class="right">right</span>

The CSS for the HTML file might have clearly separated the words
"left" and "right" on the rendered HTML page, but the conversion
from HTML to text will probably smash the words together as "leftright"
because there's no reason to treat adjacent span tags as word breaks,
which will obviously affect your search results.
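
To make the "leftright" case concrete, here is a minimal sketch in Python. It
assumes a naive converter that simply drops tags and keeps the text nodes; the
class and function names are made up for illustration, and this is not the
converter that DSpace or Lucene actually uses.

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Hypothetical converter: keeps text nodes, drops every tag."""

    def __init__(self, separator=""):
        super().__init__()
        self.separator = separator  # what to put between adjacent text nodes
        self.chunks = []

    def handle_data(self, data):
        # Called once per run of text between tags.
        self.chunks.append(data)

    def text(self):
        return self.separator.join(self.chunks)


def extract(html, separator=""):
    parser = NaiveTextExtractor(separator)
    parser.feed(html)
    parser.close()
    return parser.text()


html = '<span class="left">left</span><span class="right">right</span>'

smashed = extract(html)          # nothing inserted at tag boundaries
separated = extract(html, " ")   # a space inserted at every tag boundary

print(smashed.split())    # tokens an indexer would see without a separator
print(separated.split())  # tokens it would see with a separator
```

Without a separator the indexer sees the single token "leftright"; with one it
sees "left" and "right". A similar effect could occur in text extracted from
PDF files, though that is an assumption that would need checking against the
extracted text itself.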

Try "*artifici*" to see if that returns the hits that you expected.
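
As a rough illustration of why that pattern should catch both forms, here is a
small Python sketch. It assumes the index holds the terms reported in this
thread ("artifici" and "artificial"); the third term and the helper name are
hypothetical, and shell-style matching stands in for Lucene's own wildcard
handling, which may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical index contents: the first two terms are the ones reported
# in this thread, the third is made up as a non-matching example.
index_terms = ["artifici", "artificial", "intelig"]


def wildcard_matches(pattern, terms):
    """Return the terms that match a shell-style wildcard pattern."""
    return [term for term in terms if fnmatchcase(term, pattern)]


matches = wildcard_matches("*artifici*", index_terms)
print(matches)

# An exact term that is not in the index matches nothing.
print(wildcard_matches("artificia", index_terms))
```

The wildcard pattern covers both the stemmed and the unstemmed term, while the
exact term "artificia" matches neither, which is consistent with the searches
described below.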

Dan Scott

On 27/01/07, Afonso Comba de Araujo Neto <[EMAIL PROTECTED]> wrote:
>
>
> I'm sure it's not random; the problem is that I couldn't figure out
> the pattern. It might be a language issue, as I'm indexing documents
> written in Portuguese.
>
> The real problem is the fact that sometimes the same words get cut and
> sometimes not. For example, the word "artificial" (which translates to
> the exact same English word) gets (with no discernible pattern)
> indexed as "artifici" and "artificial", and that is reflected in
> search. It in fact returns some documents for "artifici" and other
> ones for "artificial", even though there is no document with the word
> "artifici". There are no hits for "artificia", for example (I was just
> messing around with it).
>
> As I said, it could be a language problem. But as of now I have no
> fix other than removing this filter. Maybe it's possible to
> configure the stemming?
>
> Suggestions are totally welcome. Is anyone else experiencing this?
>
>
> Regards,
> Afonso Araujo Neto
>
>
>
> Quoting "LeVan,Ralph" <[EMAIL PROTECTED]>:
>
> > I doubt that it is "randomly" eating those last characters.
> >
> >
> >
> > The stemmer is removing the endings of words to get back to a common
> > stem.  So, if your document has the word "dogs", it will generate the
> > term "dog".  Similarly, "dogged" and "dogging" will probably stem back
> > to dog as well.
> >
> >
> >
> > Sometimes this results in some pretty funny looking words in the index,
> > but no one sees the Lucene index (unless you're using the SRW interface,
> > which does provide the ability to browse the Lucene indexes).
> >
> >
> >
> > The trick is that any search terms you use will also be stemmed, so that
> > you end up searching against the same term as in the index.  This should
> > be happening automatically, since you have stemming turned on.
> >
> >
> >
> > We have now reached the limits of my Lucene knowledge.
> >
> >
> >
> > Ralph
> >
> >
> >
> > ________________________________
> >
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of Afonso
> > Comba de Araujo Neto
> > Sent: Friday, January 26, 2007 7:33 AM
> > To: [email protected]
> > Subject: [Dspace-tech] Lucene issue
> >
> >
> >
> > Hi,
> >
> >
> >
> > I'm studying DSpace and I just confirmed that the
> > org.apache.lucene.analysis.PorterStemFilter filter of Lucene (which is
> > present in the latest source of DSAnalyzer.java) is randomly eating the
> > last character of some of the indexed words. If I remove it, everything
> > indexes fine.
> >
> >
> >
> > What's the purpose of this filter? The Lucene documentation doesn't
> > explain much.
> >
> >
> >
> > Regards,
> >
> > Afonso
> >
> >
>
>
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys - and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>

