RE: Need pointers on using a very small part of Lucene

Alex Murzaku Fri, 15 Mar 2002 04:51:00 -0800

You should keep in mind though that frequency plays a role in computing
the score. If you eliminate repeated words, you will have frequency 1
for all of them and will lose one of the dimensions used in ranking.
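
To make that concrete, here is a rough sketch of what deduplication does to the frequency dimension. It assumes a term-frequency factor of sqrt(freq), which is my understanding of Lucene's default Similarity; treat that formula, and the class and method names (TfSketch, termFrequencies, tf), as illustrative assumptions rather than the actual scoring code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;

public class TfSketch {
    /** Counts how often each term occurs, preserving first-seen order. */
    static Map<String, Integer> termFrequencies(String[] tokens) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String t : tokens) {
            freq.merge(t, 1, Integer::sum);
        }
        return freq;
    }

    /** Assumed tf factor: square root of the raw frequency. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // A few stemmed tokens in the style of the sample document.
        String[] tokens = {"bertek", "product", "inform", "product", "inform", "product"};

        Map<String, Integer> raw = termFrequencies(tokens);

        // Same tokens with repeats removed: every term now has frequency 1.
        String[] unique = new LinkedHashSet<>(Arrays.asList(tokens)).toArray(new String[0]);
        Map<String, Integer> deduped = termFrequencies(unique);

        for (String term : raw.keySet()) {
            System.out.printf("%-8s raw tf=%.2f  deduped tf=%.2f%n",
                    term, tf(raw.get(term)), tf(deduped.get(term)));
        }
    }
}
```

After deduplication every term gets the same tf factor, so a document that mentions "product" three times can no longer be ranked above one that mentions it once on that basis.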


-----Original Message-----
From: Kelvin Tan [mailto:[EMAIL PROTECTED]] 
Sent: Thursday, March 14, 2002 9:51 PM
To: Lucene Users List
Subject: Re: Need pointers on using a very small part of Lucene


Robert,

>
> I just have one more question - how do I remove repeated words? Does 
> anyone have a filter for doing this?
>
> For example, here's the result of one of my files being worked on: 
> "todai customer.formattedmailingaddress3 dear customer.dearnam respond
> request inform productlongnam summari inform subjectssinglelin addit 
> question topic question bertek product call 1-888-523-7835 ext.9877 
> product inform productnam includ correspond product inform product 
> bertek product electron internet site http wwwbertekcom interest 
> bertek pharmaceut product appreci sincer responsiblehcp.signatur 
> responsiblehcp.fullnam responsiblehcp.position.nam initiator.initi 
> responsiblehcp.initi casenumb cc salesrep.fullnam enclosur 
> enclosurestextbullet bodytext"
>
>
> If you look closely, you'll see the word 'question' repeated twice.

One way to do it is to write a TokenFilter that keeps a Set of every
token's termText. In the next() method, check whether the current
token's termText already exists in the Set. If so, skip it and move on
to the next token. If not, add it to the Set and return the token. Note
that this may be rather memory-intensive, since the Set grows with the
number of distinct terms...:)
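
A minimal sketch of that idea, kept independent of the Lucene jar so it runs on its own. TermStream, UniqueTermFilter, fromList and drain are hypothetical names made up for illustration; a real filter would extend Lucene's TokenFilter and work on Token objects rather than plain strings, and that wiring is left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class DedupSketch {
    /** Hypothetical stand-in for a token stream: next() returns null when exhausted. */
    interface TermStream {
        String next();
    }

    /** Wraps another stream and drops any term that was already emitted. */
    static class UniqueTermFilter implements TermStream {
        private final TermStream input;
        // Grows with the number of distinct terms, hence the memory caveat.
        private final Set<String> seen = new HashSet<>();

        UniqueTermFilter(TermStream input) {
            this.input = input;
        }

        @Override
        public String next() {
            String term;
            while ((term = input.next()) != null) {
                // Set.add returns false if the term was already present.
                if (seen.add(term)) {
                    return term;
                }
            }
            return null; // underlying stream is exhausted
        }
    }

    /** Helper: turns a list of terms into a stream. */
    static TermStream fromList(List<String> terms) {
        Iterator<String> it = terms.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Helper: reads a stream to the end and collects its terms. */
    static List<String> drain(TermStream stream) {
        List<String> out = new ArrayList<>();
        String term;
        while ((term = stream.next()) != null) {
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "addit", "question", "topic", "question", "bertek", "product");
        // prints [addit, question, topic, bertek, product]
        System.out.println(drain(new UniqueTermFilter(fromList(terms))));
    }
}
```

Use a fresh filter instance per document, so the Set only ever holds the distinct terms of the document being processed.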

HTH.

Regards,
Kelvin

>
>
> thanks,
> rob
>
>
>
> --
> To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
>


