Keep in mind, though, that term frequency plays a role in computing the score. If you eliminate repeated words, every term will have a frequency of 1, and you will lose one of the dimensions used in ranking.
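To make the point concrete, here is a minimal sketch in plain Java (no Lucene classes; `TermFrequencySketch` and `termFrequencies` are names made up for illustration). It counts term occurrences in a short excerpt of the sample text from the quoted message, first as-is and then after duplicates have been removed. It does not reproduce Lucene's actual scoring formula; it only shows the frequency information that deduplication discards.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

public class TermFrequencySketch {
    /** Counts how often each term occurs, preserving first-seen order. */
    static Map<String, Integer> termFrequencies(Iterable<String> terms) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : terms) {
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Excerpt of the stemmed sample text from the quoted message.
        List<String> terms =
            List.of("addit", "question", "topic", "question", "bertek", "product");

        // With repeats kept, "question" has frequency 2.
        System.out.println(termFrequencies(terms));
        // prints {addit=1, question=2, topic=1, bertek=1, product=1}

        // After deduplication every term has frequency 1.
        System.out.println(termFrequencies(new LinkedHashSet<>(terms)));
        // prints {addit=1, question=1, topic=1, bertek=1, product=1}
    }
}
```

Whether losing that signal matters depends on how the index is used: if only the presence of a term is needed, deduplication costs nothing, but ranked search results will no longer favor documents that mention a term more often.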
-----Original Message-----
From: Kelvin Tan [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 14, 2002 9:51 PM
To: Lucene Users List
Subject: Re: Need pointers on using a very small part of Lucene

Robert,

> I just have one more question - how do I remove repeated words? Does
> anyone have a filter for doing this?
>
> For example, here's the result of one of my files being worked on:
> "todai customer.formattedmailingaddress3 dear customer.dearnam respond
> request inform productlongnam summari inform subjectssinglelin addit
> question topic question bertek product call 1-888-523-7835 ext.9877
> product inform productnam includ correspond product inform product
> bertek product electron internet site http wwwbertekcom interest
> bertek pharmaceut product appreci sincer responsiblehcp.signatur
> responsiblehcp.fullnam responsiblehcp.position.nam initiator.initi
> responsiblehcp.initi casenumb cc salesrep.fullnam enclosur
> enclosurestextbullet bodytext"
>
> If you look closely, you'll see the word 'question' repeated twice.

One way to do it is to write a TokenFilter, and basically construct a Set of all token.termText. In the next() method, provide a check to see if this token.termText already exists in the Set. If so, ignore it. If not, add it to the set and carry on.

Note that this may be rather memory-intensive...:)

HTH.

Regards,
Kelvin

> thanks,
> rob
>
> --
> To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
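The Set-based approach Kelvin describes can be sketched roughly as follows. To keep the example self-contained it wraps a plain `Iterator<String>` of term texts rather than extending Lucene's TokenFilter, so `UniqueTermIterator` is a hypothetical stand-in and not part of the Lucene API. The same check would go into the filter's next() method. It assumes the input contains no null terms.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;

/** Wraps a stream of term texts and drops any term already seen. */
public class UniqueTermIterator implements Iterator<String> {
    private final Iterator<String> input;
    private final Set<String> seen = new HashSet<>(); // grows with the vocabulary
    private String pending; // next unseen term, or null if none is buffered

    public UniqueTermIterator(Iterator<String> input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        // Skip ahead until a term turns up that has not been seen before.
        while (pending == null && input.hasNext()) {
            String term = input.next();
            if (seen.add(term)) { // add() returns false for a repeat
                pending = term;
            }
        }
        return pending != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String term = pending;
        pending = null;
        return term;
    }

    public static void main(String[] args) {
        Iterator<String> unique = new UniqueTermIterator(
            List.of("addit", "question", "topic", "question", "bertek", "product")
                .iterator());
        StringBuilder out = new StringBuilder();
        while (unique.hasNext()) {
            out.append(unique.next()).append(' ');
        }
        System.out.println(out.toString().trim());
        // prints: addit question topic bertek product
    }
}
```

The `seen` set holds one entry per distinct term in the document, which is the memory cost mentioned above; creating a fresh wrapper per document keeps that cost bounded by the size of a single document's vocabulary.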
