Robert,

> I just have one more question - how do I remove repeated words? Does
> anyone have a filter for doing this?
>
> For example, here's the result of one of my files being worked on:
> "todai customer.formattedmailingaddress3 dear customer.dearnam respond
> request inform productlongnam summari inform subjectssinglelin addit
> question topic question bertek product call 1-888-523-7835 ext.9877
> product inform productnam includ correspond product inform product bertek
> product electron internet site http wwwbertekcom interest bertek
> pharmaceut product appreci sincer responsiblehcp.signatur
> responsiblehcp.fullnam responsiblehcp.position.nam initiator.initi
> responsiblehcp.initi casenumb cc salesrep.fullnam enclosur
> enclosurestextbullet bodytext"
>
> If you look closely, you'll see the word 'question' repeated twice.

One way to do it is to write a TokenFilter that keeps a Set of every
token.termText() it has seen so far. In the next() method, check whether
the current token's termText() is already in the Set. If it is, skip the
token. If not, add it to the Set and pass the token on.

Note that this may be rather memory-intensive, since the Set grows with
the number of distinct terms... :)

HTH.

Regards,
Kelvin

> thanks,
> rob
>
> --
> To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
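
To make the Set-based idea from the reply concrete, here is a minimal sketch in plain Java. It is deliberately written without the Lucene classes: DedupTokenStream and its dedup() helper are hypothetical names made up for illustration, and tokens are modelled as plain Strings rather than Lucene Token objects. In a real TokenFilter the same check would presumably live in next(), as described in the reply.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;

/**
 * Hypothetical stand-in for a de-duplicating token filter.
 * This is NOT Lucene's TokenFilter; tokens are assumed to be non-null Strings.
 */
class DedupTokenStream implements Iterator<String> {
    private final Iterator<String> input;
    // Every term text seen so far; grows with the number of distinct terms.
    private final Set<String> seen = new HashSet<>();
    // The next unseen term, buffered by hasNext() until next() hands it out.
    private String pending;

    DedupTokenStream(Iterator<String> input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        // Pull from the wrapped stream until a term not yet in the Set appears.
        while (pending == null && input.hasNext()) {
            String term = input.next();
            // Set.add returns false when the term is already present.
            if (seen.add(term)) {
                pending = term;
            }
        }
        return pending != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String term = pending;
        pending = null;
        return term;
    }

    /** Convenience helper: returns the terms with later repeats dropped. */
    static List<String> dedup(List<String> terms) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new DedupTokenStream(terms.iterator());
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

Calling DedupTokenStream.dedup() on a token list keeps the first occurrence of each term, in order, and drops the later ones. The Set holds one entry per distinct term for the whole stream, which is where the memory cost mentioned in the reply comes from.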
