I remember you sending this once before....a long time ago.
This time I'll stick it in sandbox/contributions/analyzers/...
Thanks!
Otis
--- David Spencer <[EMAIL PROTECTED]> wrote:
>
> Out of curiosity - does anyone use a Filter based on string (token)
> length. Use case is, say, you're indexing email msgs and if an
> attachment is uuencoded into lines of 60 or whatever characters then
> you
> don't want to index tokens that are so long as they can't possibly
> be
> of use later and just eat up disk space.
>
> Plz feel free to add this to sandbox with whatever license is
> appropriate.
>
> The code is easy:
>
> /**
> * Removes words that are too long and too short from the stream
> */
> public final class StrlenFilter
> extends TokenFilter
> {
> /**
> * Build a filter that removes words that are too long or too short
> from the text.
> */
> public StrlenFilter(TokenStream in, int min, int max)
> {
> input = in;
> this.min = min;
> this.max =max;
> }
>
> /** Returns the next input Token whose termText() is the right len
> */
> public final Token next() throws IOException
> {
> // return the first non-stop word found
> for (Token token = input.next(); token != null; token =
> input.next())
> {
> final int len = token.termText().length();
> if ( len >= min && len <= max)
> return token;
> // note: else we ignore it but should we index each part of it?
> }
> // reached EOS -- return null
> return null;
> }
> final int min;
> final int max;
> }
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]