I remember you sending this once before....a long time ago.
This time I'll stick it in sandbox/contributions/analyzers/...
Thanks!

Otis

--- David Spencer <[EMAIL PROTECTED]> wrote:
> 
> Out of curiosity - does anyone use a Filter based on string (token) 
> length. Use case is, say, you're indexing email msgs and if an 
> attachment is uuencoded into lines of 60 or whatever characters then
> you 
>   don't want to index tokens that are so long as they can't possibly
> be 
> of use later and just eat up disk space.
> 
> Plz feel free to add this to sandbox with whatever license is
> appropriate.
> 
> The code is easy:
> 
> /**
>   * Removes words that are too long and too short from the stream
>   */
> public final class StrlenFilter
>       extends TokenFilter
> {
>       /**
>        * Build a filter that removes words that are too long or too short 
> from the text.
>        */
>       public StrlenFilter(TokenStream in, int min, int max)
>       {
>               input = in;
>               this.min = min;
>               this.max =max;
>       }
> 
>       /** Returns the next input Token whose termText() is the right len
>        */
>       public final Token next() throws IOException
>       {
>               // return the first non-stop word found
>               for (Token token = input.next(); token != null; token =
> input.next())
>               {
>                       final int len = token.termText().length();
>                       if ( len >= min && len <= max)
>                               return token;
>                       // note: else we ignore it but should we index each part of it?
>               }
>               // reached EOS -- return null           
>               return null;
>       }
>       final int min;
>       final int max;
> }
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to