Out of curiosity, does anyone use a filter based on string (token) length? The use case is, say, indexing email messages: if an attachment is uuencoded into lines of 60 or so characters, you don't want to index tokens so long that they can't possibly be of use later and just eat up disk space.


Please feel free to add this to the sandbox under whatever license is appropriate.

The code is easy:

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Removes tokens that are too long or too short from the stream.
 */
public final class StrlenFilter
        extends TokenFilter
{
        /**
         * Build a filter that keeps only tokens whose length is between
         * min and max characters, inclusive.
         */
        public StrlenFilter(TokenStream in, int min, int max)
        {
                input = in;
                this.min = min;
                this.max = max;
        }

        /**
         * Returns the next input Token whose termText() length is within
         * [min, max], or null at end of stream.
         */
        public final Token next() throws IOException
        {
                // return the first token whose length is in range
                for (Token token = input.next(); token != null; token = input.next())
                {
                        final int len = token.termText().length();
                        if (len >= min && len <= max)
                                return token;
                        // note: else we ignore it but should we index each part of it?
                }
                // reached EOS -- return null
                return null;
        }

        private final int min;
        private final int max;
}
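
For anyone who wants to try the length check without a Lucene build handy, here is a minimal standalone sketch of the same logic on plain strings. The class name LengthFilterSketch, the method filterByLength, and the sample tokens are made up for illustration and are not part of Lucene; the bounds of 2 and 40 are arbitrary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: the min/max length check on plain strings, no Lucene. */
final class LengthFilterSketch
{
        /** Returns the tokens whose length lies in [min, max], in order. */
        static List<String> filterByLength(List<String> tokens, int min, int max)
        {
                List<String> kept = new ArrayList<String>();
                for (String token : tokens)
                {
                        final int len = token.length();
                        if (len >= min && len <= max)
                                kept.add(token);
                }
                return kept;
        }

        public static void main(String[] args)
        {
                // Hypothetical stand-in for one 60-character uuencoded line.
                StringBuilder encoded = new StringBuilder();
                for (int i = 0; i < 60; i++)
                        encoded.append('M');

                List<String> tokens =
                        Arrays.asList("report", "a", "attached", encoded.toString());

                // Drops "a" (too short) and the 60-character run (too long).
                System.out.println(filterByLength(tokens, 2, 40));
                // prints [report, attached]
        }
}
```

Both bounds are inclusive, as in the filter above, so a token of exactly min or max characters is kept.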
