RE: Email Filter using Lucene 3.0

Uwe Schindler Fri, 29 Jan 2010 04:52:24 -0800

Can you send us the original filter?

The implementation below is wrong in its whole design. Each attribute is a 
singleton within a given TokenStream instance, so your code cannot work: 
addAttribute always returns the same instance. You have to register the 
attributes once in the ctor using addAttribute and then fill that one instance 
with the correct text part on each call to incrementToken():


Here is my proposal; it's just pseudo-code:

In ctor:

        Define a class member (!!!) LinkedList<String> for your split email
        addresses, initially empty
        termAtt = addAttribute(TermAttribute.class);

In incrementToken:

if (!linkedList.isEmpty()) {
        // important: call clearAttributes() only here; if you do it in the
        // filter part you will break your stream!!!!
        clearAttributes();
        termAtt.setTermBuffer(linkedList.removeFirst());
        // optionally set offsets and so on in the other attributes
        return true;
} else {
        if (!input.incrementToken()) return false;
        read the text of the term that input just generated using
        termAtt.term() (no need for new String; termAtt is the one
        registered in ctor)
        split your term into the token parts and add all token parts to the
        LinkedList<String> above
        // recurse to self: the list should now have elements; if it is still
        // empty, the call takes this branch again and pulls the next token
        return incrementToken();
}
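
For illustration only, here is a minimal, self-contained Java sketch of the
buffering pattern in the pseudo-code above. It deliberately does not use
Lucene, so it runs without the library on the classpath: Stream, ListStream,
collect and the simplified split helper are hypothetical stand-ins, not Lucene
classes. The "term" field plays the role of the single TermAttribute instance,
and a loop replaces the recursion (same behaviour). In a real Lucene 3.0
filter you would extend TokenFilter, register termAtt with
addAttribute(TermAttribute.class) in the ctor, and call clearAttributes()
before termAtt.setTermBuffer(...), as described above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

class EmailFilterSketch {

    /** Hypothetical stand-in for TokenStream; "term" mimics the one TermAttribute instance. */
    static abstract class Stream {
        String term;
        abstract boolean incrementToken();
    }

    /** Hypothetical upstream stream that emits a fixed list of terms. */
    static class ListStream extends Stream {
        private final Iterator<String> it;
        ListStream(List<String> terms) { this.it = terms.iterator(); }
        @Override boolean incrementToken() {
            if (!it.hasNext()) return false;
            term = it.next();
            return true;
        }
    }

    /** The filter: buffers the parts of one input term and emits them one per call. */
    static class EmailFilter extends Stream {
        private final Stream input;
        // Class member (not a local variable): it must survive between calls.
        private final LinkedList<String> pending = new LinkedList<String>();
        EmailFilter(Stream input) { this.input = input; }

        @Override boolean incrementToken() {
            // Refill from the input only when the buffer is drained; keep
            // pulling if a term happens to produce no parts at all.
            while (pending.isEmpty()) {
                if (!input.incrementToken()) return false;
                pending.addAll(split(input.term));
            }
            term = pending.removeFirst(); // fill the one shared slot
            return true;
        }
    }

    /** Simplified, assumed splitting rule: full address, local part, domain. */
    static List<String> split(String term) {
        List<String> parts = new ArrayList<String>();
        String cleaned = term.replaceAll("[<>\"]", "");
        if (cleaned.isEmpty()) return parts;
        parts.add(cleaned);
        int at = cleaned.indexOf('@');
        if (at > 0 && at < cleaned.length() - 1) {
            parts.add(cleaned.substring(0, at));
            parts.add(cleaned.substring(at + 1));
        }
        return parts;
    }

    /** Helper that drains a stream into a list, one token per incrementToken() call. */
    static List<String> collect(Stream s) {
        List<String> out = new ArrayList<String>();
        while (s.incrementToken()) out.add(s.term);
        return out;
    }
}
```

With the input terms "hello", "<>" and "<jane@example.com>", collect(new
EmailFilter(new ListStream(...))) yields hello, jane@example.com, jane,
example.com: one token per call, with the empty "<>" term skipped.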

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: Jamie [mailto:[email protected]]
> Sent: Friday, January 29, 2010 1:29 PM
> To: [email protected]
> Subject: Email Filter using Lucene 3.0
> 
> Hi there
> 
> In the absence of documentation, I am trying to convert an EmailFilter
> class to Lucene 3.0. It's not working! Obviously, my understanding of
> the new token filter mechanism is misguided.
> Can someone in the know help me out for a sec and let me know where I
> am going wrong? Thanks.
> 
> import org.apache.commons.logging.*;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.tokenattributes.TermAttribute;
> 
> import java.io.IOException;
> import java.io.Serializable;
> import java.util.ArrayList;
> import java.util.Stack;
> 
> /* Many thanks to Michael J. Prichard <[email protected]> for his
>   * original email filter code. It is rewritten. */
> 
> public class EmailFilter extends TokenFilter  implements Serializable {
> 
>      public EmailFilter(TokenStream in) {
>          super(in);
>      }
> 
>      public final boolean incrementToken() throws java.io.IOException {
> 
>          if (!input.incrementToken()) {
>              return false;
>          }
> 
> 
>          TermAttribute termAtt = (TermAttribute)
> input.getAttribute(TermAttribute.class);
> 
>          char[] buffer = termAtt.termBuffer();
>          final int bufferLength = termAtt.termLength();
>          String emailAddress = new String(buffer, 0,bufferLength);
>          emailAddress = emailAddress.replaceAll("<", "");
>          emailAddress = emailAddress.replaceAll(">", "");
>          emailAddress = emailAddress.replaceAll("\"", "");
> 
>          String [] parts = extractEmailParts(emailAddress);
>          clearAttributes();
>          for (int i = 0; i < parts.length; i++) {
>              if (parts[i]!=null) {
>                  TermAttribute newTermAttribute =
> addAttribute(TermAttribute.class);
>                  newTermAttribute.setTermBuffer(parts[i]);
>                  newTermAttribute.setTermLength(parts[i].length());
>              }
>          }
>          return true;
>      }
> 
>      private String[] extractWhitespaceParts(String email) {
>          String[] whitespaceParts = email.split(" ");
>          ArrayList<String> partsList = new ArrayList<String>();
>          for (int i=0; i < whitespaceParts.length; i++) {
>              partsList.add(whitespaceParts[i]);
>          }
>          return whitespaceParts;
>      }
> 
>      private String[] extractEmailParts(String email) {
> 
>          if (email.indexOf('@')==-1)
>              return extractWhitespaceParts(email);
> 
>          ArrayList<String> partsList = new ArrayList<String>();
> 
>          String[] whitespaceParts = extractWhitespaceParts(email);
> 
>           for (int w=0;w<whitespaceParts.length;w++) {
> 
>               if (whitespaceParts[w].indexOf('@')==-1)
>                   partsList.add(whitespaceParts[w]);
>               else {
>                   partsList.add(whitespaceParts[w]);
>                   String[] splitOnAmpersand =
> whitespaceParts[w].split("@");
>                   try {
>                       partsList.add(splitOnAmpersand[0]);
>                       partsList.add(splitOnAmpersand[1]);
>                   } catch (ArrayIndexOutOfBoundsException ae) {}
> 
>                  if (splitOnAmpersand.length > 0) {
>                      String[] splitOnDot =
> splitOnAmpersand[0].split("\\.");
>                       for (int i=0; i < splitOnDot.length; i++) {
>                           partsList.add(splitOnDot[i]);
>                       }
>                  }
>                  if (splitOnAmpersand.length > 1) {
>                      String[] splitOnDot =
> splitOnAmpersand[1].split("\\.");
>                      for (int i=0; i < splitOnDot.length; i++) {
>                          partsList.add(splitOnDot[i]);
>                      }
> 
>                      if (splitOnDot.length > 2) {
>                          String domain = splitOnDot[splitOnDot.length-2]
>                                  + "." + splitOnDot[splitOnDot.length-1];
>                          partsList.add(domain);
>                      }
>                  }
>               }
>           }
>          return partsList.toArray(new String[0]);
>      }
> 
> }
> 



