On Mar 10, 2004, at 1:08 PM, Doug Cutting wrote:
[EMAIL PROTECTED] wrote:
  -  public StopFilter(TokenStream in, Set stopTable) {
  +  public StopFilter(TokenStream in, Set stopWords) {
       super(in);
  -    table = stopTable;
  +    this.stopWords = new HashSet(stopWords);
     }

This always allocates a new HashSet, which could hurt performance if the stop list is large and documents are small.

OK, after some more thought: part of the dilemma is that analyzers generally construct all of their tokenizers and token filters in the tokenStream method. It would seem better for them to keep instance variables for all the invariant pieces.
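
To make the idea concrete, here is a rough sketch of an analyzer that builds its stop set once and hands the same instance to every filter it creates. This is not Lucene code: SketchAnalyzer and SketchStopFilter are hypothetical stand-ins, a plain Iterator<String> plays the role of a TokenStream, and the generics are only there to keep the example short.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Set;

// Hypothetical stand-in for a stop filter: wraps a token iterator and skips stop words.
// It keeps the Set reference it is given instead of copying it.
class SketchStopFilter implements Iterator<String> {
    private final Iterator<String> in;
    private final Set<String> stopWords;
    private String next;

    SketchStopFilter(Iterator<String> in, Set<String> stopWords) {
        this.in = in;
        this.stopWords = stopWords; // shared reference, no per-stream allocation
        advance();
    }

    // Move to the next token that is not a stop word, or to null at end of input.
    private void advance() {
        next = null;
        while (in.hasNext()) {
            String token = in.next();
            if (!stopWords.contains(token)) {
                next = token;
                break;
            }
        }
    }

    public boolean hasNext() {
        return next != null;
    }

    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String result = next;
        advance();
        return result;
    }
}

// Hypothetical stand-in for an analyzer: the invariant piece (the stop set)
// is an instance variable built once in the constructor.
class SketchAnalyzer {
    private final Set<String> stopWords;

    SketchAnalyzer(String[] words) {
        this.stopWords = Collections.unmodifiableSet(new HashSet<String>(Arrays.asList(words)));
    }

    // Called once per document; only the per-document wrappers are allocated here.
    Iterator<String> tokenStream(String text) {
        Iterator<String> tokens = Arrays.asList(text.toLowerCase().split("\\s+")).iterator();
        return new SketchStopFilter(tokens, stopWords);
    }
}
```

With this shape, the cost of building the set is paid once per analyzer rather than once per document, and wrapping it as unmodifiable guards against a filter changing the shared instance.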


With the change to HashSet, any custom analyzers will still be using the Hashtable ctor, thinking it is the most efficient one, when it no longer is. (Once the dust settles on this change, I'll convert the built-in code to use the new methods.) Is this a problem?
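
For illustration, a minimal sketch of the two-constructor situation I mean. StopWordsSketch is a hypothetical class, not the actual StopFilter, and it assumes the Hashtable ctor converts the table's keys into a set while the Set ctor simply keeps the reference it is given:

```java
import java.util.HashSet;
import java.util.Hashtable;
import java.util.Set;

// Hypothetical class showing why the Hashtable path becomes the slower one.
class StopWordsSketch {
    private final Set<String> stopWords;

    // Set constructor: stores the caller's set as-is, so nothing is copied.
    StopWordsSketch(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Hashtable constructor kept for existing callers: copies the keys into
    // a new HashSet on every construction (assumed behavior for this sketch).
    StopWordsSketch(Hashtable<String, ?> stopTable) {
        this(new HashSet<String>(stopTable.keySet()));
    }

    boolean isStopWord(String term) {
        return stopWords.contains(term);
    }

    // Exposed only so the sharing behavior can be observed.
    Set<String> stopWords() {
        return stopWords;
    }
}
```

A custom analyzer that calls the Hashtable ctor from tokenStream would pay for that copy on every document, whereas one that builds a Set once and passes it in would not.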

Erik

