> I suppose org.apache.lucene.analysis.LowerCaseFilter and
> PorterStemFilter modify the Token termText property as an
> optimization. Their next() method will be called once for each token
> for each filter in the chain of filters during analysis. Creating a
> new Token for every modification could create a _lot_ of objects to
> be garbage collected.
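For concreteness, the mutate-in-place pattern being described looks roughly like this. (SimpleToken and MutatingLowerCaseFilter are hypothetical stand-ins for illustration, not the actual org.apache.lucene.analysis classes, whose signatures differ.)

// Simplified stand-in for org.apache.lucene.analysis.Token -- a
// hypothetical class for illustration, not the real Lucene API.
class SimpleToken {
    String termText;          // deliberately mutable in this variant
    final int startOffset;
    final int endOffset;

    SimpleToken(String termText, int startOffset, int endOffset) {
        this.termText = termText;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

// Mutate-in-place style: next() rewrites the term text on the
// existing token rather than allocating a replacement Token.
class MutatingLowerCaseFilter {
    private final java.util.Iterator<SimpleToken> input;

    MutatingLowerCaseFilter(java.util.Iterator<SimpleToken> input) {
        this.input = input;
    }

    SimpleToken next() {
        if (!input.hasNext()) return null;   // end of stream
        SimpleToken t = input.next();
        t.termText = t.termText.toLowerCase(); // one new String, zero new Tokens
        return t;
    }
}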
Well, in the worst case, it would multiply the number of objects created by the tokenization process by about 1.5, since a Token is mostly a wrapper for the term text (a String), and the term text has to be created regardless (two objects: the String and its underlying char array). And if the term text is assembled with StringBuffer operations (explicitly or implicitly), that's more objects still. But the most likely case is considerably better; there are lots of other intermediate objects created as a result of tokenization, too. I'm guessing we're talking about no more than a 15% increase in garbage creation during this particular part of the process. And JVMs have gotten a LOT smarter about dealing with temporary intermediate objects.

Immutability has a lot of advantages. True, it sometimes means a performance hit. But are you sure we've really got one here? First, the time spent in tokenization is a small percentage of Lucene's overall work, since searches are generally more frequent than additions (otherwise we'd just use grep). Second, does anyone feel that Lucene's tokenization is too slow? I sure don't.

Unless someone can demonstrate a real performance problem, I think we're better off doing it "right" -- make Token immutable.
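For comparison, here's a sketch of what the immutable alternative could look like (again hypothetical names, not a patch against the real Token class). It also makes the allocation arithmetic above explicit:

// Immutable alternative: termText is final, so a filter that changes
// the text must allocate a replacement token.
class ImmutableToken {
    final String termText;
    final int startOffset;
    final int endOffset;

    ImmutableToken(String termText, int startOffset, int endOffset) {
        this.termText = termText;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    // Copy-with-new-text; offsets carry over unchanged.
    ImmutableToken withTermText(String newText) {
        return new ImmutableToken(newText, startOffset, endOffset);
    }
}

class ImmutableLowerCaseFilter {
    private final java.util.Iterator<ImmutableToken> input;

    ImmutableLowerCaseFilter(java.util.Iterator<ImmutableToken> input) {
        this.input = input;
    }

    ImmutableToken next() {
        if (!input.hasNext()) return null;   // end of stream
        ImmutableToken t = input.next();
        // One new String plus one new ImmutableToken per modification:
        // three allocations per term (String, char[], Token) instead of
        // two (String, char[]) -- the "1.5x" worst case above.
        return t.withTermText(t.termText.toLowerCase());
    }
}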