Hi Brian,

I've skimmed the analyser code, but I'm by no means a Lucene expert. If I understood you correctly, you are saying that, in the worst case, creating a new Token for every modification would multiply the number of objects created by 1.5 for each tokeniser used. A potential "optimisation" is that the String can sometimes be reused, since it is immutable as well.
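To make the reuse idea concrete, here is a minimal sketch in Java. It is not Lucene's actual Token class: ImmutableToken, withTermText and lowerCase are hypothetical names invented for illustration. The point is only that an immutable token can hand back itself when a filter leaves the text unchanged, so no extra object is created in that case.

```java
import java.util.Locale;

public class ImmutableTokenSketch {

    // Hypothetical immutable token; NOT Lucene's real Token class.
    static final class ImmutableToken {
        private final String termText;
        private final int startOffset;
        private final int endOffset;

        ImmutableToken(String termText, int startOffset, int endOffset) {
            this.termText = termText;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        String termText() {
            return termText;
        }

        // Returns this same instance when the text is unchanged,
        // otherwise a new token carrying the same offsets.
        ImmutableToken withTermText(String newText) {
            if (newText.equals(termText)) {
                return this;
            }
            return new ImmutableToken(newText, startOffset, endOffset);
        }
    }

    // Stand-in for what a lower-casing filter would do to a single token.
    static ImmutableToken lowerCase(ImmutableToken token) {
        return token.withTermText(token.termText().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        ImmutableToken alreadyLower = new ImmutableToken("lucene", 0, 6);
        ImmutableToken mixedCase = new ImmutableToken("Lucene", 7, 13);

        System.out.println(lowerCase(alreadyLower) == alreadyLower); // true: token reused
        System.out.println(lowerCase(mixedCase) == mixedCase);       // false: new token
        System.out.println(lowerCase(mixedCase).termText());         // lucene
    }
}
```

With something like this, a new Token would only be allocated for terms that a filter actually changes.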

Personally, I believe it would be cleaner to make it immutable (I think that's why this thread started), so +1.

Cheers,
Stephane

Brian Goetz wrote:

I suppose org.apache.lucene.analysis.LowerCaseFilter and
PorterStemFilter modify the Token termText property as an
optimization. Their next() method will be called once for each token
for each filter in the chain of filters during analysis. Creating a
new Token for every modification could create a _lot_ of objects to
be garbage collected.

Well, in the worst case, it would multiply by 1.5 the number of
objects created by the tokenization process, since the Token is mostly
a wrapper for the term text (a String), and the term text needs to be
created regardless (two objects -- the String object and its
underlying char array.) And if the term text is put together with
StringBuffer operations (explicitly or implicitly), that's more
objects. But the most likely case is considerably better; there are
lots of other intermediate objects created as a result of
tokenization, too. I'm guessing that you're talking about no more
than a 15% increase in garbage creation during this particular part of
the process. And JVMs have gotten a LOT smarter about dealing with
temporary intermediate objects.
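
One way to read the 1.5 figure, assuming the baseline per modified term is just the two objects named above (the String and its char array) and that an immutable Token adds exactly one extra wrapper object per modification:

```latex
\frac{\text{objects with a new Token}}{\text{objects without}}
  = \frac{2 + 1}{2} = 1.5
```

Any other intermediate objects enlarge both the numerator and the denominator, which is why the likely increase is well below this worst case.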

Immutability has a lot of advantages. True, it sometimes means a
performance hit. But are you sure we've really got one here? First,
the time spent in tokenization is a small percentage of Lucene's
overall work, as searches are generally more frequent than additions
(otherwise we'd just use grep.) Second, does anyone feel that
Lucene's tokenization is too slow? I sure don't.
Unless someone can demonstrate a real performance problem, I think
we're better off doing it "right" -- make Token immutable.

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>



