Yonik Seeley wrote:
> On Nov 19, 2007 7:02 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:
>> Yonik Seeley wrote:
>>> 1) If we are deprecating some methods like String termText(), how
>>> about at the same time deprecating "String type"? If we want
>>> lightweight per-token metadata for communication between filters, an
>>> int or a long used as a bitvector (32 or 64 independent boolean vars
>>> per token) would be much more useful than a single String.
>>
>> There are tokenizers that use the type string, e.g., StandardFilter &
>> similar things in Nutch. How would you replace such uses? Add a bit
>> for each token type? Is that really that much more useful?
>
> It is, given that it enables a token to have more than one type at once.
> The benefit is probably relatively minor (the number of people who
> would use it), and I wouldn't have brought it up except that it could
> piggy-back on the other recent changes to Token.

I'm just a lurker!
However, I'll chime in and say that this sounds interesting. But please
use a long if you do such a thing: better to have some extra bits
available for the future, and given that most future Lucene
installations will run on 64-bit systems, it shouldn't have a
performance impact.
You could, however, use a String[] or a Set<String> to communicate
several types per token (or potentially put comma-separated values in
the one String, but this makes uniquely identifying your particular
token somewhat messy). Will the restriction of 32 core bits and 32 user
bits ever be a problem? What about completely different usages, like
categorizing something into an indefinite number of bins? (Just to be
the devil's advocate.)
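
To make the trade-off concrete, here is a minimal sketch of what a long
used as a bitvector might look like. Every name in it (TokenFlags, the
constants, userBit) is made up for illustration and is not part of
Lucene's Token API; the split into 32 core bits and 32 user bits is only
the assumption discussed above.

```java
// Hypothetical sketch: none of these names exist in Lucene.
final class TokenFlags {
    // Assumed "core" flags, one bit per token type (bits 0..31).
    static final long ALPHANUM = 1L << 0;
    static final long ACRONYM  = 1L << 1;
    static final long HOST     = 1L << 2;

    // Mask covering the assumed core half of the long.
    static final long CORE_MASK = 0xFFFFFFFFL;

    private TokenFlags() {}

    // Returns the mask for user-defined flag n (0..31), stored in bits 32..63.
    static long userBit(int n) {
        if (n < 0 || n > 31) {
            throw new IllegalArgumentException("user bit out of range: " + n);
        }
        return 1L << (32 + n);
    }

    static long set(long flags, long mask)   { return flags | mask; }
    static long clear(long flags, long mask) { return flags & ~mask; }

    // True only if every bit in mask is set in flags.
    static boolean has(long flags, long mask) { return (flags & mask) == mask; }

    public static void main(String[] args) {
        long flags = 0L;
        flags = set(flags, ACRONYM);
        flags = set(flags, HOST);        // a token can carry two types at once
        flags = set(flags, userBit(0));  // plus an application-defined flag
        System.out.println(has(flags, ACRONYM | HOST));
        System.out.println(has(flags, ALPHANUM));
    }
}
```

A token can carry several types at once this way without allocating
anything, but the number of distinct flags is capped at 64. A
Set<String> has no such cap, at the price of an extra object per token.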
Michael mentioned setting some reference to null, with the result
being that GC kicked in more often. If this is the case for that
particular scenario, then please don't optimize along those lines.
Getting rid of your never-to-be-used-again objects as fast as possible
is _always_ good, and if in some strange situation the opposite seems
true, that will probably change radically in the next iteration of GC
development, or for example by setting the huge bunch of GC selection
and tuning parameters correctly, or something.
With that said, obviously reusing the char[] is the better way to go:
not creating an object at all is of course better than dropping an
object and then recreating the same thing moments afterwards.
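
For what it's worth, the reuse idea can be sketched roughly like this.
ReusableTermBuffer and its methods are hypothetical names of my own, not
the actual Token API; the point is only that the same char[] is
overwritten for each token and grows only when a term doesn't fit.

```java
// Hypothetical sketch of buffer reuse; not Lucene's Token class.
final class ReusableTermBuffer {
    private char[] buffer = new char[16]; // initial capacity is an arbitrary choice
    private int length;

    // Copies text into the existing array, allocating only if it is too small.
    void set(CharSequence text) {
        int n = text.length();
        if (n > buffer.length) {
            buffer = new char[Math.max(n, buffer.length * 2)];
        }
        for (int i = 0; i < n; i++) {
            buffer[i] = text.charAt(i);
        }
        length = n;
    }

    // The backing array; only the first length() chars are valid.
    char[] buffer() { return buffer; }

    int length() { return length; }

    // Allocates a String, so callers on the hot path should prefer buffer().
    String text() { return new String(buffer, 0, length); }

    public static void main(String[] args) {
        ReusableTermBuffer term = new ReusableTermBuffer();
        char[] first = term.buffer();
        term.set("lucene");
        term.set("token");
        // Same array instance as long as the terms fit in the buffer.
        System.out.println(first == term.buffer());
        System.out.println(term.text());
    }
}
```

Consumers that read buffer() and length() directly never see a new
object per token, which is the case where there is nothing for the GC
to collect in the first place.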
Have you run your profilers on this question? It seems like a prudent
thing to do if you're in a situation where some API will change anyway.
Thanks for reading my ramblings,
Endre.