Yonik Seeley wrote:
> On Nov 19, 2007 7:02 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:
>> Yonik Seeley wrote:
>>> 1) If we are deprecating some methods like String termText(), how
>>> about at the same time deprecating "String type"?  If we want
>>> lightweight per-token metadata for communication between filters, an
>>> int or a long used as a bitvector (32 or 64 independent boolean vars
>>> per token) would be much more useful than a single String.
>> There are tokenizers that use the type string, e.g., StandardFilter &
>> similar things in Nutch.  How would you replace such uses?  Add a bit
>> for each token type?  Is that really that much more useful?
>
> It is, given that it enables a token to have more than one type at once.
> The benefit is probably relatively minor (the number of people who
> would use it), and I wouldn't have brought it up except that it could
> piggy-back on the other recent changes to Token.

I'm just a lurker!

However, I'll chime in and say that this sounds interesting. But please use a long if you do such a thing: it is better to have some extra bits available for the future, and given that most future Lucene installations will run on 64-bit systems, it shouldn't have a performance impact.
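
To make the bitvector idea concrete, here is a minimal sketch of what per-token flags in a long might look like. All names in it (TokenFlags, ALPHANUM, ACRONYM, HOST, and the helper methods) are made up for illustration and are not part of Lucene's Token API; it only shows how one token could carry several types at once.

```java
// Hypothetical sketch: none of these names exist in Lucene.
final class TokenFlags {
    // One bit per type; a long leaves room for 64 independent flags.
    static final long ALPHANUM = 1L << 0;
    static final long ACRONYM  = 1L << 1;
    static final long HOST     = 1L << 2;

    private TokenFlags() {}

    // Returns flags with every bit of mask switched on.
    static long set(long flags, long mask) {
        return flags | mask;
    }

    // Returns flags with every bit of mask switched off.
    static long clear(long flags, long mask) {
        return flags & ~mask;
    }

    // True only if all bits of mask are present in flags.
    static boolean has(long flags, long mask) {
        return (flags & mask) == mask;
    }

    public static void main(String[] args) {
        long flags = 0L;
        flags = set(flags, ACRONYM);
        flags = set(flags, HOST); // the same token now has two types
        System.out.println("acronym:  " + has(flags, ACRONYM));
        System.out.println("host:     " + has(flags, HOST));
        System.out.println("alphanum: " + has(flags, ALPHANUM));
    }
}
```

A filter would test a flag with a single AND and compare, so no String comparison and no allocation is involved.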

You could, however, use a String[] or a Set<String> to communicate (or potentially use comma-separated values in the one String, but this makes uniquely identifying your particular token somewhat messy). Will the restriction to 32 core bits and 32 user bits ever be a problem? What about completely different usages, like categorizing something into an indefinite number of bins? (Just playing devil's advocate.)
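
For comparison, the Set<String> alternative could be sketched roughly as follows. TypedToken and its methods are hypothetical names, not an existing Lucene class. The point is only that a set has no 64-type ceiling, at the price of extra objects per token.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: a token carrying an open-ended set of type names.
final class TypedToken {
    private final String text;
    // LinkedHashSet keeps insertion order and ignores duplicate types.
    private final Set<String> types = new LinkedHashSet<String>();

    TypedToken(String text) {
        this.text = text;
    }

    String text() {
        return text;
    }

    void addType(String type) {
        types.add(type);
    }

    boolean hasType(String type) {
        return types.contains(type);
    }

    // Read-only view so that filters cannot modify the set behind our back.
    Set<String> types() {
        return Collections.unmodifiableSet(types);
    }

    public static void main(String[] args) {
        TypedToken token = new TypedToken("example.com");
        token.addType("HOST");
        token.addType("ALPHANUM");
        token.addType("HOST"); // duplicate, the set is unchanged
        System.out.println(token.text() + " " + token.types());
    }
}
```

Whether the flexibility is worth one set allocation per token, compared with a primitive long, is exactly the kind of thing a profiler should decide.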

A Michael mentioned setting some reference to null, with the result that GC kicked in more often. If that is the case for that particular scenario, then please don't optimize along those lines. Getting rid of your never-to-be-used-again objects as fast as possible is _always_ good, and if in some strange situation it seems otherwise, that will probably change radically in the next iteration of GC development, or for example by setting the huge bunch of GC selection and tuning parameters correctly, or something like that.

With that said, obviously reusing the char[] is the better way to go: not creating an object at all is of course better than dropping an object and then recreating the same thing moments afterwards.
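
As a rough illustration of what I mean by reuse, here is a sketch of a buffer that is kept across tokens and only reallocated when a term does not fit. ReusableTermBuffer and its methods are hypothetical names and this is not how Lucene's Token is actually implemented.

```java
// Hypothetical sketch: one char[] reused for every term instead of a new one each time.
final class ReusableTermBuffer {
    private char[] buffer = new char[16];
    private int length;

    // Copies the term into the existing buffer, growing it only when needed.
    void setTerm(char[] src, int offset, int len) {
        if (len > buffer.length) {
            // At least double the size so repeated growth stays rare.
            buffer = new char[Math.max(len, buffer.length * 2)];
        }
        System.arraycopy(src, offset, buffer, 0, len);
        length = len;
    }

    // The backing array; only the first length() chars are valid.
    char[] buffer() {
        return buffer;
    }

    int length() {
        return length;
    }

    // Convenience accessor; note that this one does allocate a String.
    String termText() {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        ReusableTermBuffer term = new ReusableTermBuffer();
        String[] words = {"hello", "hi", "lucene"};
        for (String word : words) {
            term.setTerm(word.toCharArray(), 0, word.length());
            System.out.println(term.termText());
        }
    }
}
```

As long as terms fit, the same array is handed back for every token, so there is nothing for the GC to collect in the first place.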

Have you run your profilers on this question? It seems like a prudent thing to do if you're in a situation where some API will change anyway.

Thanks for reading my ramblings,
Endre.
