"Yonik Seeley" <[EMAIL PROTECTED]> wrote:

> On Nov 19, 2007 6:52 PM, Michael Busch <[EMAIL PROTECTED]> wrote:
> > Yonik Seeley wrote:
> > >
> > > So I think we all agree to do payloads by reference (do not make a
> > > copy of the byte[] like termBuffer does), and to allow payload reuse.
> > >
> > > So now we still have 3 viable options on the table, I think:
> > >   Token{ byte[] payload, int payloadLength, ... }
> > >   Token{ byte[] payload, int payloadOffset, int payloadLength, ... }
> > >   Token{ Payload p, ... }
> >
> > I'm for option 2. I agree that it is worthwhile to allow filters to
> > modify the payloads. And I'd like to optimize for the case where lots
> > of tokens have payloads, and option 2 seems therefore the way to go.
>
> Just to play devil's advocate: it seems like adding the byte[]
> directly to Token gains less than we might have been thinking if we
> have reuse in any case. A TokenFilter could reuse the same Payload
> object for each term in a Field, so the CPU allocation savings is
> closer to a single Payload per field using payloads.
>
> If we used a Payload object, it would save 8 bytes per Token for
> fields not using payloads.
> Besides an initial allocation per field, the additional cost of using
> a Payload field would be an additional dereference (but that should be
> really minor).
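For reference, "option 2" above could be sketched roughly as follows. This is a minimal illustration, not the actual Lucene API: the field and class names are assumptions, and the point is only that the payload bytes are held by reference (with an offset and length) so a TokenFilter can keep rewriting one shared buffer instead of allocating per token.

```java
// Sketch of "option 2": Token carries the payload byte[] by reference,
// plus offset/length, so one buffer can be reused for a whole field.
// Names are illustrative only, not the real org.apache.lucene.analysis.Token.
public class TokenSketch {
    static final class Token {
        byte[] payload;      // by reference -- never copied
        int payloadOffset;
        int payloadLength;
    }

    public static void main(String[] args) {
        byte[] shared = new byte[16];   // one buffer for the whole field
        Token t = new Token();          // one Token, reused per the convention

        // A filter writes a 2-byte payload for the first token...
        shared[0] = 1; shared[1] = 2;
        t.payload = shared; t.payloadOffset = 0; t.payloadLength = 2;

        // ...and simply overwrites the same buffer for the next token:
        // no new allocation, and the Token sees the change by reference.
        shared[0] = 9;
        System.out.println(t.payload == shared);  // true: same array object
        System.out.println(t.payload[0]);         // 9: update is visible
    }
}
```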
These are excellent points. I guess I would lean [back] towards keeping the separate Payload object and extending its API to allow re-use and modification of its byte[].

I'm now even wondering whether the char[] termBuffer should be by reference (again!), too? This would save one copy for those TokenStreams that can provide a reference to their own char[] buffers (e.g. CharTokenizer).

Mike
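The "separate Payload object with an API extended for re-use" idea could look something like the sketch below. The method and field names here are assumptions for illustration, not the actual Lucene Payload class: the key point is that setData() takes the byte[] by reference, so a filter pays one Payload allocation per field and then mutates the buffer in place.

```java
// Hedged sketch of a reusable Payload object: the byte[] is held by
// reference and can be swapped or rewritten via setData() without
// allocating a new Payload per token. Names are illustrative only.
public class PayloadSketch {
    static final class Payload {
        byte[] data;
        int offset;
        int length;

        // Re-point this Payload at (a slice of) an existing buffer.
        void setData(byte[] data, int offset, int length) {
            this.data = data;     // by reference, no copy
            this.offset = offset;
            this.length = length;
        }
    }

    public static void main(String[] args) {
        Payload p = new Payload();          // one allocation per field
        byte[] buf = new byte[] { 42, 7 };

        p.setData(buf, 0, 2);
        buf[0] = 99;                        // filter rewrites its own buffer
        System.out.println(p.data[0]);      // 99: visible through the Payload
        System.out.println(p.length);       // 2
    }
}
```

The trade-off discussed in the thread shows up directly here: consumers pay one extra dereference (`p.data` rather than a field on Token), but Tokens for fields without payloads carry only a single null reference.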