"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > I had previously missed the changes to Token that add support for > using an array (termBuffer): > > + // For better indexing speed, use termBuffer (and > + // termBufferOffset/termBufferLength) instead of termText > + // to save new'ing a String per token > + char[] termBuffer; > + int termBufferOffset; > + int termBufferLength; > > While I think this approach would have been best to start off with > rather than String, > I'm concerned that it will do little more than add overhead at this > point, resulting in slower code, not faster. > > - If any tokenizer or token filter tries setting the termBuffer, any > downstream components would need to check for both. It could be made > backward compatible by constructing a string on demand, but that will > really slow things down, unless the whole chain is converted to only > using the char[] somehow.
Good point: if your analyzer/tokenizer produces char[] tokens then your
downstream filters would have to accept char[] tokens.

I think on-demand constructing a String (and saving it as termText)
would be an OK solution?  Why would that be slower than having to make
a String in the first place (if we didn't have the char[] API)?  It's
at least graceful degradation.

> - It doesn't look like the indexing code currently pays any attention
>   to the char[], right?

It does, in DocumentsWriter.addPosition().

> - What if both the String and char[] are set?  A filter that doesn't
>   know better sets the String... this doesn't clear the char[]
>   currently, should it?

Currently the char[] wins, but good point: seems like each setter
should null out the other one?

Mike
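(A minimal sketch of the two ideas above -- constructing the String on
demand, and having each setter null out the other representation so the
two can never disagree.  Method names are assumed for illustration;
this is not the actual patch:)

    class Token {
      private String termText;      // may be null
      private char[] termBuffer;    // may be null
      private int termBufferOffset;
      private int termBufferLength;

      public void setTermText(String text) {
        termText = text;
        termBuffer = null;          // String now wins
      }

      public void setTermBuffer(char[] buf, int offset, int length) {
        termBuffer = buf;
        termBufferOffset = offset;
        termBufferLength = length;
        termText = null;            // char[] now wins
      }

      // Graceful degradation: build (and cache) the String only if a
      // legacy consumer actually asks for it.
      public String termText() {
        if (termText == null && termBuffer != null)
          termText = new String(termBuffer, termBufferOffset, termBufferLength);
        return termText;
      }

      public char[] termBuffer() { return termBuffer; }
      public int termBufferOffset() { return termBufferOffset; }
      public int termBufferLength() { return termBufferLength; }
    }

(The cost on the legacy path would be one String allocation the first
time termText() is called per token -- the same allocation the
String-only API paid for every token anyway.)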
