[
https://issues.apache.org/jira/browse/LUCENE-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless resolved LUCENE-1181.
----------------------------------------
Resolution: Won't Fix
> Token reuse is not ideal for avoiding array copies
> --------------------------------------------------
>
> Key: LUCENE-1181
> URL: https://issues.apache.org/jira/browse/LUCENE-1181
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: 2.3
> Reporter: Trejkaz
>
> The way the Token API is currently written results in two unnecessary array
> copies which could be avoided by changing the way it works.
> 1. setTermBuffer(char[],int,int) calls resizeTermBuffer(int) which copies the
> original term text even though it's about to be overwritten.
> #1 should be trivially fixable by introducing a private
> resizeTermBuffer(int,boolean) where the new boolean parameter specifies
> whether the existing term data gets copied over or not.
> 2. setTermBuffer(char[],int,int) copies what you pass in, instead of actually
> setting the term buffer.
> Setting aside the fact that the setTermBuffer method is misleadingly named,
> consider a token filter which performs Unicode normalisation on each token.
> How it has to be implemented at present:
> once:
> - create a reusable char[] for storing the normalisation result
> every token:
> - use getTermBuffer() and getTermLength() to get the buffer and relevant
> length
> - normalise the original string into our temporary buffer (if it isn't
> big enough, grow the temp buffer size.)
> - setTermBuffer(char[],int,int) - this does an extra copy.
> The following sequence would be much better:
> once:
> - create a reusable char[] for storing the normalisation result
> every token:
> - use getTermBuffer() and getTermLength() to get the buffer and relevant
> length
> - normalise the original string into our temporary buffer (if it isn't
> big enough, grow the temp buffer size.)
> - setTermBuffer(char[],int,int) sets our buffer by reference
> - keep the term buffer which used to be in the Token as our new temp
> buffer.
> The latter sequence results in no copying with the exception of the
> normalisation itself, which is unavoidable.
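As a rough illustration of the two proposals in the description above, here is a minimal sketch in Java. It uses a hypothetical stand-in class, SketchToken, rather than Lucene's real Token. The resizeTermBuffer(int,boolean) overload follows the reporter's suggestion for point 1, and swapTermBuffer is an invented name for the by-reference behaviour asked for in point 2. Neither is confirmed to exist in Lucene, and this issue was resolved as Won't Fix.

```java
/** Hypothetical stand-in for Lucene's Token; illustrative only, not the real class. */
class SketchToken {
    private char[] termBuffer = new char[16];
    private int termLength;

    char[] termBuffer() { return termBuffer; }

    int termLength() { return termLength; }

    /** Point 1: grow the buffer, copying the old term only when preserve is true. */
    private void resizeTermBuffer(int newSize, boolean preserve) {
        if (newSize <= termBuffer.length) {
            return; // already large enough, nothing to allocate
        }
        char[] grown = new char[newSize];
        if (preserve) {
            System.arraycopy(termBuffer, 0, grown, 0, termLength);
        }
        termBuffer = grown;
    }

    /** Copying setter: the old term is about to be overwritten, so do not preserve it. */
    void setTermBuffer(char[] src, int offset, int length) {
        resizeTermBuffer(length, false);
        System.arraycopy(src, offset, termBuffer, 0, length);
        termLength = length;
    }

    /**
     * Point 2 (assumed API): adopt the caller's array by reference and hand back
     * the previous buffer so the caller can reuse it as scratch space.
     */
    char[] swapTermBuffer(char[] replacement, int length) {
        char[] previous = termBuffer;
        termBuffer = replacement;
        termLength = length;
        return previous;
    }
}
```

Under these assumptions, a normalising filter would write its result into a scratch array and then call scratch = token.swapTermBuffer(scratch, resultLength), so the only per-token copy is the normalisation itself. One trade-off of this design is that the token and the filter exchange ownership of arrays, so the filter must not touch the array it handed over until it gets one back.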
--
This message is automatically generated by JIRA.