[ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622201#action_12622201 ]
DM Smith commented on LUCENE-1333:
----------------------------------

{quote}
Cloning Tokens is not cheap, as I recall. In fact, I seem to recall testing that it was cheaper to do a new. Now, maybe that is fixed in this issue, but I am not sure.
{quote}

I was going on hearsay when I uniformly used clone() rather than new when creating a deep copy of an existing token. I was under the impression that clone was faster than new for equivalent work. The test is rather simple and worth doing before accepting this issue; I don't think I have time to do it today. The equivalent of clone is (done from memory, so this is close):

    Token token = new Token(oldToken.startOffset(), oldToken.endOffset(), oldToken.getFlags(), oldToken.type());
    token.setPositionIncrement(oldToken.positionIncrement());
    if (oldToken.getPayload() != null) {
        Payload p = new Payload(....); // Create a new Payload with a deep copy of the payload
    }

While this might be faster, it has two flaws that clone avoids: clone has direct access to the parts, so it avoids method calls, and it is future proof. If a new field is added to Token, clone will automatically carry it forward. There are a couple of places in the code where:

    public Token(Token prototype) // only if new is faster

and

    public void copyFrom(Token prototype)

would beneficially solve these maintenance issues.

{quote}
But how do you cope with reset()?
{quote}

This problem is a bug in the existing code. Today, one can create a chain of TokenFilters, each of which calls input.next() or input.next(token), any one of which may modify the return value. It does not matter which is invoked: if the returned token is held in a cache, the cache is corrupted. Every cache of Tokens needs to ensure that its cache is immutable on creation. It also needs to ensure that it is immutable on usage if the tokens can be served more than once.
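The proposed copy constructor and copyFrom(Token) could look roughly like the sketch below. This uses a simplified stand-in for Lucene's Token (field names and the byte[] payload are assumptions for illustration, not the real 2.3 API); the point is that direct field access avoids accessor calls and keeps the deep copy in one maintainable spot:

```java
// Simplified stand-in for Lucene's Token, illustrating the proposed
// Token(Token) copy constructor and copyFrom(Token). Not the real class.
class Token {
    char[] termBuffer;
    int termLength;
    int startOffset, endOffset;
    String type = "word";
    int flags;
    int positionIncrement = 1;
    byte[] payload; // stand-in for the real Payload class

    Token(int start, int end) {
        this.startOffset = start;
        this.endOffset = end;
    }

    // Proposed copy constructor: one place to maintain the deep copy.
    Token(Token prototype) {
        copyFrom(prototype);
    }

    // Proposed copyFrom: direct field access, no accessor calls, and a
    // single spot to update if a new field is ever added to Token.
    final void copyFrom(Token prototype) {
        if (prototype.termBuffer != null)
            termBuffer = java.util.Arrays.copyOf(prototype.termBuffer, prototype.termBuffer.length);
        termLength = prototype.termLength;
        startOffset = prototype.startOffset;
        endOffset = prototype.endOffset;
        type = prototype.type;
        flags = prototype.flags;
        positionIncrement = prototype.positionIncrement;
        if (prototype.payload != null)
            payload = prototype.payload.clone(); // deep copy of the payload bytes
    }
}
```

Because copyFrom deep-copies the term buffer and payload, later mutation of the original token cannot corrupt a cached copy, which is exactly the immutability property the caching discussion above calls for.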
Two personal opinions:
* Caches that don't implement reset() should return cache.remove(0) [or equivalent] so it is clear that the cache can only be used once.
* Caches should not be used except where they give a clear performance advantage.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch,
> LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch,
> LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch,
> LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch,
> LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch,
> LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch,
> LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch,
> LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch,
> LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference, so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/[EMAIL PROTECTED]
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
>
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4.
> The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token); next() should be deprecated so that reuse is maximized, and descendant classes should be rewritten to override next(Token).
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that take a Token. This would simplify the use of the two together.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
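The reuse pattern behind point 6 (deprecating next() in favor of next(Token)) can be sketched as below. The class shapes are simplified stand-ins for Lucene's TokenStream/TokenFilter, not the real API; the point is that the filter mutates the caller's token in place instead of allocating a new Token per call:

```java
// Minimal stand-in Token: just the term buffer, for illustration.
class Token {
    char[] termBuffer = new char[0];
    int termLength;
    String term() { return new String(termBuffer, 0, termLength); }
}

// Stand-in for Lucene's TokenStream with only the reusable variant.
abstract class TokenStream {
    // Fill in `result` (or return a different Token) and return it;
    // return null at end of stream.
    public abstract Token next(Token result);
}

// A source stream that emits each given string once, reusing the
// caller's Token rather than allocating a new one per term.
class ArrayTokenStream extends TokenStream {
    private final String[] terms;
    private int i;
    ArrayTokenStream(String... terms) { this.terms = terms; }

    public Token next(Token result) {
        if (i >= terms.length) return null;
        String s = terms[i++];
        result.termBuffer = s.toCharArray();
        result.termLength = result.termBuffer.length;
        return result;
    }
}

// A filter that overrides next(Token) and lowercases in place:
// no per-token allocation anywhere in the chain.
class LowerCaseFilter extends TokenStream {
    private final TokenStream input;
    LowerCaseFilter(TokenStream input) { this.input = input; }

    public Token next(Token result) {
        Token t = input.next(result); // upstream may reuse `result`
        if (t == null) return null;
        for (int i = 0; i < t.termLength; i++)
            t.termBuffer[i] = Character.toLowerCase(t.termBuffer[i]);
        return t;
    }
}
```

Note that this reuse is exactly why the caching bug discussed above matters: a cache that stores these shared, mutable tokens without deep-copying them will be corrupted by the next filter in the chain.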