[ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621417#action_12621417 ]

DM Smith commented on LUCENE-1333:
----------------------------------

Caching flies in the face of reuse. I think a comment to that effect needs to be 
somewhere in the code.
Putting Tokens into a collection requires that the reusable token be copied, 
i.e. via new or clone(). One cannot directly store the reusable Token itself: 
neither the argument passed to next(Token) nor the value returned from it.

If a caching TokenStream is also resettable, then the tokens coming from it need 
to be protected from being changed. This means it needs to return copies. 
(Perhaps comment reset() to that effect?)
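
To make the copy-on-store and copy-on-return points concrete, here is a minimal 
sketch against the 2.4-style next(Token) API. The class name is invented for 
illustration; the point is only where the copies have to happen.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical caching, resettable filter that copies tokens both when
// filling the cache and when replaying it.
public class CopyingCachingFilter extends TokenFilter {
  private List cache;        // cached Tokens (pre-generics style)
  private Iterator iterator; // replay position

  public CopyingCachingFilter(TokenStream input) {
    super(input);
  }

  public Token next(Token reusableToken) throws IOException {
    if (cache == null) {
      cache = new ArrayList();
      // Fill the cache: the reusable token must be cloned before it is
      // stored, because the producer overwrites it on every call.
      for (Token t = input.next(reusableToken); t != null; t = input.next(reusableToken)) {
        cache.add(t.clone());
      }
      iterator = cache.iterator();
    }
    if (!iterator.hasNext()) {
      return null;
    }
    // Replay: hand out a copy so callers cannot modify the cached token,
    // which would otherwise corrupt every later replay after reset().
    return (Token) ((Token) iterator.next()).clone();
  }

  public void reset() throws IOException {
    if (cache != null) {
      iterator = cache.iterator();
    }
  }
}
{code}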

The only value I see in caching is when computing the token stream is so 
expensive that reusing the cached result yields a significant saving.

(The current code does not take such care, which is a bug; this patch fixes it. 
The problem would have become obvious if the cache were used in a reuse context.)

Some TokenStreams cache Tokens internally (as a detail of their implementation) 
and then return them incrementally. Many of these can be rewritten to compute 
the next Token when next(Token) is called. This would improve both time and 
space usage.

(I felt that such changes were outside the scope of this patch.)
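
Purely to illustrate the kind of rewrite I mean, here is a sketch of a 
hypothetical filter that splits terms on '_' and computes each piece lazily 
inside next(Token), instead of building a list of sub-tokens up front. The 
class name and delimiter are invented, and the Token calls assume the 2.4-era 
reuse API.

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical filter: keeps just enough state to compute each sub-token
// on demand instead of caching them all internally.
public class LazySplitFilter extends TokenFilter {
  private char[] buffer;   // term text of the current input token
  private int length;      // its length
  private int pos;         // next split position within the buffer
  private int startOffset; // start offset of the current input token

  public LazySplitFilter(TokenStream input) {
    super(input);
  }

  public Token next(Token reusableToken) throws IOException {
    while (true) {
      if (buffer == null) {
        Token t = input.next(reusableToken);
        if (t == null) return null;
        // Copy the term text out of the (possibly reusable) token before
        // the output token below overwrites it.
        length = t.termLength();
        buffer = new char[length];
        System.arraycopy(t.termBuffer(), 0, buffer, 0, length);
        startOffset = t.startOffset();
        pos = 0;
      }
      // Skip delimiters, then find the end of the next piece.
      while (pos < length && buffer[pos] == '_') pos++;
      if (pos >= length) {   // this input token is exhausted
        buffer = null;
        continue;
      }
      int end = pos;
      while (end < length && buffer[end] != '_') end++;
      reusableToken.clear();
      reusableToken.setTermBuffer(buffer, pos, end - pos);
      reusableToken.setStartOffset(startOffset + pos);
      reusableToken.setEndOffset(startOffset + end);
      pos = end;
      return reusableToken;
    }
  }
}
{code}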

All this leads to my response to the NGram filter.

The NGram filters could be improved in this fashion. This would eliminate the 
clone() problem noted above.

But failing that, a variant of clone() to solve the intermediate problem would 
work. So would using new Token(...). The problem with new Token(...) is that it 
requires manually propagating flags, payloads, offsets, and types, and it is not 
resilient to fields added to Token in the future.
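
For comparison, assuming the 2.4-era Token accessors, the two styles of copying 
look roughly like this; note how much has to be carried over by hand in the 
second case:

{code:java}
// Full copy: picks up term text, offsets, type, flags, payload and position
// increment, and keeps working if Token grows new fields later.
Token stored = (Token) reusableToken.clone();

// Manual copy via new Token(): every field must be propagated by hand, and any
// field added to Token in the future is silently dropped.
Token copied = new Token();
copied.setTermBuffer(reusableToken.termBuffer(), 0, reusableToken.termLength());
copied.setStartOffset(reusableToken.startOffset());
copied.setEndOffset(reusableToken.endOffset());
copied.setType(reusableToken.type());
copied.setFlags(reusableToken.getFlags());
copied.setPayload(reusableToken.getPayload());
copied.setPositionIncrement(reusableToken.getPositionIncrement());
{code}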

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, 
> LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, 
> LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, 
> LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, 
> LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, 
> LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, 
> LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, 
> LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference 
> so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/[EMAIL 
> PROTECTED]
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to 
> use the class.
> 2. Deprecations are incomplete. The constructors that take String as an 
> argument and the methods that take and/or return String should *all* be 
> deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting 
> buffer can be over-allocated. A less aggressive algorithm would be better. In 
> the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be 
> upgraded now rather than waiting for 3.0. As it stands, filter chains that 
> alternate between char[] and String are sub-optimal. Currently, it is used in 
> core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token); next() should be deprecated so 
> that reuse is maximized, and descendant classes should be rewritten to 
> override next(Token).
> 7. Tokens are often stored as a String in a Term. It would be good to add 
> constructors that took a Token. This would simplify the use of the two 
> together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

