[ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621132#action_12621132 ]

DM Smith commented on LUCENE-1333:
----------------------------------

I'll give my analysis here. Feel free to make the change or kick it back to me 
to make it, if you think your pattern is best. (If I do it, it will be after 
this weekend.)

I've tried to be consistent here. The prior pattern was inconsistent and often 
was:
{code}
Token token = null;
while ((token = input.next()) != null) {
{code}

There were other variants, including "forever loops". As you noticed, I 
replaced these with a single consistent pattern.
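For reference, the "forever loop" variant ran along these lines (a 
reconstruction of the shape, not a quote from any particular file):
{code}
while (true) {
  Token token = input.next();
  if (token == null)
    break;
  // ... use token ...
}
{code}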
There are two basic implementations of Token next(Token) (both roles are 
sketched in code after this list):
1) Producer: These create tokens from input. Their pattern is to take their 
argument, call clear() on it, and then set startOffset, endOffset and type 
appropriately. Their assumption is that they must start with a pristine token 
and that, other than its allocated buffer space, nothing about the passed-in 
token can be reused.

2) Consumer: These "filter" their argument. Their only assumption is that 
somewhere earlier in the call chain a producer created the token they are 
asked to reuse. In this case, they typically preserve startOffset and 
endOffset, because those represent the position of the token in the input. 
They may refine type, flags and payload, but otherwise have to preserve them. 
Most typically, they will set the termBuffer. There are a few types of 
consumers. Here are some:
a) Transformational Filters: They take their argument and transform its 
termBuffer.
b) Splitting Filters: They take their argument and split the token into 
several. Sometimes they return the original; other times just the parts. 
When creating these tokens, calling clone() on the prototype preserves 
flags, payload, start and end offsets, and type. These clones are sometimes 
stored in a buffer, but sometimes are computed incrementally with each call 
to next(Token). In the latter case, they will typically cache a clone of the 
passed-in token. I think that, when possible, incremental computation is 
preferable, but at the cost of a less obvious implementation.
c) Caching Filters: If their buffer is empty, they repeatedly call result = 
input.next(token), clone each result, and cache the clones in some 
collection. Once full, they return their buffer's content. If the caching 
filter is resettable, it must return clones of its content; otherwise, 
downstream consumers may change their arguments, disastrously.
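To make the two roles concrete, here is a minimal sketch of each against the 
2.3/2.4-era API. The class names, the "CHAR" type string and the 
one-character-per-token scheme are invented for illustration; only the 
Token/TokenStream/TokenFilter calls are real.
{code}
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Producer: creates tokens from input. It clears the reusable token and sets
// termBuffer, offsets and type itself; nothing from the caller's token
// survives except its allocated buffer space.
class CharTokenStream extends TokenStream {
  private final Reader input;
  private int offset = 0;

  CharTokenStream(Reader input) {
    this.input = input;
  }

  public Token next(Token reusableToken) throws IOException {
    int c = input.read();
    if (c == -1)
      return null;
    reusableToken.clear();
    char[] buffer = reusableToken.resizeTermBuffer(1);
    buffer[0] = (char) c;
    reusableToken.setTermLength(1);
    reusableToken.setStartOffset(offset);
    reusableToken.setEndOffset(++offset);
    reusableToken.setType("CHAR");
    return reusableToken;
  }
}

// Consumer (transformational filter): rewrites the termBuffer in place,
// preserving the offsets, type, flags and payload the producer set.
class LowerCasingFilter extends TokenFilter {
  LowerCasingFilter(TokenStream in) {
    super(in);
  }

  public Token next(Token reusableToken) throws IOException {
    Token token = input.next(reusableToken);
    if (token == null)
      return null;
    char[] buffer = token.termBuffer();
    int length = token.termLength();
    for (int i = 0; i < length; i++)
      buffer[i] = Character.toLowerCase(buffer[i]);
    return token;
  }
}
{code}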

Callers of Token next(Token) have the responsibility of never calling it with 
a null token. (I think producers probably should check for null and create a 
token if need be, as sketched below. But I don't think that is what they do 
now.)
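If producers did make that check, it would be a one-line guard at the top of 
next(Token), something like this sketch:
{code}
public Token next(Token reusableToken) throws IOException {
  // Suggested defensive check (not what producers do today):
  if (reusableToken == null)
    reusableToken = new Token();
  // ... produce into reusableToken as usual ...
  return reusableToken;
}
{code}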

The upshot of all of this is that producers don't care which token they 
reuse. Whether it came from the original loop or from the result of the last 
call to token = stream.next(token), both are equally good: the token 
pre-existed and needs to be fully reset. Consumers presume that the token was 
produced (or at least appropriately re-initialized and filled in) by a 
producer.

Your form of the loop is advisable in a few places, most typically with a 
loop within a loop, where the inner loop iterates over all the tokens in a 
stream. In this case, the final Token would be created outside the outer 
loop, as in the example below. Your pattern would encourage maximal reuse 
there; using mine, the programmer would have to figure out when it was 
appropriate to do one or the other.
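For example (the analyzer, field names and values are hypothetical; the 
point is that the one reusable Token outlives both loops):
{code}
final Token reusableToken = new Token();
for (int i = 0; i < fields.length; i++) {
  TokenStream stream =
      analyzer.tokenStream(fields[i], new StringReader(values[i]));
  for (Token token = stream.next(reusableToken); token != null;
       token = stream.next(reusableToken)) {
    // ... consume token; its contents are only valid until the next call ...
  }
  stream.close();
}
{code}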

The other value to your pattern is that next(Token) is always called with a 
non-null Token.

I think that calling the token "result" is not the best name. It is a bit 
confusing, as it is not the result of calling next(Token). Perhaps, to make 
reuse acutely obvious:
{code}
final Token reusableToken = new Token();
for (Token token = stream.next(reusableToken); token != null;
     token = stream.next(reusableToken)) {
{code}



> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, 
> LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, 
> LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, 
> LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, 
> LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, 
> LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, 
> LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, 
> LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference 
> so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/[EMAIL PROTECTED]
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to 
> use the class.
> 2. Deprecations are incomplete. The constructors that take String as an 
> argument and the methods that take and/or return String should *all* be 
> deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting 
> buffer can be over-allocated. A less aggressive algorithm would be better. In 
> the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be 
> upgraded now rather than waiting for 3.0. As it stands, filter chains that 
> alternate between char[] and String are sub-optimal. Currently, it is used in 
> core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token); next() should be deprecated so 
> that reuse is maximized, and descendant classes should be rewritten to 
> override next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add 
> constructors that took a Token. This would simplify the use of the two 
> together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

