[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731555#action_12731555 ]
Uwe Schindler edited comment on LUCENE-1693 at 7/15/09 10:04 AM:
-----------------------------------------------------------------

After thinking the whole day about a solution for overriding deprecated methods, I came to one conclusion/solution that would create a "visible" backwards break (to be noted in CHANGES.txt). Mike's idea from LUCENE-1678 is good, but very complicated for this issue and may lead to unpredictable behavior.

What makes me think this will not be a problem for developers is the fact that there is no JIRA issue about a similar break in the past: when Lucene switched from next() to next(reusableToken), we also had a compatibility method in TokenStream that delegated to next(new Token()). Core streams did *not* implement the old method, and the indexer code only called next(Token). If somebody had overridden only the old next() method of a core tokenstream, that method would never have been called -> boom, we had a break, but nobody noticed. With the new patch we have the same situation in 2.9 for incrementToken() vs. next(Token) and also next(). In principle, the same issue as in LUCENE-1678.

The good thing is that most TokenStreams in core are final, except the following ones:
- ISOLatin1Filter
- KeywordTokenizer
- StandardTokenizer
...and last but not least the whole hierarchy of subclasses of CharTokenizer. The good thing is (and thanks to the developer!) that they are correctly implemented: their methods incrementToken() and next(Token) are declared *final*. So the class is not final, but the affected methods are, and nobody can override them. All subclasses of CharTokenizer are therefore also not affected. My latest patch also includes this *final* modifier for the abstract CharTokenizer:

{code}
public final Token next(final Token reusableToken) throws IOException {
  // Overriding this method to make it final as before has no effect for the reflection-wrapper in TokenStream.
  // TokenStream.hasReusableNext is true because of this, but it is never used, as incrementToken() has preference.
  return super.next(reusableToken);
}
{code}

So it is not overridable and is still compatible (code calling next(Token) is delegated to incrementToken() by the superclass). For complete correctness, next() should be overridden in the same way. In both cases the super's method always delegates preferably to incrementToken(), so even if a subclass of TokenStream overrides this method (and thus hasNext == true and hasReusableNext == true), incrementToken() is still preferred, and everything works.

This prevents users from overriding next() or next(Token) of core or contrib tokenstreams (which in my opinion nobody has ever done, because otherwise we would have gotten a bug report during the last transition). For those people who really have done it (they used one of the three classes above as the superclass of their own class), the error would be hard to detect: their TokenStream would simply not work, as next()/next(Token) is never called. To produce a compile error for them (or a runtime error when they instantiate such a class), I suggest including this backwards break, which is better than failing silently. All non-final TokenStreams/Tokenizers/TokenFilters should simply include the code snippet above to redeclare next() *and* next(Token) as final (only delegating to super) in the first subclass that implements incrementToken(). Instead of failing silently, users will get runtime linkage errors (when they replace the Lucene jar) or compile errors. We made a similar change in TokenFilter, where we made the delegate stream field final to prevent disturbing the attributes (Mike did this in LUCENE-1636). CHANGES.txt would list this as a BW-break together with the other breaks.

Any comments? Michael, what do you think?
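The redeclare-as-final pattern can be sketched with simplified stand-in classes. To keep the example self-contained, these are not the real Lucene types: the String-based next(), the class names StreamBase/MyTokenizer, and the token counting are invented for illustration, but the delegation structure matches the proposal above.

```java
// Stand-in for the TokenStream base class (simplified, hypothetical).
abstract class StreamBase {
    // New API: subclasses are expected to override this.
    public boolean incrementToken() {
        throw new UnsupportedOperationException("new API not implemented");
    }

    // Old, deprecated API: the default implementation delegates to the new one.
    public String next(String reusableToken) {
        return incrementToken() ? reusableToken : null;
    }
}

// The first concrete subclass that implements incrementToken() redeclares the
// old method as *final*, delegating only to super. Further subclasses that try
// to override next(...) now get a compile error (or a linkage error when the
// jar is swapped) instead of failing silently -- the intended "visible" break.
class MyTokenizer extends StreamBase {
    private int remaining = 3; // emit three dummy tokens

    @Override
    public boolean incrementToken() {
        return remaining-- > 0;
    }

    @Override
    public final String next(String reusableToken) {
        return super.next(reusableToken); // only delegates, cannot be overridden
    }
}

public class DelegationSketch {
    public static void main(String[] args) {
        MyTokenizer ts = new MyTokenizer();
        int count = 0;
        // An old-API caller still works; it is routed through incrementToken().
        while (ts.next("tok") != null) count++;
        System.out.println(count); // prints 3
    }
}
```

The key design point is that the final method adds no behavior of its own; it exists purely to seal the override point while keeping old-API callers functional.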
> AttributeSource/TokenStream API improvements
> --------------------------------------------
>
>                 Key: LUCENE-1693
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1693
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestAPIBackwardsCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java
>
> This patch makes the following improvements to AttributeSource and TokenStream/Filter:
>
> - removes the set/getUseNewAPI() methods (including the standard ones). Instead, by default incrementToken() throws a subclass of UnsupportedOperationException. The indexer initially calls incrementToken() once to see if the exception is thrown; if so, it falls back to the old API.
>
> - introduces interfaces for all Attributes. The corresponding implementations have the postfix 'Impl', e.g. TermAttribute and TermAttributeImpl. AttributeSource now has a factory for creating the Attribute instances; the default implementation looks for implementing classes with the postfix 'Impl'. Token now implements all 6 TokenAttribute interfaces.
>
> - a new method added to AttributeSource: addAttributeImpl(AttributeImpl). Using reflection, it walks up the class hierarchy of the passed-in object and finds all interfaces that the class or its superclasses implement and that extend the Attribute interface. It then adds the interface->instance mappings to the attribute map for each of the found interfaces.
>
> - AttributeImpl now has a default implementation of toString() that uses reflection to print out the values of the attributes in a default formatting. This makes it a bit easier to implement AttributeImpl, because toString() was declared abstract before.
>
> - Cloning is now done much more efficiently in captureState. The method figures out which unique AttributeImpl instances are contained as values in the attributes map, because those are the ones that need to be cloned. It creates a singly linked list that supports deep cloning (in the inner class AttributeSource.State). AttributeSource keeps track of when this state changes, i.e. whenever new attributes are added to the AttributeSource. Only in that case will captureState recompute the state; otherwise it will simply clone the precomputed state and return the clone. restoreState(AttributeSource.State) walks the linked list and uses the copyTo() method of AttributeImpl to copy all values over into the attributes that the source stream (e.g. SinkTokenizer) uses.
>
> Cloning performance can be greatly improved if multiple AttributeImpl instances are not used in one TokenStream. A user can e.g. simply add a Token instance to the stream instead of the individual attributes, or implement a subclass of AttributeImpl that implements exactly the Attribute interfaces needed. I think addAttributeImpl should be considered an expert API, as this manual optimization is only needed if cloning performance is crucial. I ran some quick performance tests using Tee/Sink tokenizers (which do cloning), and performance was roughly 20% faster with the new API. I'll run some more performance tests and post more numbers then.
>
> Note also that when we add serialization to the Attributes, e.g. for supporting storing serialized TokenStreams in the index, the serialization should benefit even more significantly from the new API than cloning does.
>
> Also, the TokenStream API does not change, except for the removal of the set/getUseNewAPI methods. So the patches in LUCENE-1460 should still work.
>
> All core tests pass; however, I need to update all the documentation and also add some unit tests for the new AttributeSource functionality. So this patch is not ready to commit yet, but I wanted to post it already for some feedback.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
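The captureState/restoreState scheme described in the quoted issue text can be sketched in miniature. These are invented stand-in types (AttrImpl, TermAttr, AttrSource are not the real Lucene classes, and the real code deduplicates shared attribute instances and caches the computed state); the sketch only shows the core idea of a deep-clonable linked list of attribute values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for AttributeImpl: cloneable value holder with copyTo().
abstract class AttrImpl implements Cloneable {
    public abstract void copyTo(AttrImpl target);

    @Override
    public AttrImpl clone() {
        try {
            return (AttrImpl) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen, we implement Cloneable
        }
    }
}

// A minimal term attribute for the example.
class TermAttr extends AttrImpl {
    String term = "";

    @Override
    public void copyTo(AttrImpl target) {
        ((TermAttr) target).term = term;
    }
}

// Stand-in for AttributeSource, keyed by implementation class for simplicity.
class AttrSource {
    final Map<Class<?>, AttrImpl> attrs = new LinkedHashMap<>();

    // Singly linked list node holding one cloned attribute (cf. AttributeSource.State).
    static final class State {
        AttrImpl attribute;
        State next;
    }

    // Snapshot current attribute values into a linked list of clones.
    State captureState() {
        State head = null, tail = null;
        for (AttrImpl a : attrs.values()) {
            State s = new State();
            s.attribute = a.clone(); // deep copy of the current values
            if (head == null) head = s; else tail.next = s;
            tail = s;
        }
        return head;
    }

    // Walk the list and copy the captured values back into the live attributes.
    void restoreState(State state) {
        for (State s = state; s != null; s = s.next) {
            s.attribute.copyTo(attrs.get(s.attribute.getClass()));
        }
    }
}

public class StateSketch {
    public static void main(String[] args) {
        AttrSource src = new AttrSource();
        TermAttr term = new TermAttr();
        src.attrs.put(TermAttr.class, term);

        term.term = "lucene";
        AttrSource.State captured = src.captureState(); // snapshot
        term.term = "overwritten";                      // stream moves on
        src.restoreState(captured);                     // e.g. what a sink does
        System.out.println(term.term); // prints "lucene"
    }
}
```

Because the snapshot holds clones rather than references, restoring a captured state is independent of whatever the stream did to the attributes in the meantime, which is exactly what a Tee/Sink pair needs.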