[ https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742012#action_12742012 ]

Uwe Schindler commented on LUCENE-1801:
---------------------------------------

There is an additional problem (mentioned above):
clear() is currently defined in the Attribute interface, but clearing is 
really the task of AttributeImpl. The problem is that e.g. a TokenFilter 
that adds a new type of attribute must be able to clear it. On the other 
hand, Token, which implements *all* attributes, could also provide clear(). 
This creates an inconsistency: clear() should be removed from the general 
Attribute interface and moved down to each separate interface under a 
separate name. E.g. somebody who calls TermAttribute.clear() and expects it 
to clear only the term attribute may be wrong, if the actual implementation 
is Token, which clears everything.

The biggest problem is backwards compatibility. Lucene 2.4.1 states in JavaDocs 
of Token: "public void clear(): Resets the term text, payload, flags, and 
positionIncrement to default. Other fields such as startOffset, endOffset and 
the token type are not reset since they are normally overwritten by the 
tokenizer."

I would propose to change the whole thing:
- Remove clear() from the superinterface Attribute.
- Let Token do what it is used to doing (as of 2.4.1).
- Define separate clearTerm(), clearPositionIncrement(), clearFlags() 
methods, one per attribute type (Token/TokenWrapper must implement them).
- clear() is only defined in AttributeImpl and clears the whole 
implementation; AttributeSource.clearAttributes() calls this method. Current 
code calling clear() on the attribute interface will fail to compile, but 
these are exactly the places that must be fixed.
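To make the proposal concrete, here is a minimal, self-contained Java sketch of the proposed split. The interface names follow Lucene's, but all bodies are simplified stand-ins I made up for illustration, not the actual Lucene code:

```java
// Hypothetical sketch of the proposed design; simplified stand-ins
// for Lucene's actual classes.

// Superinterface without clear(): marker only.
interface Attribute {}

// Per-attribute interfaces get their own, unambiguous reset methods.
interface TermAttribute extends Attribute {
  void setTerm(String term);
  String term();
  void clearTerm();                 // clears only the term
}

interface PositionIncrementAttribute extends Attribute {
  void setPositionIncrement(int posIncr);
  int getPositionIncrement();
  void clearPositionIncrement();    // resets only the increment (to 1)
}

// clear() lives only on the implementation base class and resets the
// whole implementation; AttributeSource.clearAttributes() calls it.
abstract class AttributeImpl implements Attribute {
  public abstract void clear();
}

// An impl backing several attributes at once, like Token does:
// its clear() resets everything it implements.
class TermAndPosIncrImpl extends AttributeImpl
    implements TermAttribute, PositionIncrementAttribute {
  private String term = "";
  private int posIncr = 1;

  public void setTerm(String term) { this.term = term; }
  public String term() { return term; }
  public void clearTerm() { term = ""; }

  public void setPositionIncrement(int posIncr) { this.posIncr = posIncr; }
  public int getPositionIncrement() { return posIncr; }
  public void clearPositionIncrement() { posIncr = 1; }

  @Override
  public void clear() {             // full reset of this implementation
    clearTerm();
    clearPositionIncrement();
  }
}
```

With this split, TermAttribute.clearTerm() can never surprise the caller by resetting unrelated state, while AttributeSource.clearAttributes() still gets a full reset via AttributeImpl.clear().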

The problem of backwards compatibility can be solved the following way:
- TokenWrapper clears the complete delegate Token, to be consistent with 
AttributeSource.clearAttributes(): a complete reset to default values, 
including the offset. A problem only occurs if somebody registers Token 
directly as an Attribute (not the wrapper around it); then clearAttributes() 
would not be consistent with the rest, as it would fail to clear the offset.

How will we handle the clearAttributes() call in Tokenizers then? Should we 
only clear those attributes we work on in a Tokenizer/TokenFilter?
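One possible answer, sketched below with toy stand-ins (not Lucene's real AttributeSource/Tokenizer classes, just enough code to show the contract): the source Tokenizer calls clearAttributes() first on every token, then sets only the attributes it actually produces, so no stale filter state can leak into the next token.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-ins illustrating the contract "the Tokenizer resets all
// attributes, filters only touch what they own". Hypothetical code,
// not the real Lucene classes.
abstract class AttributeImpl {
  public abstract void clear();
}

class TermAttr extends AttributeImpl {
  String term = "";
  @Override public void clear() { term = ""; }
}

class PosIncrAttr extends AttributeImpl {
  int posIncr = 1;
  @Override public void clear() { posIncr = 1; }
}

class AttributeSource {
  final List<AttributeImpl> attrs = new ArrayList<>();
  // Resets every registered attribute implementation to its defaults.
  void clearAttributes() { for (AttributeImpl a : attrs) a.clear(); }
}

// A whitespace "tokenizer" sketch: clears *all* attributes first,
// then sets only the ones it actually produces.
class ToyTokenizer extends AttributeSource {
  final TermAttr termAtt = new TermAttr();
  final PosIncrAttr posIncrAtt = new PosIncrAttr();
  private final String[] tokens;
  private int pos = 0;

  ToyTokenizer(String input) {
    attrs.add(termAtt);
    attrs.add(posIncrAtt);
    tokens = input.split("\\s+");
  }

  boolean incrementToken() {
    if (pos >= tokens.length) return false;
    clearAttributes();        // full reset first, per the contract
    termAtt.term = tokens[pos++];
    // posIncr stays at its default of 1; a downstream filter that set
    // it to 0 for the previous token cannot leak that value here.
    return true;
  }
}
```

A filter that set posIncrAtt.posIncr = 0 for a stacked synonym would not have to remember to reset it; the source's clearAttributes() call restores the default on the next token.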

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> -----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1801
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1801
>             Project: Lucene - Java
>          Issue Type: Task
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 2.9
>
>
> This is a followup for LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched 
> to the producer here: LUCENE-1101 
> I don't know if all of the Tokenizers in lucene were ever changed, but in any 
> case it looks like at least some of these bugs were introduced with the 
> switch to the attribute API - for example StandardTokenizer did clear its 
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call 
> clearAttributes() first, we could do this in the indexer, which would be 
> overhead for old token streams that themselves clear their reusable token. 
> This issue should also update the Javadocs, to clearly state inside 
> Tokenizer.java that the source TokenStream (normally the Tokenizer) should 
> clear *all* Attributes. If it does not do so and e.g. the positionIncrement 
> is changed to 0 by some TokenFilter, but the filter does not change it back 
> to 1, the TokenStream would stay at 0. If the TokenFilter called 
> PositionIncrementAttribute.clear() instead (because it is responsible), it 
> could also break the TokenStream, because clear() is a general method for 
> the whole attribute instance. If e.g. Token is used as the AttributeImpl, a 
> call to clear() would also clear offsets and termLength, which is not 
> wanted. So the source of the tokenization should reset the attributes to 
> default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes() should 
> run fast, but it is an additional cost during tokenization. As it was not 
> done consistently before, a small speed degradation is caused by this, but 
> it has nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

