[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720052#action_12720052
 ] 

Uwe Schindler commented on LUCENE-1693:
---------------------------------------

{quote}
bq. What was your conclusion about my idea yesterday, to pass the Token around 
even with the old API and copy it on demand

I don't think the indexer should know about Token at all. It should only use 
the interfaces so that we can maintain full flexibility.
Also, I really don't like that some users might get a performance hit. I had 
the idea to throw the exception in incrementToken() to be able to fall back to 
the old API automatically. I think this is nice and gets rid of the explicit 
useNewAPI methods. The only drawback is still that we have to implement both 
the old and new APIs in Lucene's tokenizers and filters until we remove the old 
API. Grant won't like this, but I think it is better than the possible 
performance hit. Also, we don't add new filters THAT often to Lucene, and 
implementing both methods is mostly copy & paste.
{quote}

I did not mean to use Token in the indexer; I would like to remove the whole 
old API from Token (even the UOE). My idea would be the following:
- Let Token implement the default interfaces; the default factory creates it 
for all requests for the default attributes.
- If somebody implements the new API, the indexer can use it without problems. 
If they don't, the default impl in TokenStream would call next(Token), using 
the Token instance from the AttributeSource. If the method returns another 
Token instance (because it did not reuse, which I think is seldom), copy the 
returned token into the per-instance AttributeSource Token.
- The other way around: if one uses a TokenStream through the old API (own 
indexer, query parser, ...) while the TokenStream only implements the new API, 
the deprecated old method would also have a default impl that ignores the 
token supplied to next() and always returns the instance token after calling 
incrementToken().
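A minimal sketch of these bridging defaults (hypothetical names and a simplified Token; not the actual Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a simplified Token and a stream base class whose
// default incrementToken() wraps the old next(Token), and whose default
// next(Token) wraps the new incrementToken().
class Token {
  String term;
  void copyFrom(Token other) { this.term = other.term; }
}

abstract class SketchTokenStream {
  // per-instance token, standing in for the AttributeSource-backed Token
  final Token reusableToken = new Token();

  /** New API; default falls back to the old next(Token). */
  boolean incrementToken() {
    Token t = next(reusableToken);
    if (t == null) return false;
    // the stream did not reuse our token (seldom): copy the returned one over
    if (t != reusableToken) reusableToken.copyFrom(t);
    return true;
  }

  /** Old API; default ignores the supplied token and wraps the new API.
      Note: if a subclass overrides neither method, these two defaults
      would recurse into each other. */
  Token next(Token result) {
    return incrementToken() ? reusableToken : null;
  }
}

// A stream that only implements the *old* API still works via incrementToken().
class OldApiStream extends SketchTokenStream {
  private final String[] terms = {"foo", "bar"};
  private int i = 0;
  @Override Token next(Token result) {
    if (i >= terms.length) return null;
    result.term = terms[i++]; // reuses the passed-in token
    return result;
  }
}

public class BridgeSketch {
  public static void main(String[] args) {
    SketchTokenStream ts = new OldApiStream();
    List<String> seen = new ArrayList<>();
    while (ts.incrementToken())      // the indexer only ever calls the new API
      seen.add(ts.reusableToken.term);
    System.out.println(seen);        // [foo, bar]
  }
}
```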

Because of this, the indexer would always use the new API; the old API is just 
wrapped. Core analyzers only need to implement the new methods (or could even 
stay with the old ones).

There are two problems:
- If somebody implements neither method, the call to incrementToken() will 
loop endlessly; there should be some early break with a UOE in this case. I do 
not know how to implement this without inspecting the stack trace in the 
default methods.
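One conceivable way to break the mutual delegation without inspecting stack traces is a simple re-entrance flag in the default methods (a hypothetical sketch, not a proposal for the final API; the old next(Token) is simplified to a no-argument next()):

```java
// Hypothetical guard: each default implementation sets a flag before
// delegating to its counterpart. If a default is re-entered while the flag
// is still set, neither method was overridden, so we fail fast with an
// UnsupportedOperationException instead of recursing forever.
class BrokenStreamDetector {
  static abstract class GuardedStream {
    private boolean inDefault = false;

    boolean incrementToken() {
      if (inDefault)
        throw new UnsupportedOperationException(
            "subclass overrides neither incrementToken() nor next()");
      inDefault = true;
      try { return next() != null; } finally { inDefault = false; }
    }

    String next() { // stands in for the deprecated next(Token)
      if (inDefault)
        throw new UnsupportedOperationException(
            "subclass overrides neither incrementToken() nor next()");
      inDefault = true;
      try { return incrementToken() ? "token" : null; } finally { inDefault = false; }
    }
  }

  public static void main(String[] args) {
    GuardedStream broken = new GuardedStream() {}; // overrides nothing
    try {
      broken.incrementToken();
      System.out.println("no exception");
    } catch (UnsupportedOperationException e) {
      System.out.println("UOE thrown"); // fails fast instead of looping
    }
  }
}
```

A subclass that overrides either method never re-enters a default with the flag set, so normal streams are unaffected.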
- Filters are a bit tricky: by default, they could delegate both methods to 
the wrapped stream. The problem: if one only implements one of the methods, 
the other passes down to the delegate and does nothing. But this could be 
fixed by having the deprecated old method always delegate to the new one (or 
vice versa).
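Sketched with hypothetical names (a simplified stream instead of the real TokenStream/TokenFilter), the fix means the deprecated method always routes through the filter's own new method, never straight to the input:

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical sketch of the filter fix: the deprecated next() always
// delegates to this.incrementToken(), so a subclass that overrides only
// the new method is never bypassed when a consumer still calls the old API.
class FilterDelegationSketch {
  static class SimpleStream {
    final Iterator<String> it = Arrays.asList("foo", "bar").iterator();
    String term;
    boolean incrementToken() {
      if (!it.hasNext()) return false;
      term = it.next();
      return true;
    }
    // old API: the default always wraps the new one
    String next() { return incrementToken() ? term : null; }
  }

  static class UppercaseFilter extends SimpleStream {
    final SimpleStream input;
    UppercaseFilter(SimpleStream input) { this.input = input; }
    // the filter overrides only the *new* method ...
    @Override boolean incrementToken() {
      if (!input.incrementToken()) return false;
      term = input.term.toUpperCase();
      return true;
    }
    // ... and the inherited next() still goes through it, so old-API
    // consumers see the filtered tokens, not the raw input tokens.
  }

  public static void main(String[] args) {
    UppercaseFilter f = new UppercaseFilter(new SimpleStream());
    System.out.println(f.next() + " " + f.next() + " " + f.next()); // FOO BAR null
  }
}
```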

I hope this was clear; maybe I should create a patch.

> AttributeSource/TokenStream API improvements
> --------------------------------------------
>
>                 Key: LUCENE-1693
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1693
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: lucene-1693.patch
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - removes the set/getUseNewAPI() methods (including the standard
>   ones). Instead by default incrementToken() throws a subclass of
>   UnsupportedOperationException. The indexer tries to call
>   incrementToken() initially once to see if the exception is thrown;
>   if so, it falls back to the old API.
> - introduces interfaces for all Attributes. The corresponding
>   implementations have the postfix 'Impl', e.g. TermAttribute and
>   TermAttributeImpl. AttributeSource now has a factory for creating
>   the Attribute instances; the default implementation looks for
>   implementing classes with the postfix 'Impl'. Token now implements
>   all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
>   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
>   class hierarchy of the passed in object and finds all interfaces
>   that the class or superclasses implement and that extend the
>   Attribute interface. It then adds the interface->instance mappings
>   to the attribute map for each of the found interfaces.
> - AttributeImpl now has a default implementation of toString that uses
>   reflection to print out the values of the attributes in a default
>   formatting. This makes it a bit easier to implement AttributeImpl,
>   because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
>   captureState. The method figures out which unique AttributeImpl
>   instances are contained as values in the attributes map, because
>   those are the ones that need to be cloned. It creates a single
>   linked list that supports deep cloning (in the inner class
>   AttributeSource.State). AttributeSource keeps track of when this
>   state changes, i.e. whenever new attributes are added to the
>   AttributeSource. Only in that case will captureState recompute the
>   state, otherwise it will simply clone the precomputed state and
>   return the clone. restoreState(AttributeSource.State) walks the
>   linked list and uses the copyTo() method of AttributeImpl to copy
>   all values over into the attribute that the source stream
>   (e.g. SinkTokenizer) uses. 
> Cloning performance can be improved greatly if multiple AttributeImpl
> instances are not used in one TokenStream. A user can
> e.g. simply add a Token instance to the stream instead of the individual
> attributes. Or the user could implement a subclass of AttributeImpl that
> implements exactly the Attribute interfaces needed. I think this
> should be considered an expert API (addAttributeImpl), as this manual
> optimization is only needed if cloning performance is crucial. I ran
> some quick performance tests using Tee/Sink tokenizers (which do
> cloning) and the performance was roughly 20% faster with the new
> API. I'll run some more performance tests and post more numbers then.
> Note also that when we add serialization to the Attributes, e.g. for
> supporting storing serialized TokenStreams in the index, the
> serialization should benefit even more from the new API than cloning
> does. 
> Also, the TokenStream API does not change, except for the removal 
> of the set/getUseNewAPI methods. So the patches in LUCENE-1460
> should still work.
> All core tests pass, however, I need to update all the documentation
> and also add some unit tests for the new AttributeSource
> functionality. So this patch is not ready to commit yet, but I wanted
> to post it already for some feedback. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

