[ https://issues.apache.org/jira/browse/LUCENE-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644881#action_12644881 ]

Michael Busch commented on LUCENE-1422:
---------------------------------------

Thanks for the thorough review, Mike! I made most of the changes you suggested.
However, I have a few comments on some of your points:

{quote}
I assume we would statically default TokenStream.useNewAPI to
true?
{quote}

I don't think it should default to true. The way it currently works (and 
this also answers some of your other questions) is that you can't mix old 
and new streams and filters in the same chain. If someone enables the new 
API, then *all* streams and filters in one chain have to implement the new 
API. The reason is that there is no efficient way to "simulate" the new 
API on top of an old stream: we would have to copy all values from the 
Token that next(Token) returns into the appropriate Attributes.
That would slow down ingestion performance, and I think it would also 
affect backwards-compatibility: we guarantee that you can upgrade from 
2.4 to 2.9 without compile or runtime errors, but I think performance 
should also not decrease significantly. That's the main reason why I left 
two implementations in all core streams and filters, one for next(Token) 
and one for incrementToken().
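
To make that cost concrete, here is a simplified sketch of what such a 
"simulating" wrapper would have to do on every call (the attribute class 
and method names here are illustrative, not necessarily exactly those in 
the patch):

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
// the attribute classes live in a new package; the exact package and
// class names here are illustrative
import org.apache.lucene.analysis.tokenattributes.*;

/**
 * Sketch: exposes the new attribute-based API on top of a stream that
 * only implements the old next(Token) API.
 */
class OldToNewAPIWrapper extends TokenFilter {

  private final Token reusableToken = new Token();
  private final TermAttribute termAtt;
  private final OffsetAttribute offsetAtt;
  private final PositionIncrementAttribute posIncrAtt;
  private final PayloadAttribute payloadAtt;

  OldToNewAPIWrapper(TokenStream input) {
    super(input);
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    offsetAtt = (OffsetAttribute) addAttribute(OffsetAttribute.class);
    posIncrAtt = (PositionIncrementAttribute) addAttribute(PositionIncrementAttribute.class);
    payloadAtt = (PayloadAttribute) addAttribute(PayloadAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    Token token = input.next(reusableToken);
    if (token == null) {
      return false;
    }
    // every value has to be copied over on every call, whether or not
    // any consumer downstream actually needs it
    termAtt.setTermBuffer(token.termBuffer(), 0, token.termLength());
    offsetAtt.setOffset(token.startOffset(), token.endOffset());
    posIncrAtt.setPositionIncrement(token.getPositionIncrement());
    payloadAtt.setPayload(token.getPayload());
    return true;
  }
}
{code}

Even with reusable Token and attribute instances, those copies happen for 
every single token, which is where the ingestion slowdown would come from.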

{quote}
Are DocInverter.BackwardsCompatibilityStream and
TokenStreamTestUtils.BackwardsCompatibleStream basically the same
thing? If so, can we merge them? It seems like others may need
this (when mixing old/new analyzer chains - the question above).
{quote}

Sorry for the confusing naming of these classes. They are not the same. The 
one in DocInverter is basically a wrapper that exposes Attributes around an 
old Token; it is used so that almost all consumers in the indexer package 
can already use the new API in an efficient way.
The one in the test package is used so that TokenStream and -Filter 
implementations in our test cases don't have to implement both the old and 
the new API. If the old next(Token) is called, it calls incrementToken() 
and copies the values over from the Attributes to the Token. Since 
performance is not critical in test cases, I took this approach to reduce 
code duplication in the tests.
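
The relevant method of that test wrapper looks roughly like this (a 
sketch, assuming the filter holds the same typed attribute references as 
in the sketch above):

{code:java}
// serves old-API callers from a stream that only implements the new
// API; only used in tests, where the extra copying doesn't matter
public Token next(Token reusableToken) throws IOException {
  if (!incrementToken()) {
    return null;
  }
  // copy the attribute values back into the old-style Token
  reusableToken.clear();
  reusableToken.setTermBuffer(termAtt.termBuffer(), 0, termAtt.termLength());
  reusableToken.setStartOffset(offsetAtt.startOffset());
  reusableToken.setEndOffset(offsetAtt.endOffset());
  reusableToken.setPositionIncrement(posIncrAtt.getPositionIncrement());
  reusableToken.setPayload(payloadAtt.getPayload());
  return reusableToken;
}
{code}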

{quote}
Many unit tests had to be changed with this patch. Was this entirely due to 
avoiding the newly deprecated APIs, or, are there any tests that fail if they 
were not changed? (This is a simple check to run, actually - only apply your 
core changes, then run all tests). If it's the former (which it should be), 
maybe we should leave a few tests using the deprecated APIs to ensure we don't 
break them before 3.0?
{quote}

That's actually the reason why I want to keep the static setUseNewAPI() 
method. When I test this patch I run all tests twice, once in each mode. 
That way I can be sure I'm not breaking backwards compatibility.
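
The pattern in each test run is then essentially this (doTestAnalysis() is 
just a placeholder for whatever the individual test asserts):

{code:java}
public void testWithBothAPIs() throws Exception {
  TokenStream.setUseNewAPI(false);
  doTestAnalysis();   // exercises the old next(Token) code path

  TokenStream.setUseNewAPI(true);
  doTestAnalysis();   // same assertions against incrementToken()
}
{code}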

{quote}
On the new TokenStream.start method: is a TokenStream allowed to
not support more than one start invocation? Meaning, it cannot repeat
itself. Just like we are grappling with on DocIdSet.iterator(),
it'd be good to address these semantics up front.
{quote}

Not sure I follow you here. Could you elaborate?

> New TokenStream API
> -------------------
>
>                 Key: LUCENE-1422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1422
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: lucene-1422-take4.patch, lucene-1422.patch, 
> lucene-1422.take2.patch, lucene-1422.take3.patch, lucene-1422.take3.patch
>
>
> This is a very early version of the new TokenStream API that 
> we started to discuss here:
> http://www.gossamer-threads.com/lists/lucene/java-dev/66227
> This implementation is a bit different from what I initially
> proposed in the thread above. I introduced a new class called
> AttributedToken, which contains the same termBuffer logic 
> from Token. In addition it has a lazily-initialized map of
> Class<? extends Attribute> -> Attribute. Attribute is also a
> new class in a new package, plus several implementations like
> PositionIncrementAttribute, PayloadAttribute, etc.
> Similar to my initial proposal is the prototypeToken() method
> which the consumer (e.g. DocumentsWriter) needs to call.
> The token is created by the tokenizer at the end of the chain
> and pushed through all filters to the end consumer. The 
> tokenizer and also all filters can add Attributes to the 
> token and can keep references to the actual types of the
> attributes that they need to read or modify. This way, when
> boolean nextToken() is called, no casting is necessary (see the
> sketch after this description).
> I added a class called TestNewTokenStreamAPI which is not 
> really a test case yet, but has a static demo() method, which
> demonstrates how to use the new API.
> The reason to not merge Token and TokenStream into one class 
> is that we might have caching (or tee/sink) filters in the 
> chain that might want to store cloned copies of the tokens
> in a cache. I added a new class NewCachingTokenStream that
> shows how such a class could work. I also implemented a deep
> clone method in AttributedToken and a 
> copyFrom(AttributedToken) method, which is needed for the 
> caching. Both methods have to iterate over the list of 
> attributes. The Attribute subclasses themselves also have a
> copyFrom(Attribute) method, which unfortunately has to down-
> cast to the actual type. I first thought that might be very
> inefficient, but it's not so bad. Well, if you add all
> Attributes to the AttributedToken that our old Token class
> had (like offsets, payload, posIncr), then the performance
> of the caching is somewhat slower (~40%). However, if you 
> add fewer attributes, because not all might be needed, then
> the performance is even slightly faster than with the old API.
> Also the new API is flexible enough so that someone could
> implement a custom caching filter that knows all attributes
> the token can have, then the caching should be just as 
> fast as with the old API.
> This patch is not nearly ready; there are lots of things 
> missing:
> - unit tests
> - change DocumentsWriter to use new API 
>   (in backwards-compatible fashion)
> - patch is currently java 1.5; need to change before 
>   committing to 2.9
> - all TokenStreams and -Filters should be changed to use 
>   new API
> - javadocs incorrect or missing
> - hashCode and equals methods missing in Attributes and 
>   AttributedToken
>   
> I wanted to submit it early so that brave people can give me 
> feedback before I spend more time working on this.
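
To make the design in the description above concrete, here is a minimal, 
simplified sketch of what a filter could look like under this early 
proposal (class and method names follow the description - AttributedToken, 
prototypeToken(), nextToken(), the typed attribute references - but the 
exact signatures are illustrative):

{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// imports of AttributedToken, Attribute etc. omitted; their packages
// are still in flux in the patch

/** Sketch of a stop filter under the proposed API. */
class SketchedStopFilter extends TokenFilter {

  private final Set<String> stopWords = new HashSet<String>();
  private AttributedToken token;
  private PositionIncrementAttribute posIncrAtt;

  SketchedStopFilter(TokenStream input) {
    super(input);
    stopWords.add("the");
    stopWords.add("a");
  }

  public AttributedToken prototypeToken() throws IOException {
    // the token is created by the tokenizer at the end of the chain
    token = input.prototypeToken();
    // keep a reference to the actual attribute type, so that
    // nextToken() below involves no casting
    posIncrAtt = (PositionIncrementAttribute)
        token.addAttribute(PositionIncrementAttribute.class);
    return token;
  }

  public boolean nextToken() throws IOException {
    int skipped = 0;
    while (input.nextToken()) {
      // the termBuffer logic lives on the AttributedToken itself
      String term = new String(token.termBuffer(), 0, token.termLength());
      if (!stopWords.contains(term)) {
        // account for the positions of any dropped tokens
        posIncrAtt.setPositionIncrement(
            posIncrAtt.getPositionIncrement() + skipped);
        return true;
      }
      skipped += posIncrAtt.getPositionIncrement();
    }
    return false;
  }
}
{code}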
