[ https://issues.apache.org/jira/browse/LUCENE-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647337#action_12647337 ]

Michael McCandless commented on LUCENE-1422:
--------------------------------------------

Looks good!

I'm seeing the failure below (in a test I just committed this
morning).  I think it's OK, because the new API is enabled for all
tests while that test's analyzer still uses the old API?

{code}
    [junit] Testcase: testExclusiveLowerNull(org.apache.lucene.search.TestRangeQuery):  Caused an ERROR
    [junit] This token does not have the attribute 'class org.apache.lucene.analysis.tokenattributes.TermAttribute'.
    [junit] java.lang.IllegalArgumentException: This token does not have the attribute 'class org.apache.lucene.analysis.tokenattributes.TermAttribute'.
    [junit]     at org.apache.lucene.util.AttributeSource.getAttribute(AttributeSource.java:124)
    [junit]     at org.apache.lucene.index.TermsHashPerField.start(TermsHashPerField.java:252)
    [junit]     at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:144)
    [junit]     at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
    [junit]     at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
    [junit]     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:760)
    [junit]     at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:738)
    [junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
    [junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
    [junit]     at org.apache.lucene.search.TestRangeQuery.insertDoc(TestRangeQuery.java:304)
    [junit]     at org.apache.lucene.search.TestRangeQuery.initializeIndex(TestRangeQuery.java:287)
    [junit]     at org.apache.lucene.search.TestRangeQuery.testExclusiveLowerNull(TestRangeQuery.java:315)
{code}
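
If I'm reading the trace right, it boils down to something like the
sketch below (not the actual test code; oldApiAnalyzer stands in for
TestRangeQuery's analyzer, and the exact signatures are my
assumption): the indexer does a typed lookup for a TermAttribute that
an old-API-only stream never registered.

{code}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Sketch only (signatures assumed, not compiled against the patch).
class Repro {
  void repro(Analyzer oldApiAnalyzer) throws IOException {
    // The test's analyzer implements just the old next(Token) API, so
    // its stream never registers a TermAttribute ...
    TokenStream stream =
        oldApiAnalyzer.reusableTokenStream("content", new StringReader("some text"));
    stream.reset();
    // ... and the typed lookup that TermsHashPerField.start() does
    // under the new API then throws:
    TermAttribute termAtt =
        (TermAttribute) stream.getAttribute(TermAttribute.class);  // IllegalArgumentException
  }
}
{code}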

Some other random questions:

  * I'm a little confused by 'start' and 'initialize' in TokenStream.
    DocInverterPerField (the consumer of this API) calls
    analyzer.reusableTokenStream to get a token stream, then calls
    stream.reset(), then stream.start(), which in turn calls
    stream.initialize().  Can we consolidate these (there are now 4
    calls involved in "starting" the stream)?
.
    EG why can't analyzer.reusableTokenStream() do the init internally
    in the new API?  (Also, in StopFilter, initialize() sets termAtt &
    posIncrAtt, but I would think this only needs to happen once, when
    that TokenFilter is created?  BackCompatTokenStream adds the attrs
    in its ctor, which seems better; see the sketch after this list.)

  * BackCompatTokenStream calls attributes.put directly, but all the
    others use super's addAttribute.

  * Why is BackCompatTokenStream overriding so many methods?  EG
    has/get/addAttribute -- won't super do the same thing?

  * Maybe add reasons to some of the asserts, eg StopFilter has
    "assert termAtt != null", so maybe append to that ": initialize()
    wasn't called".

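To illustrate the ctor idea from the first bullet, here's roughly
what I have in mind (a sketch only; the class is hypothetical and I'm
assuming the patch's addAttribute/nextToken signatures):

{code}
import java.io.IOException;
import java.util.Set;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Sketch: grab the typed attribute reference once in the ctor (the way
// BackCompatTokenStream does) instead of in initialize().
public final class CtorInitStopFilter extends TokenFilter {
  private final TermAttribute termAtt;
  private final Set stopWords;

  public CtorInitStopFilter(TokenStream input, Set stopWords) {
    super(input);
    this.stopWords = stopWords;
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  }

  public boolean nextToken() throws IOException {
    // skip stopped terms; a real StopFilter would also adjust the
    // position increment for the skipped tokens
    while (input.nextToken()) {
      if (!stopWords.contains(termAtt.term())) {
        return true;
      }
    }
    return false;
  }
}
{code}
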

> New TokenStream API
> -------------------
>
>                 Key: LUCENE-1422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1422
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: lucene-1422-take4.patch, lucene-1422-take5.patch, 
> lucene-1422.patch, lucene-1422.take2.patch, lucene-1422.take3.patch, 
> lucene-1422.take3.patch
>
>
> This is a very early version of the new TokenStream API that 
> we started to discuss here:
> http://www.gossamer-threads.com/lists/lucene/java-dev/66227
> This implementation is a bit different from what I initially
> proposed in the thread above. I introduced a new class called
> AttributedToken, which contains the same termBuffer logic 
> as Token. In addition it has a lazily-initialized map of
> Class<? extends Attribute> -> Attribute. Attribute is also a
> new class in a new package, plus several implementations like
> PositionIncrementAttribute, PayloadAttribute, etc.
> Similar to my initial proposal is the prototypeToken() method
> which the consumer (e.g. DocumentsWriter) needs to call.
> The token is created by the tokenizer at the end of the chain
> and pushed through all filters to the end consumer. The 
> tokenizer and also all filters can add Attributes to the 
> token and can keep references to the actual types of the
> attributes that they need to read or modify. This way, when
> boolean nextToken() is called, no casting is necessary.
> I added a class called TestNewTokenStreamAPI, which is not 
> really a test case yet, but has a static demo() method, which
> demonstrates how to use the new API.
> The reason not to merge Token and TokenStream into one class 
> is that we might have caching (or tee/sink) filters in the 
> chain that might want to store cloned copies of the tokens
> in a cache. I added a new class NewCachingTokenStream that
> shows how such a class could work. I also implemented a deep
> clone method in AttributedToken and a 
> copyFrom(AttributedToken) method, which is needed for the 
> caching. Both methods have to iterate over the list of 
> attributes. The Attribute subclasses themselves also have a
> copyFrom(Attribute) method, which unfortunately has to down-
> cast to the actual type. I first thought that might be very
> inefficient, but it's not so bad. Well, if you add all
> Attributes to the AttributedToken that our old Token class
> had (like offsets, payload, posIncr), then the performance
> of the caching is somewhat slower (~40%). However, if you 
> add fewer attributes, because not all might be needed, then
> the performance is even slightly faster than with the old API.
> Also the new API is flexible enough so that someone could
> implement a custom caching filter that knows all attributes
> the token can have; then the caching should be just as 
> fast as with the old API.
> This patch is not nearly ready; there are lots of things 
> missing:
> - unit tests
> - change DocumentsWriter to use new API 
>   (in backwards-compatible fashion)
> - patch is currently java 1.5; need to change before 
>   committing to 2.9
> - all TokenStreams and -Filters should be changed to use 
>   new API
> - javadocs incorrect or missing
> - hashCode and equals methods missing in Attributes and 
>   AttributedToken
>   
> I wanted to submit it already for brave people to give me 
> early feedback before I spend more time working on this.

