[ https://issues.apache.org/jira/browse/LUCENE-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647337#action_12647337 ]
Michael McCandless commented on LUCENE-1422:
--------------------------------------------

Looks good!

I'm seeing this failure (in a test I just committed this AM). I think it's OK, because the new API is enabled for all tests and I'm using the old API with that analyzer?

{code}
[junit] Testcase: testExclusiveLowerNull(org.apache.lucene.search.TestRangeQuery): Caused an ERROR
[junit] This token does not have the attribute 'class org.apache.lucene.analysis.tokenattributes.TermAttribute'.
[junit] java.lang.IllegalArgumentException: This token does not have the attribute 'class org.apache.lucene.analysis.tokenattributes.TermAttribute'.
[junit]     at org.apache.lucene.util.AttributeSource.getAttribute(AttributeSource.java:124)
[junit]     at org.apache.lucene.index.TermsHashPerField.start(TermsHashPerField.java:252)
[junit]     at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:144)
[junit]     at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
[junit]     at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
[junit]     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:760)
[junit]     at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:738)
[junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
[junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
[junit]     at org.apache.lucene.search.TestRangeQuery.insertDoc(TestRangeQuery.java:304)
[junit]     at org.apache.lucene.search.TestRangeQuery.initializeIndex(TestRangeQuery.java:287)
[junit]     at org.apache.lucene.search.TestRangeQuery.testExclusiveLowerNull(TestRangeQuery.java:315)
{code}

Some other random questions:

* I'm a little confused by 'start' and 'initialize' in TokenStream. DocInverterPerField (the consumer of this API) calls analyzer.reusableTokenStream to get a token stream. Then it calls stream.reset().
Then it calls stream.start(), which then calls stream.initialize(). Can we consolidate these (there are now 4 places where we "start" the stream)? E.g., why can't analyzer.reusableTokenStream() do the init internally in the new API? (Also, in StopFilter, initialize() sets termAtt & posIncrAtt, but I would think this only needs to happen once, when that TokenFilter is created? BackCompatTokenStream adds the attrs in its ctor, which seems better.)
* BackCompatTokenStream is calling attributes.put directly, but all others use super's addAttribute.
* Why is BackCompatTokenStream overriding so many methods? E.g. has/get/addAttribute -- won't super do the same thing?
* Maybe add reasons to some of the asserts; e.g. StopFilter has "assert termAtt != null", so maybe append to that ": initialize() wasn't called".

> New TokenStream API
> -------------------
>
>                 Key: LUCENE-1422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1422
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: lucene-1422-take4.patch, lucene-1422-take5.patch, lucene-1422.patch, lucene-1422.take2.patch, lucene-1422.take3.patch, lucene-1422.take3.patch
>
>
> This is a very early version of the new TokenStream API that we started to discuss here:
> http://www.gossamer-threads.com/lists/lucene/java-dev/66227
>
> This implementation is a bit different from what I initially proposed in the thread above. I introduced a new class called AttributedToken, which contains the same termBuffer logic from Token. In addition it has a lazily-initialized map of Class<? extends Attribute> -> Attribute. Attribute is also a new class in a new package, plus several implementations like PositionIncrementAttribute, PayloadAttribute, etc.
>
> Similar to my initial proposal is the prototypeToken() method, which the consumer (e.g. DocumentsWriter) needs to call.
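[For reference, the lazily-initialized Class -> Attribute map described above could be modeled roughly like this. This is a self-contained sketch of the idea, not the actual patch; names like TermAttribute follow the description, the rest is assumed. It also illustrates why a getAttribute() lookup throws the IllegalArgumentException seen in the stack trace above when the attribute was never added, and why grabbing attributes once in a filter's ctor (as BackCompatTokenStream does) is safe:]

{code}
import java.util.HashMap;
import java.util.Map;

// Self-contained model of the lazily-initialized
// Class<? extends Attribute> -> Attribute map described in the issue.
// A sketch of the design, not the actual Lucene patch.
public class AttributeSourceModel {

  interface Attribute {}

  // analogous to the patch's TermAttribute; the real one holds the termBuffer
  static class TermAttribute implements Attribute {
    String term = "";
  }

  private final Map<Class<? extends Attribute>, Attribute> attributes =
      new HashMap<Class<? extends Attribute>, Attribute>();

  // lazily creates the attribute on first request; later calls return the
  // same instance, so a filter can acquire it once in its constructor
  public <A extends Attribute> A addAttribute(Class<A> clazz) {
    A att = clazz.cast(attributes.get(clazz));
    if (att == null) {
      try {
        att = clazz.newInstance();
      } catch (Exception e) {
        throw new IllegalArgumentException(e);
      }
      attributes.put(clazz, att);
    }
    return att;
  }

  // only returns attributes that were previously added; this is the
  // failure mode behind the IllegalArgumentException in the stack trace
  public <A extends Attribute> A getAttribute(Class<A> clazz) {
    Attribute att = attributes.get(clazz);
    if (att == null) {
      throw new IllegalArgumentException(
          "This token does not have the attribute '" + clazz + "'.");
    }
    return clazz.cast(att);
  }

  public static void main(String[] args) {
    AttributeSourceModel src = new AttributeSourceModel();
    TermAttribute a = src.addAttribute(TermAttribute.class);
    TermAttribute b = src.addAttribute(TermAttribute.class);
    System.out.println(a == b);   // prints "true": same instance reused
  }
}
{code}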
> The token is created by the tokenizer at the end of the chain and pushed through all filters to the end consumer. The tokenizer and also all filters can add Attributes to the token and can keep references to the actual types of the attributes that they need to read or modify. This way, when boolean nextToken() is called, no casting is necessary.
>
> I added a class called TestNewTokenStreamAPI which is not really a test case yet, but has a static demo() method which demonstrates how to use the new API.
>
> The reason not to merge Token and TokenStream into one class is that we might have caching (or tee/sink) filters in the chain that might want to store cloned copies of the tokens in a cache. I added a new class NewCachingTokenStream that shows how such a class could work. I also implemented a deep clone method in AttributedToken and a copyFrom(AttributedToken) method, which is needed for the caching. Both methods have to iterate over the list of attributes. The Attribute subclasses themselves also have a copyFrom(Attribute) method, which unfortunately has to down-cast to the actual type. I first thought that might be very inefficient, but it's not so bad. Well, if you add all the Attributes to the AttributedToken that our old Token class had (like offsets, payload, posIncr), then the performance of the caching is somewhat slower (~40%). However, if you add fewer attributes, because not all might be needed, then the performance is even slightly faster than with the old API.
>
> Also the new API is flexible enough that someone could implement a custom caching filter that knows all the attributes the token can have; then the caching should be just as fast as with the old API.
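[The copyFrom(Attribute) down-cast mentioned above can be illustrated with a small self-contained sketch. Again, this models the description, not the actual patch; the field and class names are assumptions for illustration:]

{code}
// Minimal model of the per-attribute copyFrom(Attribute) used by the
// caching filters described in the issue. A sketch, not the real patch.
public class CopyFromModel {

  interface Attribute {
    // copies state from another instance of the same concrete type
    void copyFrom(Attribute other);
  }

  static class PositionIncrementAttribute implements Attribute {
    int positionIncrement = 1;

    public void copyFrom(Attribute other) {
      // the down-cast the description calls unfortunate: the interface
      // can't express "same concrete type as this", so we cast
      PositionIncrementAttribute o = (PositionIncrementAttribute) other;
      this.positionIncrement = o.positionIncrement;
    }
  }

  public static void main(String[] args) {
    PositionIncrementAttribute src = new PositionIncrementAttribute();
    src.positionIncrement = 3;

    PositionIncrementAttribute copy = new PositionIncrementAttribute();
    copy.copyFrom(src);   // a caching filter would do this per attribute

    System.out.println(copy.positionIncrement);   // prints "3"
  }
}
{code}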
> This patch is not nearly ready; there are lots of things missing:
> - unit tests
> - change DocumentsWriter to use the new API (in a backwards-compatible fashion)
> - the patch is currently Java 1.5; need to change before committing to 2.9
> - all TokenStreams and -Filters should be changed to use the new API
> - javadocs incorrect or missing
> - hashCode() and equals() methods missing in Attributes and AttributedToken
>
> I wanted to submit it already for brave people to give me early feedback before I spend more time working on this.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]