[ 
https://issues.apache.org/jira/browse/LUCENE-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646146#action_12646146
 ] 

Michael Busch commented on LUCENE-1422:
---------------------------------------

{quote}
    * In DocInverterPerField you check if the attrSource has posIncr &
      offset attrs, instead of just looking them up and getting the
      default instance if it wasn't already there. This then requires
      you to default posIncr & offsets in two places -
      DocInverterPerField & each of the attr classes. Wouldn't it be
      cleaner to always look them up, and let the default instance of
      each be the one place that holds the default?
{quote}

So here's one thing about the new API that is not entirely clear to me yet. 
Let's say the consumer, in this case DocInverterPerField, wants to use the 
offset. You are right, it's not desirable to define default values in two 
places (the consumer and the attribute). Now the other alternative is to call 
addAttribute(OffsetAttribute.class) in the consumer. If no offset attribute is 
present, this will just add a default instance and the consumer can use its 
default value. So far, so good. But what if there is, e.g., a TokenFilter in the 
chain that checks input.hasAttribute(OffsetAttribute.class) and does something 
it would not do if that check returned false? That "something" is not clearly 
defined here yet. I think the "something" is the difference between a producer 
(TokenStream/Filter) adding an attribute vs. a consumer adding one. Thoughts?

Hmm or maybe in this case the consumer is not behaving correctly. If there is 
no offset attribute, then the consumer should not have to write any data 
structures that use an offset, right? That's why you currently have the start() 
method in TermVectorTermsWriterPerField that checks if offsets need to be 
written or not.
I think the issue here is that one consumer, the DocInverterPerField, deals 
with one value, the offset, that multiple subconsumers might want to use. 
That's why we don't want to throw an exception in DocInverterPerField if 
OffsetAttribute is not present, because it can't know if this is really a 
misconfiguration or not. Only the actual subconsumer (in this case 
TermVectorTermsWriterPerField) is able to decide that. For example, if someone 
writes a new custom consumer that can only work if OffsetAttribute is present, 
then we would probably want that consumer to throw an exception when it's not 
there. But if DocInverterPerField adds the OffsetAttribute just to be able to 
use its default value, we would never get that exception. 
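That division of responsibility could be sketched like this (the class names below are illustrative stand-ins, not the actual Lucene code): the shared inverter-level code only reports whether the attribute was present, and each subconsumer alone decides what absence means.

```java
// Illustrative sketch, not the actual Lucene classes: the shared
// DocInverter-level code passes along attribute presence, and each
// subconsumer decides on its own whether absence is a misconfiguration.
interface SubConsumer {
    void start(boolean offsetPresent);
}

// Tolerant subconsumer, like TermVectorTermsWriterPerField deciding in
// start() whether offsets need to be written: absence just means "skip them".
class TermVectorWriter implements SubConsumer {
    boolean writeOffsets;

    public void start(boolean offsetPresent) {
        writeOffsets = offsetPresent;
    }
}

// Strict subconsumer: a hypothetical custom consumer that cannot work
// without offsets, so for it absence really is a misconfiguration.
class OffsetRequiringWriter implements SubConsumer {
    public void start(boolean offsetPresent) {
        if (!offsetPresent) {
            throw new IllegalStateException("OffsetAttribute is required");
        }
    }
}
```

Note that this only works as long as the inverter can still observe absence; if DocInverterPerField itself calls addAttribute() first, the strict writer's check is defeated.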

But maybe all this is fine and we should just say we strictly separate the 
Attributes from the configuration, which means that the knowledge about the 
presence or absence of an Attribute may not be used by anybody (filter or 
consumer) to determine anything about the configuration. This is how the 
TermVectors currently work. The configuration (whether or not to store 
positions and/or offsets) is stored in the Field, the data is carried by the 
OffsetAttribute. If we decide to define the API this way, do we even need 
AttributeSource.hasAttribute() and getAttribute()? Or wouldn't addAttribute() 
be sufficient then?
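Under that strict separation, the consumer would never inspect attribute presence at all; only the configuration decides what gets written, roughly like this (FieldConfig, TermVectorConsumer, and the method names are invented for the sketch):

```java
// Sketch of the strict separation: configuration lives outside the stream,
// the attribute only carries data. All names here are made up for the
// example; this is not the actual Lucene code.
class FieldConfig {
    final boolean storeOffsets;

    FieldConfig(boolean storeOffsets) {
        this.storeOffsets = storeOffsets;
    }
}

class OffsetAttribute {
    int startOffset = 0;  // defaults live here, in one place
    int endOffset = 0;
}

class TermVectorConsumer {
    // The consumer would obtain `offset` via addAttribute(), which hands
    // back a default-valued instance if no producer added one; only the
    // field configuration says whether offsets get stored.
    String consume(FieldConfig config, OffsetAttribute offset) {
        if (config.storeOffsets) {
            return "offsets " + offset.startOffset + "-" + offset.endOffset;
        }
        return "no offsets";
    }
}
```

With that rule, addAttribute() really would be the only lookup a filter or consumer ever needs.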

> New TokenStream API
> -------------------
>
>                 Key: LUCENE-1422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1422
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: lucene-1422-take4.patch, lucene-1422.patch, 
> lucene-1422.take2.patch, lucene-1422.take3.patch, lucene-1422.take3.patch
>
>
> This is a very early version of the new TokenStream API that 
> we started to discuss here:
> http://www.gossamer-threads.com/lists/lucene/java-dev/66227
> This implementation is a bit different from what I initially
> proposed in the thread above. I introduced a new class called
> AttributedToken, which contains the same termBuffer logic 
> from Token. In addition it has a lazily-initialized map of
> Class<? extends Attribute> -> Attribute. Attribute is also a
> new class in a new package, plus several implementations like
> PositionIncrementAttribute, PayloadAttribute, etc.
> Similar to my initial proposal is the prototypeToken() method
> which the consumer (e.g. DocumentsWriter) needs to call.
> The token is created by the tokenizer at the end of the chain
> and pushed through all filters to the end consumer. The 
> tokenizer and also all filters can add Attributes to the 
> token and can keep references to the actual types of the
> attributes that they need to read or modify. This way, when
> boolean nextToken() is called, no casting is necessary.
> I added a class called TestNewTokenStreamAPI which is not 
> really a test case yet, but has a static demo() method, which
> demonstrates how to use the new API.
> The reason to not merge Token and TokenStream into one class 
> is that we might have caching (or tee/sink) filters in the 
> chain that might want to store cloned copies of the tokens
> in a cache. I added a new class NewCachingTokenStream that
> shows how such a class could work. I also implemented a deep
> clone method in AttributedToken and a 
> copyFrom(AttributedToken) method, which is needed for the 
> caching. Both methods have to iterate over the list of 
> attributes. The Attribute subclasses themselves also have a
> copyFrom(Attribute) method, which unfortunately has to down-
> cast to the actual type. I first thought that might be very
> inefficient, but it's not so bad. Well, if you add all
> Attributes to the AttributedToken that our old Token class
> had (like offsets, payload, posIncr), then the performance
> of the caching is somewhat slower (~40%). However, if you 
> add fewer attributes, because not all might be needed, then
> the performance is even slightly faster than with the old API.
> Also the new API is flexible enough so that someone could
> implement a custom caching filter that knows all attributes
> the token can have, then the caching should be just as 
> fast as with the old API.
> This patch is not nearly ready; there are lots of things 
> missing:
> - unit tests
> - change DocumentsWriter to use new API 
>   (in backwards-compatible fashion)
> - patch is currently java 1.5; need to change before 
>   committing to 2.9
> - all TokenStreams and -Filters should be changed to use 
>   new API
> - javadocs incorrect or missing
> - hashCode() and equals() methods missing in Attributes and 
>   AttributedToken
>   
> I wanted to submit it already for brave people to give me 
> early feedback before I spend more time working on this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

