Hi Shai,

I'm going to shuffle your email a bit to answer in a different order...

Shai Erera wrote:
> Perhaps what you write below is linked to another thread (on flexible
> indexing maybe?) which I'm not aware of, so I'd appreciate if you can
> give me a reference.

Mike recently refactored the DocumentsWriter into a bunch of new classes (see LUCENE-1301). In fact, we now have an indexer chain, in which you can plug in different modules that each do part of the indexing work (well, we still have to add an API to make use of the chain...)

For example, there are currently two TermsHashConsumers in the default chain: the FreqProxTermsWriter and the TermVectorTermsWriter. Both consume the tokens from a TokenStream and write their respective data structures.

We could, for example, write a SpanPostingTermsHashConsumer that writes not only the start position of a token but also the number of positions it covers. We could introduce a new interface:
  public interface SpanAttribute extends PositionIncrementAttribute {
    public int getLength();
    public void setLength(int length);
  }

Only the SpanPostingTermsHashConsumer would need to know the SpanAttribute.
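
To make this concrete, such a consumer might look roughly like the following. This is only a sketch against the prototype-Token API from my original mail below; SpanPostingTermsHashConsumer and its plumbing are hypothetical:

  public class SpanPostingTermsHashConsumer {
    public void consumeTokens(TokenStream ts) throws IOException {
      Token token = ts.prototypeToken();
      // cast once; only this consumer needs to know SpanAttribute
      SpanAttribute spanSource = (SpanAttribute) token;

      while (ts.nextToken()) {
        int positionIncrement = spanSource.getPositionIncrement();
        int spanLength = spanSource.getLength(); // covered positions
        // write the start position and span length to the postings
      }
    }
  }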


> BTW, what I didn't understand from your description is how the
> indexing part knows which attributes my Token supports. For example,
> let's say I create a Token which implements only position increments,
> no payload, and perhaps some other custom attribute. I generate a
> TokenStream returning this Token type. How will Lucene's indexing
> mechanism know my Token supports only position increments, and
> especially the custom attribute? What will it do with that custom
> attribute?

The advantage is that the different consumers don't actually need to know the exact type of the Token. Each consumer can check via instanceof whether the prototype Token implements the interface(s) that the consumer needs. If not, the consumer can simply skip processing the tokens for that particular field. Alternatively, we could say that the user has to make sure the appropriate prototype Token is generated for the indexing chain that is configured; otherwise Lucene throws an Exception.
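
In code, that check could look like this (a sketch only; startField() and the surrounding field handling are made up for illustration):

  private PayloadAttribute payloadSource; // null if unsupported

  public void startField(Token prototype) {
    if (prototype instanceof PayloadAttribute) {
      this.payloadSource = (PayloadAttribute) prototype;
    } else {
      // the prototype doesn't support payloads: skip payload
      // processing for this field
      this.payloadSource = null;
    }
  }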

I think the main advantage here is that we can implement consumers that only care about particular attributes. Btw, Doug actually had a very similar idea for the Token class almost two years ago:
http://www.gossamer-threads.com/lists/lucene/java-dev/43486#43486

> In 3.0 you plan to move to Java 1.5, right? Couldn't you use the Java
> templates then? Have the calling application pass in the Token
> template it wants to use and then the consumer does not need to cast
> anything ...

That only works if we keep the current design, in which the consumer has to create the Token. But what do you do if you have more than one consumer (e.g. when adding a new TermsHashConsumer to the chain)?
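
To illustrate the problem (purely hypothetical code; none of these names exist in Lucene): with generics, the stream has a single type parameter, so every consumer in the chain is forced onto the same Token type:

  public abstract class GenericTokenStream<T extends Token> {
    // the consumer creates the Token and passes it in
    public abstract boolean next(T token) throws IOException;
  }

  // Consumer A needs a T implementing PayloadAttribute; consumer B
  // needs a T implementing SpanAttribute. A single
  // GenericTokenStream<T> can only be parameterized with one T, so T
  // has to implement the union of all attributes that any consumer
  // needs -- exactly the coupling the prototypeToken() approach
  // avoids.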

-Michael


On Sun, Oct 12, 2008 at 1:33 AM, Michael Busch <[EMAIL PROTECTED]> wrote:

    Hi,

    I've been thinking about making the TokenStream and Token APIs more
    flexible. E. g. for fields that don't store positions, the Token
    doesn't need to have a positionIncrement or a payload. With flexible
    indexing, on the other hand, people might want to add custom
    attributes to a Token that a consumer in the indexing chain could
    then use.

    Of course it is possible to extend Token, because it is not final,
    and add additional attributes to it. But then consumers of the
    TokenStream must downcast every instance of the Token object when
    they call next(Token).

    I was therefore thinking about a different TokenStream API:

     public abstract class TokenStream {
       public abstract boolean nextToken() throws IOException;

       public abstract Token prototypeToken() throws IOException;

       public void reset() throws IOException {}

       public void close() throws IOException {}
     }

    Furthermore, Token itself would keep only the termBuffer logic, and
    we could introduce different interfaces, like:

     public interface PayloadAttribute {
       /**
        * Returns this Token's payload.
        */
       public Payload getPayload();

       /**
        * Sets this Token's payload.
        */
       public void setPayload(Payload payload);
     }

     public interface PositionIncrementAttribute {
       /** Set the position increment.  This determines the position of
        *  this token relative to the previous Token in a
        * {@link TokenStream}, used in phrase searching.
        */
       public void setPositionIncrement(int positionIncrement);

       /** Returns the position increment of this Token.
        * @see #setPositionIncrement
        */
       public int getPositionIncrement();
     }

    A consumer, e. g. the DocumentsWriter, does not create a Token
    instance itself anymore, but rather calls prototypeToken(). This
    method returns a Token subclass which implements all desired
    *Attribute interfaces.
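
    Such a subclass could look like this (a minimal sketch; the class
    name is made up):

     public class PayloadPositionToken extends Token
         implements PayloadAttribute, PositionIncrementAttribute {
       private Payload payload;
       private int positionIncrement = 1;

       public Payload getPayload() { return payload; }
       public void setPayload(Payload payload) { this.payload = payload; }

       public int getPositionIncrement() { return positionIncrement; }
       public void setPositionIncrement(int positionIncrement) {
         this.positionIncrement = positionIncrement;
       }
     }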

    If a consumer is e. g. only interested in the positionIncrement and
    Payload, it can consume the tokens like this:

     public class Consumer {
       public void consumeTokens(TokenStream ts) throws IOException {
         Token token = ts.prototypeToken();

         PayloadAttribute payloadSource = (PayloadAttribute) token;
         PositionIncrementAttribute positionSource =
                       (PositionIncrementAttribute) token;

         while (ts.nextToken()) {
           char[] term = token.termBuffer();
           int termLength = token.termLength();
           int positionIncrement = positionSource.getPositionIncrement();
           Payload payload = payloadSource.getPayload();

           // do something with the term, positionIncrement and payload
         }
       }
     }

    Casting is now done only once, after the prototype token has been
    created. Now if you want to add another consumer to the indexing
    chain and realize that you want to add another attribute to the
    Token, you don't have to change this consumer. You only need to
    create another Token subclass that implements the new attribute in
    addition to the previous ones and use it in the new consumer.
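
    For instance, reusing the subclass sketched above, a new attribute
    could be added without touching the existing consumer (again, all
    names are illustrative only):

     public interface LengthAttribute {
       public int getLength();
       public void setLength(int length);
     }

     public class LengthToken extends PayloadPositionToken
         implements LengthAttribute {
       private int length;

       public int getLength() { return length; }
       public void setLength(int length) { this.length = length; }
     }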

    I haven't tried to implement this yet, and maybe there are things I
    haven't thought about (like caching TokenFilters). I'd like to get
    some feedback about these APIs first, to see whether this makes
    sense.

    Btw: if we think this (or another) approach to changing these APIs
    makes sense, then it would be good to make the change in 3.0, when
    we can break backwards compatibility. And then we should also
    rethink the Fieldable/AbstractField/Field/FieldInfos APIs for 3.0
    and flexible indexing!

    -Michael
