Thanks, Michael.
LUCENE-1301 was exactly what I was looking for to complete my understanding.
All is clear now.
On Sun, Oct 12, 2008 at 10:42 AM, Michael Busch <[EMAIL PROTECTED]> wrote:
> Hi Shai,
>
> I'm going to shuffle your email a bit to answer in a different order...
>
> Shai Erera wrote:
> > Perhaps what you write below is linked to another thread (on flexible
> > indexing maybe?) which I'm not aware of, so I'd appreciate it if you
> > could give me a reference.
>
> Mike recently refactored the DocumentsWriter into a bunch of new classes
> (see LUCENE-1301). In fact, we now have an indexer chain, in which you can
> plug in different modules that do something (well, we still have to add an
> API to make use of the chain...)
>
> For example, there are currently two TermsHashConsumers in the default
> chain: the FreqProxTermsWriter and the TermVectorTermsWriter. Both consume
> the tokens from a TokenStream and write the different data structures.
>
> We could, for example, write a SpanPostingTermsHashConsumer that writes not
> only the start position of a token but also the number of covered
> positions. We could introduce a new interface:
>
> public interface SpanAttribute extends PositionIncrementAttribute {
>   public int getLength();
>   public void setLength(int length);
> }
>
> Only the SpanPostingTermsHashConsumer would need to know the SpanAttribute.
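A minimal sketch of what that interface and a span-aware token could look like (SpanToken is an illustrative name, not a real Lucene class; note that the setter returns void):

```java
// Sketch of the proposed attribute interfaces; all names are illustrative.
interface PositionIncrementAttribute {
  void setPositionIncrement(int positionIncrement);
  int getPositionIncrement();
}

// SpanAttribute adds the number of positions a token covers.
interface SpanAttribute extends PositionIncrementAttribute {
  int getLength();
  void setLength(int length);
}

// A token type that only the span-aware consumer would cast to.
class SpanToken implements SpanAttribute {
  private int positionIncrement = 1;
  private int length = 1;

  public void setPositionIncrement(int pi) { this.positionIncrement = pi; }
  public int getPositionIncrement() { return positionIncrement; }
  public int getLength() { return length; }
  public void setLength(int length) { this.length = length; }
}
```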
>
>
>> BTW, what I didn't understand from your description is how the
>> indexing part knows which attributes my Token supports. For example, let's
>> say I create a Token which implements only position increments, no payload,
>> and perhaps some other custom attribute. I generate a TokenStream returning
>> this Token type.
>> How will Lucene's indexing mechanism know my Token supports only position
>> increments, and especially the custom attribute? What will it do with that
>> custom attribute?
>>
>
> The advantage is that the different consumers actually don't need to know
> the exact type of the Token. Each consumer can check via instanceof whether
> the prototype Token implements the interface(s) that the consumer needs. If
> not, the consumer can simply skip the tokens for that particular field.
> Alternatively, we could say that the user needs to make sure that the
> appropriate prototype Token is generated for the indexing chain that is
> configured; otherwise Lucene throws an Exception.
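A minimal sketch of that instanceof check, using stand-in types (PayloadToken and PayloadConsumer are illustrative names, not real Lucene classes):

```java
// Stand-in types for the sketch; the real Token lives in Lucene.
interface PayloadAttribute {
  byte[] getPayload();
}

class Token { }

// A Token subclass that happens to support payloads.
class PayloadToken extends Token implements PayloadAttribute {
  public byte[] getPayload() { return new byte[] { 42 }; }
}

class PayloadConsumer {
  // The consumer checks once, on the prototype, whether the attribute it
  // needs is implemented; if not, it skips the field entirely.
  boolean accepts(Token prototype) {
    return prototype instanceof PayloadAttribute;
  }
}
```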
>
> I think the main advantage here is that we can implement consumers that
> only care about particular attributes. Btw, Doug actually had a very
> similar idea for the Token class, which he mentioned almost two years ago:
> http://www.gossamer-threads.com/lists/lucene/java-dev/43486#43486
>
> > In 3.0 you plan to move to Java 1.5, right? Couldn't you use Java
> > generics then? Have the calling application pass in the Token
> > type it wants to use, and then the consumer does not need to cast
> > anything ...
>
> That only works if we keep the current design in which the consumer has to
> create the Token. But what do you do if you have more than one consumer
> (e.g., adding a new TermsHashConsumer into the chain)?
>
> -Michael
>
>
>> Shai
>>
>>
>> On Sun, Oct 12, 2008 at 1:33 AM, Michael Busch <[EMAIL PROTECTED]<mailto:
>> [EMAIL PROTECTED]>> wrote:
>>
>> Hi,
>>
>> I've been thinking about making the TokenStream and Token APIs more
>> flexible. For example, for fields that don't store positions, the Token
>> doesn't need to have a positionIncrement or a payload. With flexible
>> indexing, on the other hand, people might want to add custom
>> attributes to a Token that a consumer in the indexing chain could
>> then use.
>>
>> Of course it is possible to extend Token, because it is not final,
>> and add additional attributes to it. But then consumers of the
>> TokenStream must downcast every instance of the Token object when
>> they call next(Token).
>>
>> I was therefore thinking about a different TokenStream API:
>>
>> public abstract class TokenStream {
>>   public abstract boolean nextToken() throws IOException;
>>
>>   public abstract Token prototypeToken() throws IOException;
>>
>>   public void reset() throws IOException {}
>>
>>   public void close() throws IOException {}
>> }
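To make the proposed base class concrete, here is a toy, self-contained implementation: a whitespace tokenizer over a String. The Token stand-in keeps only the termBuffer logic, and everything beyond the proposed API is illustrative:

```java
import java.io.IOException;

// Token stand-in keeping only the termBuffer logic from the proposal.
class Token {
  private char[] termBuffer = new char[16];
  private int termLength;

  void setTerm(String s) {
    if (s.length() > termBuffer.length) termBuffer = new char[s.length()];
    s.getChars(0, s.length(), termBuffer, 0);
    termLength = s.length();
  }

  public char[] termBuffer() { return termBuffer; }
  public int termLength() { return termLength; }
}

// The proposed TokenStream API, as sketched above.
abstract class TokenStream {
  public abstract boolean nextToken() throws IOException;
  public abstract Token prototypeToken() throws IOException;
  public void reset() throws IOException {}
  public void close() throws IOException {}
}

// Toy stream: splits a String on whitespace and reuses a single Token.
class WhitespaceStream extends TokenStream {
  private final String[] words;
  private int pos = 0;
  private final Token token = new Token();

  WhitespaceStream(String input) { words = input.split("\\s+"); }

  public Token prototypeToken() { return token; }

  public boolean nextToken() {
    if (pos >= words.length) return false;
    token.setTerm(words[pos++]);  // fill the shared Token in place
    return true;
  }
}
```

Note that the same Token instance is handed out once and refilled on every nextToken() call, which is what makes the one-time cast in the consumer possible.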
>>
>> Furthermore, Token itself would only keep the termBuffer logic, and we
>> could introduce different interfaces, like:
>>
>> public interface PayloadAttribute {
>>   /**
>>    * Returns this Token's payload.
>>    */
>>   public Payload getPayload();
>>
>>   /**
>>    * Sets this Token's payload.
>>    */
>>   public void setPayload(Payload payload);
>> }
>>
>> public interface PositionIncrementAttribute {
>>   /** Set the position increment. This determines the position of
>>    *  this token relative to the previous Token in a
>>    *  {@link TokenStream}, used in phrase searching.
>>    */
>>   public void setPositionIncrement(int positionIncrement);
>>
>>   /** Returns the position increment of this Token.
>>    *  @see #setPositionIncrement
>>    */
>>   public int getPositionIncrement();
>> }
>>
>> A consumer, e.g. the DocumentsWriter, no longer creates a Token
>> instance itself, but rather calls prototypeToken(). This
>> method returns a Token subclass which implements all desired
>> *Attribute interfaces.
>>
>> If a consumer is, for example, only interested in the positionIncrement
>> and Payload, it can consume the tokens like this:
>>
>> public class Consumer {
>>   public void consumeTokens(TokenStream ts) throws IOException {
>>     Token token = ts.prototypeToken();
>>
>>     PayloadAttribute payloadSource = (PayloadAttribute) token;
>>     PositionIncrementAttribute positionSource =
>>         (PositionIncrementAttribute) token;
>>
>>     while (ts.nextToken()) {
>>       char[] term = token.termBuffer();
>>       int termLength = token.termLength();
>>       int positionIncrement = positionSource.getPositionIncrement();
>>       Payload payload = payloadSource.getPayload();
>>
>>       // do something with the term, positionIncrement and payload
>>     }
>>   }
>> }
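Putting the pieces together, a self-contained toy version of this pattern might look like the following (Payload is simplified to a byte[], and the stream and token classes are stand-ins, not real Lucene code):

```java
// Self-contained toy version of the consumer pattern; names are illustrative.
interface PayloadAttribute {
  byte[] getPayload();
  void setPayload(byte[] payload);
}

interface PositionIncrementAttribute {
  void setPositionIncrement(int positionIncrement);
  int getPositionIncrement();
}

class Token {
  char[] termBuffer = new char[0];
  int termLength;
}

// A Token subclass implementing exactly the attributes this consumer needs.
class RichToken extends Token implements PayloadAttribute, PositionIncrementAttribute {
  private byte[] payload;
  private int positionIncrement = 1;

  public byte[] getPayload() { return payload; }
  public void setPayload(byte[] p) { this.payload = p; }
  public void setPositionIncrement(int pi) { this.positionIncrement = pi; }
  public int getPositionIncrement() { return positionIncrement; }
}

abstract class TokenStream {
  public abstract boolean nextToken();
  public abstract Token prototypeToken();
}

// Streams three fixed terms, each one position apart, reusing one RichToken.
class ListStream extends TokenStream {
  private final String[] terms = { "a", "b", "c" };
  private int i = -1;
  private final RichToken token = new RichToken();

  public Token prototypeToken() { return token; }

  public boolean nextToken() {
    if (++i >= terms.length) return false;
    token.termBuffer = terms[i].toCharArray();
    token.termLength = terms[i].length();
    token.setPositionIncrement(1);
    return true;
  }
}

class Consumer {
  // Casts once on the prototype, then iterates, summing position increments.
  int lastPosition(TokenStream ts) {
    Token token = ts.prototypeToken();
    PositionIncrementAttribute positionSource =
        (PositionIncrementAttribute) token;
    int position = 0;
    while (ts.nextToken()) {
      position += positionSource.getPositionIncrement();
    }
    return position;
  }
}
```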
>>
>> Casting is now only done once, after the prototype token is created.
>> Now if you want to add another consumer to the indexing chain and
>> realize that you want to add another attribute to the Token, then
>> you don't have to change this consumer. You only need to create
>> another Token subclass that implements the new attribute in addition
>> to the previous ones and use it in the new consumer.
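For instance, a new consumer needing a hypothetical FlagsAttribute could be added without touching existing consumers; only the new token subclass and the new consumer know about the new attribute (all names here are illustrative):

```java
// Adding a new attribute without changing existing consumers (illustrative).
interface PositionIncrementAttribute {
  int getPositionIncrement();
}

interface FlagsAttribute {
  int getFlags();
}

class Token { }

// The old token only knows about positions...
class PositionToken extends Token implements PositionIncrementAttribute {
  public int getPositionIncrement() { return 1; }
}

// ...the new token adds flags on top, so old consumers keep working unchanged.
class FlaggedToken extends PositionToken implements FlagsAttribute {
  public int getFlags() { return 0x1; }
}

class FlagsConsumer {
  // Only the new consumer casts to the new attribute; it skips tokens
  // whose prototype doesn't implement it.
  int readFlags(Token prototype) {
    return (prototype instanceof FlagsAttribute)
        ? ((FlagsAttribute) prototype).getFlags()
        : 0;
  }
}
```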
>>
>> I haven't tried to implement this yet, and maybe there are things I
>> haven't thought about (like caching TokenFilters). I'd like to get
>> some feedback on these APIs first to see if this makes sense.
>>
>> Btw: if we think this (or another) approach to changing these APIs
>> makes sense, then it would be good to make the change in 3.0, when we
>> can break backwards compatibility. And then we should also rethink the
>> Fieldable/AbstractField/Field/FieldInfos APIs for 3.0 and flexible
>> indexing!
>>
>> -Michael
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> <mailto:[EMAIL PROTECTED]>
>> For additional commands, e-mail: [EMAIL PROTECTED]
>> <mailto:[EMAIL PROTECTED]>
>>
>>
>>
>