RE: new TokenStream api Question

Uwe Schindler Tue, 28 Apr 2009 01:46:27 -0700

Haha, isn't it funny, the same idea came to me on Sunday afternoon after I
replied to Eks Dev. But I threw it away, because interfaces are not
liked here. :-)


 

This new interface may also let us get rid of these useNewAPI() calls,
as the old TokenStream methods could easily be implemented/wrapped using the
standard Token instance, too. About the "interface problem": We do not have
to think about interface extensions in the future. If someone needs a new
attribute member, they can just invent a new Attribute and add it (like
ShiftAttribute in TrieRange). An interface, once defined, does not need to
be changed anymore.

 

The new API then needs some "factory" to generate the attribute instances,
e.g. if one adds all 4 attributes (term, posincr, offset, type), only one
instance must be created and all mappings in the interface point to this
instance. Do you have an idea how to implement this? It should be
extensible, so each TokenStream can register its own factory, but maybe
with some default factory otherwise.

 

+1

 

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

  _____  

From: Michael Busch [mailto:[email protected]] 
Sent: Tuesday, April 28, 2009 10:23 AM
To: [email protected]
Subject: Re: new TokenStream api Question

 

Hi Eks Dev,

I actually started experimenting with changing the new API slightly to
overcome one drawback: with the variables now distributed over various
Attribute classes (vs. being in a single class Token previously), cloning a
"Token" (i.e. calling captureState()) is more expensive. This slows down the
CachingTokenFilter and Tee/Sink-TokenStreams.

So I was thinking about introducing interfaces for each of the Attributes.
E.g. OffsetAttribute would then be an interface with all current methods,
and OffsetAttributeImpl would be its implementation. The user would still
use the API in exactly the same way as now, that is, by calling e.g.
addAttribute(OffsetAttribute.class), and the code takes care of
instantiating the right class. However, there would then also be an API to
pass in an actual instance, and this API would use reflection to find all
interfaces that the instance implements. All of those interfaces that
extend the Attribute interface would be added to the AttributeSource map,
with the instance as the value.
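
A minimal sketch of the registration step described above, written as
standalone Java rather than against the Lucene code base. The names
AttributeSourceSketch, addAttributeImpl and CombinedImpl are hypothetical,
and the Attribute, OffsetAttribute and TypeAttribute interfaces are
simplified stand-ins, not the real Lucene types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical marker interface standing in for Lucene's Attribute.
interface Attribute {}

// Simplified attribute interfaces; the method sets are illustrative only.
interface OffsetAttribute extends Attribute {
    int startOffset();
    int endOffset();
    void setOffset(int start, int end);
}

interface TypeAttribute extends Attribute {
    String type();
    void setType(String type);
}

// One object that implements several attribute interfaces at once.
class CombinedImpl implements OffsetAttribute, TypeAttribute {
    private int start, end;
    private String type = "word";
    public int startOffset() { return start; }
    public int endOffset() { return end; }
    public void setOffset(int s, int e) { start = s; end = e; }
    public String type() { return type; }
    public void setType(String t) { type = t; }
}

class AttributeSourceSketch {
    private final Map<Class<? extends Attribute>, Attribute> attributes =
        new LinkedHashMap<>();

    // Register the instance under every Attribute sub-interface it implements,
    // walking up the class hierarchy and through super-interfaces.
    void addAttributeImpl(Attribute instance) {
        for (Class<?> c = instance.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                register(iface, instance);
            }
        }
    }

    private void register(Class<?> iface, Attribute instance) {
        if (iface != Attribute.class && Attribute.class.isAssignableFrom(iface)) {
            attributes.putIfAbsent(iface.asSubclass(Attribute.class), instance);
        }
        for (Class<?> parent : iface.getInterfaces()) {
            register(parent, instance);
        }
    }

    // Returns the registered instance for the key, or null if there is none.
    <T extends Attribute> T getAttribute(Class<T> key) {
        return key.cast(attributes.get(key));
    }
}

class ReflectionRegistrationDemo {
    public static void main(String[] args) {
        AttributeSourceSketch source = new AttributeSourceSketch();
        source.addAttributeImpl(new CombinedImpl());
        Object offset = source.getAttribute(OffsetAttribute.class);
        Object type = source.getAttribute(TypeAttribute.class);
        System.out.println(offset == type); // prints true: one shared instance
    }
}
```

Both interface keys map to the same object here, which is the property the
cloning argument in this mail relies on.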

Then the Token class would implement all six attribute interfaces. An expert
user could decide to pass in a Token instance instead of calling
addAttribute(TermAttribute.class), addAttribute(PayloadAttribute.class), ...
Then the attribute source would only contain a single instance that needs to
be cloned in captureState(), making cloning much faster. And a (probably
also expert) user could even implement their own class that implements
exactly the necessary interfaces (maybe only 3 of the 6 provided), and make
cloning faster than it is even with the old Token-based API.
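
To illustrate why a single shared instance makes captureState() cheaper,
here is a small standalone sketch. It is not the Lucene implementation:
CaptureSketch, TokenLike, TermImpl, OffsetImpl and the copy() method are
hypothetical names, and the interfaces are reduced to one getter each. The
sketch copies each distinct instance once, however many interface keys
point to it.

```java
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-ins; copy() is an assumed replacement for clone().
interface Attribute { Attribute copy(); }
interface TermAttribute extends Attribute { String term(); }
interface OffsetAttribute extends Attribute { int startOffset(); }

// Separate objects per attribute.
class TermImpl implements TermAttribute {
    private final String term;
    TermImpl(String term) { this.term = term; }
    public String term() { return term; }
    public Attribute copy() { return new TermImpl(term); }
}

class OffsetImpl implements OffsetAttribute {
    private final int start;
    OffsetImpl(int start) { this.start = start; }
    public int startOffset() { return start; }
    public Attribute copy() { return new OffsetImpl(start); }
}

// One Token-like object that implements both interfaces.
class TokenLike implements TermAttribute, OffsetAttribute {
    private final String term;
    private final int start;
    TokenLike(String term, int start) { this.term = term; this.start = start; }
    public String term() { return term; }
    public int startOffset() { return start; }
    public Attribute copy() { return new TokenLike(term, start); }
}

class CaptureSketch {
    final Map<Class<? extends Attribute>, Attribute> attributes =
        new LinkedHashMap<>();
    int copiesMade = 0;

    // Snapshot the attributes; each distinct instance is copied only once.
    Map<Class<? extends Attribute>, Attribute> captureState() {
        Map<Attribute, Attribute> copies = new IdentityHashMap<>();
        Map<Class<? extends Attribute>, Attribute> state = new LinkedHashMap<>();
        for (Map.Entry<Class<? extends Attribute>, Attribute> e
                : attributes.entrySet()) {
            Attribute copy = copies.get(e.getValue());
            if (copy == null) {
                copy = e.getValue().copy();
                copies.put(e.getValue(), copy);
                copiesMade++;
            }
            state.put(e.getKey(), copy);
        }
        return state;
    }
}

class CaptureDemo {
    public static void main(String[] args) {
        CaptureSketch separate = new CaptureSketch();
        separate.attributes.put(TermAttribute.class, new TermImpl("foo"));
        separate.attributes.put(OffsetAttribute.class, new OffsetImpl(0));
        separate.captureState();

        CaptureSketch shared = new CaptureSketch();
        TokenLike token = new TokenLike("foo", 0);
        shared.attributes.put(TermAttribute.class, token);
        shared.attributes.put(OffsetAttribute.class, token);
        shared.captureState();

        // prints "2 vs 1"
        System.out.println(separate.copiesMade + " vs " + shared.copiesMade);
    }
}
```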

And of course in your case you could also just create a different
implementation of such an interface, right? I think what's nice about this
change is that it doesn't make it more complicated to use the TokenStream
API, and the indexing pipeline still uses it the same way too, yet it's more
extensible for expert users and makes it possible to achieve the same or
even better cloning performance.

I will open a new Jira issue for this soon. But I'd be happy to hear
feedback about the proposed changes, and especially if you think these
changes would help you for your use case.

-Michael

On 4/27/09 1:49 PM, eks dev wrote: 

Should I create a patch with something like this? 
 
With "Expert" javadoc and an explanation of what it is good for, this
should be a nice addition to the Attribute cases.
Practically, it would enable specialization of "hard linked" Attributes like
TermAttribute. 
 
The only preconditions are: 
 
- "Specialized Attribute" must extend one of the "hard linked" ones, and
provide the class of the one it extends
- Must implement a default constructor
- Should extend without introducing state (the big majority of cases), so
as not to break captureState()
 
The last one could be relaxed, I guess, but I am not yet 100% familiar with
this code.
 
Use cases for this are along the lines of my example, smaller, easier user
code and performance (token filters mainly)
 
 
 
----- Original Message ----
  

From: Uwe Schindler <[email protected]>
To: [email protected]
Sent: Sunday, 26 April, 2009 23:03:06
Subject: RE: new TokenStream api Question
 
There is one problem: if you extend TermAttribute, the class is different
(which is the key in the attributes list). So when you initialize the
TokenStream and do a
 
YourClass termAtt = (YourClass) addAttribute(YourClass.class)
 
...you create a new attribute. So one possibility would be to also specify
the instance and save the attribute by class (as key), but with your
instance. If you are the first one that creates the attribute (if it is a
token stream and not a filter it is ok, you will be the first, as it adds
the attribute in the ctor), everything is ok. Register the attribute
yourself (maybe we should add a specialized addAttribute that can specify an
instance as default)?:
 
YourClass termAtt = new YourClass();
attributes.put(TermAttribute.class, termAtt);
 
In this case, for the indexer it is a standard TermAttribute, but you can
do more with it.
 
Replacing TermAttribute with your own class (one that does not extend it)
is not possible, as the indexer will get a ClassCastException when using
the instance retrieved with getAttribute(TermAttribute.class).
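
A minimal standalone sketch of the idea above: the subclass instance is
stored under the base class key, so the indexer still sees a TermAttribute
while the filter keeps its specialized reference. KeyedAttributes and
PrefixTermAttribute are hypothetical names, and the TermAttribute class
here is a simplified stand-in, not the Lucene class.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the concrete TermAttribute class in this thread.
class TermAttribute {
    private char[] buffer = new char[0];
    private int length;
    public void setTermBuffer(String s) {
        buffer = s.toCharArray();
        length = buffer.length;
    }
    public char[] termBuffer() { return buffer; }
    public int termLength() { return length; }
}

// Specialization that adds behavior but no new state.
class PrefixTermAttribute extends TermAttribute {
    public boolean startsWith(char c) {
        return termLength() > 0 && termBuffer()[0] == c;
    }
}

// Hypothetical attribute map keyed by class, as in "attributes.put(...)".
class KeyedAttributes {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    <T> void put(Class<T> key, T instance) {
        attributes.put(key, instance);
    }

    // Returns the instance stored under the key, or null if there is none.
    <T> T getAttribute(Class<T> key) {
        return key.cast(attributes.get(key));
    }
}

class SubclassRegistrationDemo {
    public static void main(String[] args) {
        KeyedAttributes attributes = new KeyedAttributes();
        PrefixTermAttribute termAtt = new PrefixTermAttribute();
        attributes.put(TermAttribute.class, termAtt); // base class is the key
        termAtt.setTermBuffer("-skip");

        // The indexer's view: a plain TermAttribute, no cast problems.
        TermAttribute seenByIndexer = attributes.getAttribute(TermAttribute.class);
        System.out.println(seenByIndexer.termLength()); // prints 5

        // The filter's view: the same object with the extra method.
        System.out.println(termAtt.startsWith('-')); // prints true

        // The subclass itself is not a key, so this lookup finds nothing.
        System.out.println(attributes.getAttribute(PrefixTermAttribute.class)); // prints null
    }
}
```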
 
Uwe
 
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]
 
    

-----Original Message-----
From: eks dev [mailto:[email protected]]
Sent: Sunday, April 26, 2009 10:39 PM
To: [email protected]
Subject: new TokenStream api Question
 
 
I am just looking into the new TermAttribute usage and wonder what would be
the best way to implement a PrefixFilter that would filter out Terms
that have some prefix,
 
something like this, where '-' represents my prefix:
 
  public final boolean incrementToken() throws IOException {
    // advance until we find a token we want to keep
    while (input.incrementToken()) {
      int len = termAtt.termLength();

      // keep only non-empty terms that do not start with the '-' prefix
      if (len > 0 && termAtt.termBuffer()[0] != '-') {
        return true;
      }
      // note: else we ignore it
    }
    // reached EOS
    return false;
  }
 
 
 
 
 
The question would be:
 
can I extend TermAttribute and add boolean startsWith(char c);
 
The point is speed, and my code gets smaller.
termLength() and termBuffer() in TermAttribute both call one method that I
do not understand (back compatibility, I guess):
  public int termLength() {
    initTermBuffer(); // I'd like to avoid it...
    return termLength;
  }
 
 
I'd like to get rid of initTermBuffer(). The first option is to *extend*
the TermAttribute code (but the fields are private, so no help there), or
can I implement my own MyTermAttribute (will the Indexer know how to deal
with it?)
 
Must I extend TermAttribute, or can I add my own?
 
thanks,
eks
 
 
 
 
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
      

 
 