Hi Michael,
Sure, interfaces are the solution to this. They define what the Lucene core expects 
from these entities and give people the freedom to provide any implementation 
they wish. E.g. users who do not need offset information can just provide a 
dummy implementation that returns constants...
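For example (a sketch, assuming the interface split Michael proposes below; the 
class name is made up):

  // A do-nothing OffsetAttribute for consumers that never read offsets.
  // Sketch only: assumes the OffsetAttribute interface proposed below and
  // glosses over whatever base class the final design requires.
  public final class DummyOffsetAttribute implements OffsetAttribute {
    public int startOffset() { return 0; }  // constant, offsets unused
    public int endOffset() { return 0; }    // constant, offsets unused
    public void setOffset(int startOffset, int endOffset) { /* ignored */ }
  }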

The only problem with interfaces is the back-compatibility curse :)

But!
An attribute like Offset is a simple enough entity that I do not believe the 
interface would ever need to change. Term is just a char[] with offset/length; 
the same applies there.

Having really simple concepts behind them (and keeping them simple) is what 
makes interfaces possible... I see no danger. But as said, the concepts behind 
them must remain simple.
  

And by the way, I like the new API.  

Cheers, Eks



________________________________
From: Michael Busch <busch...@gmail.com>
To: java-dev@lucene.apache.org
Sent: Tuesday, 28 April, 2009 10:22:45
Subject: Re: new TokenStream api Question

Hi Eks Dev,

I actually started experimenting with changing the new API slightly to overcome 
one drawback: with the variables now distributed over various Attribute classes 
(vs. being in a single class Token previously), cloning a "Token" (i.e. calling 
captureState()) is more expensive. This slows down the CachingTokenFilter and 
Tee/Sink-TokenStreams.
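Roughly, the cost looks like this (a sketch, not the actual Lucene code):

  // Sketch only: capturing state means one clone per registered attribute
  // instance, so several small clones per token where a single Token clone
  // sufficed before.
  public List<AttributeImpl> captureState() {
    List<AttributeImpl> state = new ArrayList<AttributeImpl>();
    for (AttributeImpl att : attributes.values()) {
      state.add((AttributeImpl) att.clone());
    }
    return state;
  }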

So I was thinking about introducing interfaces for each of the Attributes. E.g. 
OffsetAttribute would then be an interface with all current methods, and 
OffsetAttributeImpl would be its implementation. The user would still use the 
API in exactly the same way as now, that is, e.g. by calling 
addAttribute(OffsetAttribute.class), and the code takes care of instantiating 
the right class. However, there would then also be an API to pass in an actual 
instance, and this API would use reflection to find all interfaces that the 
instance implements. All of those interfaces that extend the Attribute 
interface would be added to the AttributeSource map, with the instance as the 
value.
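A rough sketch of what that could look like (only OffsetAttribute and 
OffsetAttributeImpl are names from the proposal; the registration method and 
the map details are illustrative, not the final API):

  public interface OffsetAttribute extends Attribute {
    int startOffset();
    int endOffset();
    void setOffset(int startOffset, int endOffset);
  }

  public class OffsetAttributeImpl implements OffsetAttribute {
    private int start, end;
    public int startOffset() { return start; }
    public int endOffset() { return end; }
    public void setOffset(int startOffset, int endOffset) {
      this.start = startOffset;
      this.end = endOffset;
    }
  }

  // In AttributeSource: register one instance under every Attribute
  // sub-interface it implements (a real version would also walk up the
  // class hierarchy, since getInterfaces() only returns direct interfaces).
  public void addAttributeInstance(Attribute instance) {
    for (Class<?> iface : instance.getClass().getInterfaces()) {
      if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
        attributes.put(iface.asSubclass(Attribute.class), instance);
      }
    }
  }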

Then the Token class would implement all six attribute interfaces. An expert 
user could decide to pass in a Token instance instead of calling 
addAttribute(TermAttribute.class), addAttribute(PayloadAttribute.class), etc. 
Then the attribute source would only contain a single instance that needs to be 
cloned in captureState(), making cloning much faster. And a (probably also 
expert) user could even implement their own class that implements exactly the 
necessary interfaces (maybe only 3 of the 6 provided), making cloning faster 
than it is even with the old Token-based API.
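In usage terms this might look as follows (again a sketch; addAttributeInstance 
is the hypothetical registration method from the sketch above):

  // Expert path: one Token instance backs all six attribute interfaces,
  // so captureState() clones a single object instead of six.
  Token token = new Token();
  addAttributeInstance(token);
  TermAttribute termAtt = getAttribute(TermAttribute.class);       // the Token
  OffsetAttribute offsetAtt = getAttribute(OffsetAttribute.class); // same instance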

And of course, in your case you could also just create a different 
implementation of such an interface, right? I think what's nice about this 
change is that it doesn't make the TokenStream API more complicated to use, and 
the indexing pipeline still uses it the same way, yet it's more extensible for 
expert users and makes it possible to achieve the same or even better cloning 
performance.
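For example, for the prefix-filter case discussed below, such an implementation 
might look like this (a sketch; the class name is made up, and it assumes a 
TermAttributeImpl from the interface split above):

  // Sketch: a specialized TermAttribute implementation adding the
  // startsWith() shortcut eks dev asks about below.
  public class PrefixTermAttributeImpl extends TermAttributeImpl {
    public boolean startsWith(char c) {
      return termLength() > 0 && termBuffer()[0] == c;
    }
  }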

I will open a new Jira issue for this soon. But I'd be happy to hear feedback 
about the proposed changes, and especially whether you think they would help 
your use case.

-Michael

On 4/27/09 1:49 PM, eks dev wrote:

Should I create a patch with something like this? With "Expert" javadoc and an 
explanation of what it is good for, it should be a nice addition to the 
Attribute cases. Practically, it would enable specialization of "hard linked" 
Attributes like TermAttribute.

The only preconditions are:
- the "specialized Attribute" must extend one of the "hard linked" ones, and 
  provide its class
- it must implement a default constructor
- it should extend without introducing state (the big majority of cases), so 
  as not to break captureState()

The last one could be relaxed, I guess, but I am not yet 100% familiar with 
this code.

Use cases for this are along the lines of my example: smaller, easier user 
code and better performance (mainly token filters).

----- Original Message ----
From: Uwe Schindler <u...@thetaphi.de>
To: java-dev@lucene.apache.org
Sent: Sunday, 26 April, 2009 23:03:06
Subject: RE: new TokenStream api Question
There is one problem: if you extend TermAttribute, the class is different 
(and the class is the key in the attributes map). So when you initialize the 
TokenStream and do a

  YourClass termAtt = (YourClass) addAttribute(YourClass.class);

...you create a new attribute. So one possibility would be to also specify the 
instance, and store the attribute by class (as key) but with your instance as 
the value. If you are the first one to create the attribute (if it is a token 
stream and not a filter this is OK; you will be the first, adding the attribute 
in the ctor), everything is fine. Register the attribute yourself (maybe we 
should add a specialized addAttribute that can take an instance as the 
default)?:

  YourClass termAtt = new YourClass();
  attributes.put(TermAttribute.class, termAtt);

In this case, for the indexer it is a standard TermAttribute, but you can do 
more with it.

Replacing TermAttribute with your own class is not possible, as the indexer 
will get a ClassCastException when using the instance retrieved with 
getAttribute(TermAttribute.class).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
-----Original Message-----
From: eks dev [mailto:eks...@yahoo.co.uk]
Sent: Sunday, April 26, 2009 10:39 PM
To: java-dev@lucene.apache.org
Subject: new TokenStream api Question

I am just looking into the new TermAttribute usage and wonder what would be 
the best way to implement a PrefixFilter that filters out Terms that have some 
prefix, something like this, where '-' represents my prefix:

  public final boolean incrementToken() throws IOException {
    // advance to the first token we accept
    while (input.incrementToken()) {
      int len = termAtt.termLength();
      if (len > 0 && termAtt.termBuffer()[0] != '-') // only length > 0 and not prefixed
        return true;
      // note: else we ignore it
    }
    // reached EOS
    return false;
  }

The question would be: can I extend TermAttribute and add boolean 
startsWith(char c)? The point is speed, and my code gets smaller.

TermAttribute has one method, called in termLength() and termBuffer(), that I 
do not understand (back compatibility, I guess):

  public int termLength() {
    initTermBuffer(); // I'd like to avoid it...
    return termLength;
  }

I'd like to get rid of initTermBuffer(). The first option is to *extend* the 
TermAttribute code (but the fields are private, so no help there), or can I 
implement my own MyTermAttribute (will the Indexer know how to deal with it?)? 
Must I extend TermAttribute, or can I add my own?

thanks,
eks