On 8/13/09 7:29 AM, Yonik Seeley wrote:
I'm liking the new attribute based analysis (in conjunction with
reusability), but I'm running into some questions...

Is it valid for tokenizers or token filters to add new attributes after
their constructor (after they have processed some tokens)?


At the moment we're saying in the javadocs of TokenStream that all Attributes should be added up front. We could change these semantics. I had some thoughts about it in the original JIRA issue (LUCENE-1422).

Should restoreState() be able to add attributes (it currently throws
an exception)?  If not, does that mean that it's not supported/advised
to use state across different TokenStreams?


See answer below.

We've previously seen that the native java clone() can be much slower
than implementing it ourselves in Java.  Should we have our own
clone() method on Attribute?  Or just implement clone() ourselves and
require that subclasses override if needed?  This is inner-loop
per-token stuff, and a single captureState() will invoke many clone
operations (6 attributes make up the legacy Token object).


Improving the cloning performance was actually the main reason for LUCENE-1693. It separates the Attribute interfaces from the actual implementation, and, as you probably know, Token now implements all token attributes. So in a TokenStream chain that does cloning (e.g. with a TeeSinkTokenFilter or CachingTokenFilter), one could use a different AttributeFactory to get much better performance.

The AttributeSource internally builds a simple linked list (State), which captureState() then clones by calling the clone() method of each AttributeImpl. Using the linked list approach performed best for me. We could change the implementations of the clone() methods of the AttributeImpls, or even add our own clone method, if that would improve performance.
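
A minimal sketch of the linked-list capture described above may help make the per-token cost concrete. All names here (CaptureStateSketch, AttrImpl, TermImpl, PosIncImpl, copy()) are hypothetical stand-ins chosen for illustration, not Lucene's actual classes, and the hand-written copy() is just one possible alternative to Object.clone():

```java
// Hypothetical, simplified model of the linked-list capture; not Lucene's real classes.
class CaptureStateSketch {

    /** Stand-in for an attribute implementation that knows how to copy itself. */
    interface AttrImpl {
        AttrImpl copy(); // hand-written copy instead of Object.clone()
    }

    static final class TermImpl implements AttrImpl {
        String term = "";
        public AttrImpl copy() {
            TermImpl c = new TermImpl();
            c.term = term;
            return c;
        }
    }

    static final class PosIncImpl implements AttrImpl {
        int positionIncrement = 1;
        public AttrImpl copy() {
            PosIncImpl c = new PosIncImpl();
            c.positionIncrement = positionIncrement;
            return c;
        }
    }

    /** One node per attribute implementation held by the source. */
    static final class State {
        AttrImpl attribute;
        State next;
    }

    /** Walks the live list once and returns an independent copy of every node. */
    static State captureState(State head) {
        if (head == null) {
            return null;
        }
        State copyHead = new State();
        State src = head;
        State dst = copyHead;
        while (true) {
            dst.attribute = src.attribute.copy(); // one copy per attribute implementation
            src = src.next;
            if (src == null) {
                break;
            }
            dst.next = new State();
            dst = dst.next;
        }
        return copyHead;
    }

    public static void main(String[] args) {
        TermImpl term = new TermImpl();
        term.term = "foo";
        State head = new State();
        head.attribute = term;
        head.next = new State();
        head.next.attribute = new PosIncImpl();

        State snapshot = captureState(head);
        term.term = "bar"; // later tokens overwrite the live attribute

        System.out.println(((TermImpl) snapshot.attribute).term); // prints "foo"
    }
}
```

In this model the cost of a capture grows with the number of separate implementation objects in the list, which is why backing several attributes with one object reduces the number of copies.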

The nice thing about LUCENE-1693 is that if cloning performance is really crucial for your use case, you can simply implement a class that only implements the token attributes you need. Often term, positionIncrement and offset are enough. Then the object to be cloned is smaller.
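
As a rough illustration of that idea, the sketch below backs three attributes with a single small object. The interfaces TermAttr, PosIncAttr and OffsetAttr and the class CompactTokenImpl are hypothetical stand-ins invented for this example; they are not Lucene's attribute interfaces, and wiring such a class in through an AttributeFactory is not shown:

```java
// Hypothetical stand-ins for the attribute interfaces; these are NOT Lucene's own types.
interface TermAttr {
    String term();
    void setTerm(String term);
}

interface PosIncAttr {
    int positionIncrement();
    void setPositionIncrement(int increment);
}

interface OffsetAttr {
    int startOffset();
    int endOffset();
    void setOffset(int start, int end);
}

/** One small object backing exactly the three attributes this chain needs. */
final class CompactTokenImpl implements TermAttr, PosIncAttr, OffsetAttr {
    private String term = "";
    private int positionIncrement = 1;
    private int startOffset;
    private int endOffset;

    public String term() { return term; }
    public void setTerm(String term) { this.term = term; }

    public int positionIncrement() { return positionIncrement; }
    public void setPositionIncrement(int increment) { this.positionIncrement = increment; }

    public int startOffset() { return startOffset; }
    public int endOffset() { return endOffset; }
    public void setOffset(int start, int end) {
        this.startOffset = start;
        this.endOffset = end;
    }

    /** Hand-written copy: one allocation and four field assignments. */
    CompactTokenImpl copy() {
        CompactTokenImpl c = new CompactTokenImpl();
        c.term = term;
        c.positionIncrement = positionIncrement;
        c.startOffset = startOffset;
        c.endOffset = endOffset;
        return c;
    }
}

class CompactTokenDemo {
    public static void main(String[] args) {
        CompactTokenImpl live = new CompactTokenImpl();
        live.setTerm("lucene");
        live.setOffset(0, 6);

        CompactTokenImpl saved = live.copy();
        live.setTerm("solr"); // the saved copy is unaffected

        System.out.println(saved.term() + " " + saved.startOffset() + "-" + saved.endOffset());
        // prints "lucene 0-6"
    }
}
```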

Ideally it'd be cool if we could synthesize a class automatically during runtime that implements all Attribute interfaces in use, but I think with Java you can only do that if you add a special jar from the JDK to the classpath.

So back to your question of whether we should allow restoreState() to add attributes and use a state across different AttributeSources: the complication is that we can only allow that if the different AttributeSources were filled using the same AttributeFactory; otherwise different AttributeImpls could be in the sources and the copying wouldn't work anymore.
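
To see why mismatched implementations are a problem, consider the simplified model below. It assumes each implementation copies its values into a target of its own concrete class; CopyableImpl, SimpleTermImpl, BufferTermImpl and RestoreSketch are hypothetical names for illustration only, and the explicit class check is one possible way to fail early rather than a description of what Lucene does:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the names below are illustrative, not Lucene's API.
abstract class CopyableImpl {
    /** Copies this implementation's values into target. */
    abstract void copyTo(CopyableImpl target);
}

final class SimpleTermImpl extends CopyableImpl {
    String term = "";

    @Override
    void copyTo(CopyableImpl target) {
        // Only works when the target is the same concrete class.
        ((SimpleTermImpl) target).term = term;
    }
}

/** A different implementation of the same logical attribute, as another factory might produce. */
final class BufferTermImpl extends CopyableImpl {
    char[] buffer = new char[0];

    @Override
    void copyTo(CopyableImpl target) {
        ((BufferTermImpl) target).buffer = buffer.clone();
    }
}

class RestoreSketch {
    /** Restores captured values into a live source, refusing mismatched implementations. */
    static void restore(List<CopyableImpl> captured, List<CopyableImpl> live) {
        if (captured.size() != live.size()) {
            throw new IllegalArgumentException("state has a different number of attributes");
        }
        // Validate everything first so a failed restore leaves the live source untouched.
        for (int i = 0; i < captured.size(); i++) {
            if (captured.get(i).getClass() != live.get(i).getClass()) {
                throw new IllegalArgumentException("implementation mismatch at position " + i);
            }
        }
        for (int i = 0; i < captured.size(); i++) {
            captured.get(i).copyTo(live.get(i));
        }
    }

    public static void main(String[] args) {
        SimpleTermImpl captured = new SimpleTermImpl();
        captured.term = "foo";

        SimpleTermImpl sameKind = new SimpleTermImpl();
        restore(Arrays.<CopyableImpl>asList(captured), Arrays.<CopyableImpl>asList(sameKind));
        System.out.println(sameKind.term); // prints "foo"

        try {
            restore(Arrays.<CopyableImpl>asList(captured),
                    Arrays.<CopyableImpl>asList(new BufferTermImpl()));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Two sources built by the same factory would hold the same concrete classes in this model, so the check passes and the copy is a plain field assignment.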

I haven't found a good (efficient) way of doing the cloning/copying per Attribute interface yet, which is why I did it this way. I'll try to think about it a bit more... maybe you have an idea?!

 Michael
-Yonik
http://www.lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



