Re: attribute thoughts

Michael Busch Thu, 13 Aug 2009 13:31:01 -0700

On 8/13/09 7:29 AM, Yonik Seeley wrote:

I'm liking the new attribute based analysis (in conjunction with
reusability), but I'm running into some questions...


Is it valid for tokenizers or token filters to add new attributes after
their constructor (after they have processed some tokens)?

At the moment we're saying in the javadocs of TokenStream that all Attributes should be added up front. We could change these semantics. I had some thoughts about it in the original JIRA issue (LUCENE-1422).

Should restoreState() be able to add attributes (it currently throws
an exception)?  If not, does that mean that it's not supported/advised
to use state across different TokenStreams?


See answer below.

We've previously seen that the native java clone() can be much slower
than implementing it ourselves in Java.  Should we have our own
clone() method on Attribute?  Or just implement clone() ourselves and
require that subclasses override if needed?  This is inner-loop
per-token stuff, and a single captureState() will invoke many clone
operations (6 attributes make up the legacy Token object).

Improving the cloning performance was actually the main reason for LUCENE-1693. It separates the Attribute interfaces from the actual implementation, and as you probably know Token now implements all token attributes. So in a TokenStream chain which does cloning (e.g. with a TeeSinkTokenFilter or CachingTokenFilter) one could use a different AttributeFactory to get much better performance.

The AttributeSource internally builds a simple linked list (State), which captureState() then clones by calling the clone() method of the AttributeImpls. Using the linked list approach performed best for me. We could change the implementations of the clone() methods of the AttributeImpls or even add our own clone method if performance would improve.
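To make the linked-list idea concrete, here is a minimal sketch of capturing and restoring state by walking a list of attribute nodes. All class and method names here (SketchAttribute, StateNode, SketchSource, copy(), copyTo()) are hypothetical stand-ins for illustration, not the actual AttributeSource or AttributeImpl code, and the hand-written copy() is only an assumption about what "our own clone method" might look like.

```java
// Hypothetical stand-in for an attribute implementation (not Lucene's AttributeImpl).
abstract class SketchAttribute {
    // Hand-written copy, used instead of Object.clone() in the per-token loop.
    abstract SketchAttribute copy();
    // Copies this attribute's values into a target of the same concrete class.
    abstract void copyTo(SketchAttribute target);
}

final class TermSketch extends SketchAttribute {
    String term = "";
    @Override SketchAttribute copy() {
        TermSketch t = new TermSketch();
        t.term = term;
        return t;
    }
    @Override void copyTo(SketchAttribute target) {
        ((TermSketch) target).term = term;
    }
}

final class OffsetSketch extends SketchAttribute {
    int start, end;
    @Override SketchAttribute copy() {
        OffsetSketch o = new OffsetSketch();
        o.start = start;
        o.end = end;
        return o;
    }
    @Override void copyTo(SketchAttribute target) {
        OffsetSketch o = (OffsetSketch) target;
        o.start = start;
        o.end = end;
    }
}

// One node per attribute instance; the chain of nodes is the captured "state".
final class StateNode {
    final SketchAttribute attribute;
    StateNode next;
    StateNode(SketchAttribute attribute) { this.attribute = attribute; }
}

final class SketchSource {
    private StateNode head, tail;

    // Registers an attribute by appending it to the live list.
    <T extends SketchAttribute> T add(T attribute) {
        StateNode node = new StateNode(attribute);
        if (head == null) head = node; else tail.next = node;
        tail = node;
        return attribute;
    }

    // Walks the live list once and copies every attribute into a new list.
    StateNode captureState() {
        StateNode copyHead = null, copyTail = null;
        for (StateNode n = head; n != null; n = n.next) {
            StateNode c = new StateNode(n.attribute.copy());
            if (copyHead == null) copyHead = c; else copyTail.next = c;
            copyTail = c;
        }
        return copyHead;
    }

    // Copies captured values back; rejects a state whose attributes do not line up.
    void restoreState(StateNode state) {
        StateNode live = head;
        for (StateNode s = state; s != null; s = s.next, live = live.next) {
            if (live == null || live.attribute.getClass() != s.attribute.getClass()) {
                throw new IllegalArgumentException("state does not match this source");
            }
            s.attribute.copyTo(live.attribute);
        }
    }
}
```

In this sketch a capture costs one node allocation plus one copy() call per attribute, so fewer or smaller attribute objects translate directly into cheaper captures.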

The nice thing about LUCENE-1693 is that if cloning performance is really crucial for your use case you can simply implement a class that only implements the token attributes you need. Often term, positionIncrement and offset are enough. Then the object to be cloned is smaller.
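As a rough illustration of such a reduced class, the sketch below implements only term, position increment and offset in a single object with a hand-written copy. The interfaces TermPart, PosIncPart and OffsetPart and the class CompactToken are hypothetical names invented for this example; they only mimic the shape of the real attribute interfaces.

```java
// Hypothetical attribute interfaces, standing in for the real ones.
interface TermPart {
    String term();
    void setTerm(String term);
}

interface PosIncPart {
    int positionIncrement();
    void setPositionIncrement(int increment);
}

interface OffsetPart {
    int startOffset();
    int endOffset();
    void setOffset(int start, int end);
}

// One small object backing all three interfaces, so a capture copies a single instance.
final class CompactToken implements TermPart, PosIncPart, OffsetPart {
    private String term = "";
    private int positionIncrement = 1;
    private int startOffset;
    private int endOffset;

    @Override public String term() { return term; }
    @Override public void setTerm(String term) { this.term = term; }

    @Override public int positionIncrement() { return positionIncrement; }
    @Override public void setPositionIncrement(int increment) { this.positionIncrement = increment; }

    @Override public int startOffset() { return startOffset; }
    @Override public int endOffset() { return endOffset; }
    @Override public void setOffset(int start, int end) {
        this.startOffset = start;
        this.endOffset = end;
    }

    // Hand-written field-by-field copy; four fields and no reflection.
    CompactToken copy() {
        CompactToken c = new CompactToken();
        c.term = term;
        c.positionIncrement = positionIncrement;
        c.startOffset = startOffset;
        c.endOffset = endOffset;
        return c;
    }
}
```

Because every interface is backed by the same instance, a chain that only needs these three attributes would copy one object per captured token in this sketch.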

Ideally it'd be cool if we could synthesize a class automatically during runtime that implements all Attribute interfaces in use, but I think with Java you can only do that if you add a special jar from the JDK to the classpath.

So back to your question if we should allow restoreState() to add attributes and use a state across different AttributeSources: the complication is that we can only allow that if the different AttributeSources were filled using the same AttributeFactory, otherwise different AttributeImpls could be in the sources and the copying wouldn't work anymore.
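The mismatch can be shown with a small sketch in which the same interface is backed by two different implementation classes, as could happen when two sources are filled by different factories. Everything here (TermView, PlainTerm, FatToken, CrossSourceCopy) is a hypothetical illustration, not Lucene code; the interface-level copy at the end is just one conceivable workaround, which pays one accessor round trip per interface.

```java
// Hypothetical attribute interface shared by both implementations.
interface TermView {
    String term();
    void setTerm(String term);
}

// Implementation a first factory might hand out: a dedicated class for the term.
final class PlainTerm implements TermView {
    private String term = "";
    @Override public String term() { return term; }
    @Override public void setTerm(String term) { this.term = term; }
}

// Implementation a second factory might hand out: a Token-like class with extra fields.
final class FatToken implements TermView {
    private String term = "";
    int startOffset, endOffset;
    @Override public String term() { return term; }
    @Override public void setTerm(String term) { this.term = term; }
}

final class CrossSourceCopy {
    // An implementation-level copy is only safe when both sides use the same class.
    static boolean sameImplementation(TermView from, TermView to) {
        return from.getClass() == to.getClass();
    }

    // Interface-level copy: works across implementations, one call pair per interface.
    static void copyViaInterface(TermView from, TermView to) {
        to.setTerm(from.term());
    }
}
```

Here `sameImplementation(new PlainTerm(), new FatToken())` is false, which is the situation in which a state captured from one source could not simply be copied object-for-object into the other.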

I didn't find a good (efficient) way of doing the cloning/copying per Attribute interface yet, which is why I did it this way. I'll try to think about it a bit more... maybe you have an idea?!


 Michael

-Yonik
http://www.lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org