On 8/13/09 7:29 AM, Yonik Seeley wrote:
I'm liking the new attribute based analysis (in conjunction with
reusability), but I'm running into some questions...
Is it valid for tokenizers or token filters to add new attributes after
their constructor (after they have processed some tokens)?
At the moment we're saying in the javadocs of TokenStream that all
Attributes should be added up front. We could change these semantics.
I had some thoughts about it in the original JIRA issue (LUCENE-1422).
Should restoreState() be able to add attributes (it currently throws
an exception)? If not, does that mean that it's not supported/advised
to use state across different TokenStreams?
See answer below.
We've previously seen that the native java clone() can be much slower
than implementing it ourselves in Java. Should we have our own
clone() method on Attribute? Or just implement clone() ourselves and
require that subclasses override if needed? This is inner-loop
per-token stuff, and a single captureState() will invoke many clone
operations (6 attributes make up the legacy Token object).
Improving the cloning performance was actually the main reason for
LUCENE-1693. It separates the Attribute interfaces from the actual
implementation, and as you probably know, Token now implements all
token attributes. So in a TokenStream chain which does cloning (e.g.
with a TeeSinkTokenFilter or CachingTokenFilter) one could use a
different AttributeFactory to get much better performance.
The AttributeSource internally builds a simple linked list (State),
which captureState() then clones by calling the clone() method of each
AttributeImpl. Using the linked list approach performed best for me.
We could change the implementations of the clone() methods of the
AttributeImpls, or even add our own clone method, if that would improve
performance.
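
A minimal sketch of the linked-list capture described above. It uses
simplified stand-in classes, not Lucene's own AttributeSource and
AttributeImpl; every name with a Sketch prefix is hypothetical, and the
restore check is only an assumption about how a mismatch could be
detected:

```java
// Simplified stand-ins for illustration only; NOT Lucene's actual classes.
abstract class SketchAttributeImpl implements Cloneable {
    // Copies this instance's values into another instance of the same class.
    abstract void copyTo(SketchAttributeImpl target);

    @Override
    public SketchAttributeImpl clone() {
        try {
            return (SketchAttributeImpl) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen, we are Cloneable
        }
    }
}

final class SketchTermImpl extends SketchAttributeImpl {
    String term = "";
    @Override void copyTo(SketchAttributeImpl target) {
        ((SketchTermImpl) target).term = term;
    }
}

final class SketchPosIncImpl extends SketchAttributeImpl {
    int positionIncrement = 1;
    @Override void copyTo(SketchAttributeImpl target) {
        ((SketchPosIncImpl) target).positionIncrement = positionIncrement;
    }
}

// One node of the singly linked list that makes up a captured state.
final class SketchState {
    SketchAttributeImpl attribute;
    SketchState next;
}

final class SketchAttributeSource {
    private SketchState head;
    private SketchState tail;

    // Registers an attribute instance; returns it for convenient assignment.
    <T extends SketchAttributeImpl> T add(T impl) {
        SketchState node = new SketchState();
        node.attribute = impl;
        if (head == null) head = node; else tail.next = node;
        tail = node;
        return impl;
    }

    // Walks the list once and clones every attribute instance.
    SketchState captureState() {
        SketchState copyHead = null;
        SketchState copyTail = null;
        for (SketchState s = head; s != null; s = s.next) {
            SketchState node = new SketchState();
            node.attribute = s.attribute.clone();
            if (copyHead == null) copyHead = node; else copyTail.next = node;
            copyTail = node;
        }
        return copyHead;
    }

    // Copies captured values back; fails if the state has a different layout.
    void restoreState(SketchState state) {
        SketchState live = head;
        for (SketchState s = state; s != null; s = s.next, live = live.next) {
            if (live == null || live.attribute.getClass() != s.attribute.getClass()) {
                throw new IllegalArgumentException("State does not match this source");
            }
            s.attribute.copyTo(live.attribute);
        }
    }
}
```

In this sketch the cost of one captureState() call is one clone() per
registered attribute instance, which is why the number and size of the
instances matters for per-token work.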
The nice thing about LUCENE-1693 is that if cloning performance is
really crucial for your use case, you can simply implement a class that
implements only the token attributes you need. Often term,
positionIncrement and offset are enough. Then the object to be cloned
is smaller.
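
As a rough illustration of that idea, here is a sketch of one class that
carries only term, positionIncrement and offset, with a hand-written
copy method. The Mini* interfaces and CompactTokenImpl are hypothetical
stand-ins, not Lucene's attribute interfaces or its AttributeImpl base
class:

```java
import java.util.Arrays;

// Hypothetical, cut-down attribute interfaces for illustration only.
interface MiniTermAttribute {
    void setTerm(String term);
    String term();
}

interface MiniPositionIncrementAttribute {
    void setPositionIncrement(int increment);
    int getPositionIncrement();
}

interface MiniOffsetAttribute {
    void setOffset(int start, int end);
    int startOffset();
    int endOffset();
}

// One object implements all three interfaces, so capturing a token
// copies a single small instance instead of several separate ones.
final class CompactTokenImpl
        implements MiniTermAttribute, MiniPositionIncrementAttribute, MiniOffsetAttribute {
    private char[] buffer;
    private int length;
    private int positionIncrement = 1;
    private int startOffset;
    private int endOffset;

    CompactTokenImpl() {
        this(new char[16]);
    }

    private CompactTokenImpl(char[] buffer) {
        this.buffer = buffer;
    }

    @Override public void setTerm(String term) {
        if (term.length() > buffer.length) {
            buffer = new char[term.length()]; // grow only when needed
        }
        term.getChars(0, term.length(), buffer, 0);
        length = term.length();
    }

    @Override public String term() {
        return new String(buffer, 0, length);
    }

    @Override public void setPositionIncrement(int increment) {
        positionIncrement = increment;
    }

    @Override public int getPositionIncrement() {
        return positionIncrement;
    }

    @Override public void setOffset(int start, int end) {
        startOffset = start;
        endOffset = end;
    }

    @Override public int startOffset() { return startOffset; }

    @Override public int endOffset() { return endOffset; }

    // Hand-written copy: duplicates only the used part of the term buffer.
    CompactTokenImpl copy() {
        CompactTokenImpl c = new CompactTokenImpl(Arrays.copyOf(buffer, length));
        c.length = length;
        c.positionIncrement = positionIncrement;
        c.startOffset = startOffset;
        c.endOffset = endOffset;
        return c;
    }
}
```

Whether a hand-written copy like this actually beats clone() would have
to be measured; the sketch only shows that the copied object holds
nothing beyond the three attributes in use.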
Ideally it'd be cool if we could automatically synthesize a class at
runtime that implements all Attribute interfaces in use, but I think
with Java you can only do that if you add a special jar from the JDK
to the classpath.
So back to your question whether we should allow restoreState() to add
attributes and use a state across different AttributeSources: the
complication is that we can only allow that if the different
AttributeSources were filled using the same AttributeFactory.
Otherwise different AttributeImpls could be in the sources and the
copying wouldn't work anymore.
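
The problem can be sketched with two hypothetical implementation classes
behind the same attribute interface. Nothing here is Lucene API: the
Demo* names, the Supplier standing in for an AttributeFactory, and the
isCompatibleWith() check are all assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical attribute interface shared by both implementations.
interface DemoTermAttribute {
    String term();
    void setTerm(String term);
}

abstract class DemoImpl {
    // Copying is done per implementation class, so the target must match.
    abstract void copyTo(DemoImpl target);
}

// What a factory creating one instance per attribute might produce.
final class StandaloneTermImpl extends DemoImpl implements DemoTermAttribute {
    private String term = "";
    @Override public String term() { return term; }
    @Override public void setTerm(String term) { this.term = term; }
    @Override void copyTo(DemoImpl target) {
        ((StandaloneTermImpl) target).term = term;
    }
}

// What a factory backing every attribute with one Token-like object might produce.
final class TokenLikeImpl extends DemoImpl implements DemoTermAttribute {
    private String term = "";
    private int positionIncrement = 1;
    @Override public String term() { return term; }
    @Override public void setTerm(String term) { this.term = term; }
    @Override void copyTo(DemoImpl target) {
        TokenLikeImpl t = (TokenLikeImpl) target;
        t.term = term;
        t.positionIncrement = positionIncrement;
    }
}

final class DemoSource {
    final List<DemoImpl> impls = new ArrayList<>();

    // The supplier plays the role of the factory that picks the impl class.
    DemoSource(Supplier<? extends DemoImpl> factory) {
        impls.add(factory.get());
    }

    // Two sources can exchange state only if their impl classes line up.
    boolean isCompatibleWith(DemoSource other) {
        if (impls.size() != other.impls.size()) return false;
        for (int i = 0; i < impls.size(); i++) {
            if (impls.get(i).getClass() != other.impls.get(i).getClass()) return false;
        }
        return true;
    }

    void copyStateTo(DemoSource other) {
        if (!isCompatibleWith(other)) {
            throw new IllegalArgumentException("Sources were built by different factories");
        }
        for (int i = 0; i < impls.size(); i++) {
            impls.get(i).copyTo(other.impls.get(i));
        }
    }
}
```

Both sources expose the same DemoTermAttribute interface, yet copying
between a StandaloneTermImpl source and a TokenLikeImpl source has to be
rejected, because copyTo() works per implementation class rather than
per interface.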
I haven't found a good (efficient) way of doing the cloning/copying per
Attribute interface yet, which is why I did it this way. I'll try to
think about it a bit more... maybe you have an idea?!
Michael
-Yonik
http://www.lucidimagination.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org