On 8/14/09 9:23 AM, Yonik Seeley wrote:
On Thu, Aug 13, 2009 at 4:32 PM, Michael Busch<busch...@gmail.com> wrote:
On 8/13/09 7:29 AM, Yonik Seeley wrote:
I'm liking the new attribute based analysis (in conjunction with
reusability), but I'm running into some questions...
Is it valid for tokenizers or token filters to add new attributes after
their constructor (after they have processed some tokens)?
At the moment we're saying in the javadocs of TokenStream that all
Attributes should be added up front.
Hmmm, OK... in which case, token producers using restoreState() would
not have to call clearAttributes() first.
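To illustrate the point above, here is a minimal toy model (not Lucene's
AttributeSource; the class FixedAttributeSource and all of its methods are
assumptions made for this sketch). If every attribute is declared up front,
a captured state covers the whole attribute set, so restoring it overwrites
every value and a preceding clearAttributes() changes nothing:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model, NOT Lucene's AttributeSource: all names here are assumptions.
final class FixedAttributeSource {
    private final Map<String, String> attributes = new LinkedHashMap<>();

    // Every attribute is declared in the constructor, none later.
    FixedAttributeSource(String... attributeNames) {
        for (String name : attributeNames) {
            attributes.put(name, "");
        }
    }

    void set(String name, String value) {
        if (!attributes.containsKey(name)) {
            throw new IllegalStateException("attribute not declared up front: " + name);
        }
        attributes.put(name, value);
    }

    String get(String name) {
        return attributes.get(name);
    }

    // Resets every declared attribute to its empty default.
    void clearAttributes() {
        attributes.replaceAll((name, value) -> "");
    }

    // The snapshot contains an entry for every declared attribute.
    Map<String, String> captureState() {
        return new LinkedHashMap<>(attributes);
    }

    // Overwrites every declared attribute, so no stale value can survive.
    void restoreState(Map<String, String> state) {
        attributes.putAll(state);
    }
}
```

In this model, restoring a state with or without a prior clearAttributes()
gives the same result, because the snapshot and the source share one fixed
set of attribute names.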
We could change these semantics. I had some thoughts about it in the
original JIRA issue (LUCENE-1422).
Apologies if I'm rehashing anything - it's hard to keep up with some
of those monster (high volume) issues.
So back to your question of whether we should allow restoreState() to add
attributes and use a state across different AttributeSources: the
complication is that we can only allow that if the different
AttributeSources were filled using the same AttributeFactory. Otherwise
different AttributeImpls could be in the sources and the copying wouldn't
work anymore.
Hmmm, so perhaps just an assertion that the factories are equal... and
documentation saying that moving state from one stream to the other
requires identical factories? Anyway, I don't currently have a use
case for this... I was just wondering.
Yes that should work. We basically have such an assertion in
TeeSinkTokenFilter:
public void addSinkTokenStream(final SinkTokenStream sink) {
  // check that sink has correct factory
  if (!this.getAttributeFactory().equals(sink.getAttributeFactory())) {
    throw new IllegalArgumentException("The supplied sink is not compatible to this tee");
  }
So I agree, we should just do the same in restoreState().
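A minimal sketch of what such a check in restoreState() might look like,
using toy stand-ins rather than the real Lucene classes. ToyFactory,
ToyState and ToySource are hypothetical names, and storing the factory
inside the captured state is an assumption of this sketch, not something
the current State class is known to do:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-ins, NOT the Lucene classes: names and structure are assumptions.
final class ToyFactory {
    private final String name;

    ToyFactory(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof ToyFactory && ((ToyFactory) other).name.equals(name);
    }

    @Override
    public int hashCode() {
        return name.hashCode();
    }
}

final class ToyState {
    final ToyFactory factory;           // factory of the source that captured the state
    final Map<String, String> values;   // attribute name -> captured value

    ToyState(ToyFactory factory, Map<String, String> values) {
        this.factory = factory;
        this.values = new LinkedHashMap<>(values);
    }
}

final class ToySource {
    private final ToyFactory factory;
    private final Map<String, String> attributes = new LinkedHashMap<>();

    ToySource(ToyFactory factory) {
        this.factory = factory;
    }

    void set(String attribute, String value) {
        attributes.put(attribute, value);
    }

    String get(String attribute) {
        return attributes.get(attribute);
    }

    ToyState captureState() {
        return new ToyState(factory, attributes);
    }

    void restoreState(ToyState state) {
        // Same idea as the tee/sink check: refuse states from a different factory.
        if (!factory.equals(state.factory)) {
            throw new IllegalArgumentException(
                "The supplied state is not compatible with this source");
        }
        // Copies values and adds any attribute this source does not have yet.
        attributes.putAll(state.values);
    }
}
```

With equal factories the state moves from one source to another and may add
attributes on the way; with different factories the call fails early
instead of copying between incompatible implementations.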
Another thing I was wondering about was the opacity of State - one
can't inspect or change the attributes w/o restoring it first.
Undesirable limitation, or feature allowing more flexible state
implementations?
Excellent point! This limitation is currently there to discourage
changing the values of a state, because that would be rather inefficient:
you'd have to look up the attribute(s) of each state you want to change.
We could write a StateContainer, which has an API to access states in an
efficient way (iterator, random access), using delegation.
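One possible reading of that idea, as a rough sketch only: no StateContainer
exists in the thread beyond its name, so the API below, the SketchState
class, and the choice to delegate to a backing list are all assumptions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: only the name "StateContainer" comes from the thread.
final class SketchState {
    private final Map<String, String> values = new LinkedHashMap<>();

    String get(String attribute) {
        return values.get(attribute);
    }

    void set(String attribute, String value) {
        values.put(attribute, value);
    }
}

final class StateContainer implements Iterable<SketchState> {
    private final List<SketchState> states = new ArrayList<>();

    void add(SketchState state) {
        states.add(state);
    }

    int size() {
        return states.size();
    }

    // Random access, delegated to the backing list.
    SketchState get(int index) {
        return states.get(index);
    }

    // Iteration, delegated to the backing list.
    @Override
    public Iterator<SketchState> iterator() {
        return states.iterator();
    }
}
```

A caller could then inspect or change a cached state in place, for example
container.get(0).set("term", "x"), without restoring it into a stream first.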
When I changed the contrib TokenStreams this limitation was somewhat
annoying for some streams, but in all cases it was possible to implement
the streams far more efficiently by avoiding excessive caching (except
ShingleMatrixFilter; I gave up eventually, not knowing that code at all).
So I agree we should come up with a good API here for convenience, but
mention in the javadocs that it should only be used carefully.
-Yonik
http://www.lucidimagination.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org