You could misuse the attributes API:
All filters in a chain share the same attributes. This follows from the
chaining itself: new TokenFilter(otherTokenStream) shares the attributes of
the stream it wraps. To get non-linear chaining, you could do the following:
Create the "helpers" outside the chain by linking them to the input
TokenStream, but never call incrementToken() on them. Internally they will
always see the same attributes and attribute contents, so you could call
accept() on them, if it were not protected. The stream is controlled by our
own TokenFilter, so we call incrementToken() only on ours; we just misuse the
helpers' accept() method, which operates on the attributes already populated
by our own call to incrementToken():
    stopwordMarkFilter = new TokenFilter(....) {
      // MarkerAttribute is a placeholder name for your own marker attribute type
      private final MarkerAttribute markerAtt = addAttribute(...);
      // helpers: linked to the same input, but never part of the chain
      private final FilteringTokenFilter japanesePOS =
          new JapanesePartOfSpeechStopFilter(true, input, stoptags);
      private final FilteringTokenFilter stopfilter =
          new StopFilter(matchVersion, input, stopwords);

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        // does not compile as-is: accept() is protected
        if (!japanesePOS.accept() || !stopfilter.accept()) {
          // mark the current token as a stopword.
          markerAtt.setIsStopword(true);
        }
        return true;
      }
    };
The only problem: since accept() is not intended to be called from outside,
it is of course protected...
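
To make the shared-attribute idea concrete without depending on Lucene, here
is a small toy model in plain Java. None of these classes are Lucene classes:
SharedAttributes, ToyStream, ToyFilteringFilter, ToyStopFilter,
ToyPosStopFilter, StopwordMarkFilter and MarkDemo are all made-up names, and
accept() is deliberately left package-visible here, which is exactly what the
real filters do not allow. It only shows why helpers that are never advanced
still see the current token:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Toy model only: none of these classes are Lucene classes.

/** One mutable state object shared by every stream in a chain. */
class SharedAttributes {
    String term;       // stands in for the term text attribute
    String pos;        // stands in for a part-of-speech attribute
    boolean stopword;  // stands in for the hypothetical marker attribute
}

abstract class ToyStream {
    final SharedAttributes atts;
    ToyStream(SharedAttributes atts) { this.atts = atts; }
    abstract boolean incrementToken();
}

/** Source of tokens; each token is a {term, pos} pair. */
class ListSource extends ToyStream {
    private final Iterator<String[]> it;
    ListSource(List<String[]> tokens) {
        super(new SharedAttributes());
        this.it = tokens.iterator();
    }
    @Override boolean incrementToken() {
        if (!it.hasNext()) return false;
        String[] t = it.next();
        atts.term = t[0];
        atts.pos = t[1];
        atts.stopword = false;
        return true;
    }
}

/** Wrapping a stream reuses its attributes instead of creating new ones. */
abstract class ToyFilteringFilter extends ToyStream {
    protected final ToyStream input;
    ToyFilteringFilter(ToyStream input) { super(input.atts); this.input = input; }
    /** Package-visible in this toy; an assumption that does not hold for the real filters. */
    abstract boolean accept();
    @Override boolean incrementToken() {
        while (input.incrementToken()) {
            if (accept()) return true;
        }
        return false;
    }
}

class ToyStopFilter extends ToyFilteringFilter {
    private final Set<String> stopwords;
    ToyStopFilter(ToyStream input, Set<String> stopwords) { super(input); this.stopwords = stopwords; }
    @Override boolean accept() { return !stopwords.contains(atts.term); }
}

class ToyPosStopFilter extends ToyFilteringFilter {
    private final Set<String> stoptags;
    ToyPosStopFilter(ToyStream input, Set<String> stoptags) { super(input); this.stoptags = stoptags; }
    @Override boolean accept() { return !stoptags.contains(atts.pos); }
}

/** Marks tokens instead of dropping them; the helpers are never advanced. */
class StopwordMarkFilter extends ToyStream {
    private final ToyStream input;
    private final ToyFilteringFilter posHelper;
    private final ToyFilteringFilter stopHelper;
    StopwordMarkFilter(ToyStream input, Set<String> stoptags, Set<String> stopwords) {
        super(input.atts);
        this.input = input;
        this.posHelper = new ToyPosStopFilter(input, stoptags);
        this.stopHelper = new ToyStopFilter(input, stopwords);
    }
    @Override boolean incrementToken() {
        if (!input.incrementToken()) return false;
        // The helpers read the attributes our own incrementToken() call just populated.
        atts.stopword = !posHelper.accept() || !stopHelper.accept();
        return true;
    }
}

class MarkDemo {
    /** Returns each term, with a trailing "*" if it was marked as a stopword. */
    static List<String> run() {
        List<String[]> tokens = Arrays.asList(
            new String[] {"the", "DET"}, new String[] {"cat", "NOUN"},
            new String[] {"of", "PREP"}, new String[] {"tea", "NOUN"});
        ToyStream stream = new StopwordMarkFilter(new ListSource(tokens),
            new HashSet<>(Arrays.asList("PREP")), new HashSet<>(Arrays.asList("the")));
        List<String> out = new ArrayList<>();
        while (stream.incrementToken()) {
            out.add(stream.atts.term + (stream.atts.stopword ? "*" : ""));
        }
        return out;
    }
}
```

Calling MarkDemo.run() keeps all four tokens in the stream and marks only
"the" (stop word list) and "of" (part-of-speech tag). With the real filters
the same structure fails to compile because accept() is protected.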
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> Dawid Weiss
> Sent: Wednesday, August 22, 2012 8:51 AM
> To: [email protected]
> Subject: Re: Looking for a code pattern to pass stop words as an attribute
>
> Thanks for replies Steve, Uwe.
>
> > if you don't want to create your own "marker filter", you can use
> > KeywordMarkerFilter (http://goo.gl/OOgf4) instead
>
> This is pretty much what I had come up with, although I used a custom filter
> class (with a similar attribute). The thing I have trouble with, however, is
> that stop words may be based not only on term images but also on other
> attributes. In particular, the Japanese pipeline uses _two_ term suppression
> classes:
>
> stream = new JapanesePartOfSpeechStopFilter(true, stream, stoptags);
> ...
> stream = new StopFilter(matchVersion, stream, stopwords);
>
> Of course I can just copy/paste the source of these and build my own keyword
> marker, this is clear to me. But I'd rather build a filter that delegates to
> these original classes and aggregates their output so that I don't have to
> rebuild things on every upgrade and this is where I'm kind of stuck.
> Something like:
>
> if (!japanesePOS.accept() || !stopfilter.accept()) {
> // mark the current token as a stopword.
> }
>
> I'm just not sure if I can create such a non-linear filter pipeline
> -- if this isn't going to confuse the attribute management code? Note that
> the above filters (japanesePOS, blah) would _not_ be part of the token
> stream, they would be attached to one of the filters. Don't know if I'm
> clear.
>
> Dawid
>