RE: Looking for a code pattern to pass stop words as an attribute

Uwe Schindler Wed, 22 Aug 2012 01:04:10 -0700

You could misuse the attributes API:

All filters in a chain share the same attributes. The chaining itself achieves
this: new TokenFilter(otherTokenStream) shares the attributes of the stream it
wraps. What you could do to make the chaining non-linear:

Create "helper" filters that are not part of the chain by linking them to the
input TokenStream, but never call incrementToken() on them. Their internals
will always see the same attributes and attribute contents, so you could call
accept() on them, if it were not protected. The stream is controlled by our
own TokenFilter, so we call incrementToken() only on ours and merely misuse
the helpers' accept() method, which operates on the attributes already
populated by our own call to incrementToken():

stopwordMarkFilter = new TokenFilter(....) {
    private final markerAtt = addAttribute(...);

    // Helpers: linked to the same input, but never part of the chain.
    private final FilteringTokenFilter japanesePOS =
        new JapanesePartOfSpeechStopFilter(true, input, stoptags);
    private final FilteringTokenFilter stopfilter =
        new StopFilter(matchVersion, input, stopwords);

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        if (!japanesePOS.accept() || !stopfilter.accept()) {
            // mark the current token as a stopword.
            markerAtt.setIsStopword(true);
        }
        return true;
    }
}
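
To make the attribute sharing concrete without pulling in Lucene, here is a
minimal toy model of the idea. Every class in it (Attributes, Stream, Filter,
FilteringFilter, SetStopFilter, StopwordMarkFilter) is a hypothetical stand-in
written for this illustration, not the Lucene API, and it uses a single helper
where the sketch above uses two. It only shows that a helper linked to the
same input sees the token our own filter has just advanced to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Toy stand-in for the attribute state shared along a chain (not Lucene's API).
class Attributes {
    String term;
    boolean stopword;
}

abstract class Stream {
    final Attributes atts;
    Stream(Attributes atts) { this.atts = atts; }
    abstract boolean incrementToken();
}

// Source of tokens; owns the attribute state and resets it for every token.
class ListTokenizer extends Stream {
    private final Iterator<String> it;
    ListTokenizer(List<String> tokens) {
        super(new Attributes());
        this.it = tokens.iterator();
    }
    @Override boolean incrementToken() {
        if (!it.hasNext()) return false;
        atts.term = it.next();
        atts.stopword = false;
        return true;
    }
}

// A filter reuses the attribute object of the stream it wraps.
abstract class Filter extends Stream {
    protected final Stream input;
    Filter(Stream input) { super(input.atts); this.input = input; }
}

abstract class FilteringFilter extends Filter {
    FilteringFilter(Stream input) { super(input); }
    protected abstract boolean accept();
    @Override boolean incrementToken() {
        while (input.incrementToken()) {
            if (accept()) return true;
        }
        return false;
    }
}

class SetStopFilter extends FilteringFilter {
    private final Set<String> stop;
    SetStopFilter(Stream input, Set<String> stop) { super(input); this.stop = stop; }
    @Override protected boolean accept() { return !stop.contains(atts.term); }
}

// Marks stopwords instead of dropping them, by asking a helper outside the chain.
class StopwordMarkFilter extends Filter {
    private final FilteringFilter helper;
    StopwordMarkFilter(Stream input, Set<String> stop) {
        super(input);
        this.helper = new SetStopFilter(input, stop); // linked to input, never advanced
    }
    @Override boolean incrementToken() {
        if (!input.incrementToken()) return false;
        // Compiles here only because all toy classes share one package.
        if (!helper.accept()) atts.stopword = true;
        return true;
    }
}

class SharedAttributesDemo {
    // Returns every token, with "*" appended to the ones marked as stopwords.
    static List<String> mark(List<String> tokens, Set<String> stop) {
        Stream s = new StopwordMarkFilter(new ListTokenizer(tokens), stop);
        List<String> out = new ArrayList<>();
        while (s.incrementToken()) {
            out.add(s.atts.stopword ? s.atts.term + "*" : s.atts.term);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the"));
        System.out.println(mark(Arrays.asList("the", "quick", "fox"), stop));
        // prints [the*, quick, fox]
    }
}
```

Note that helper.accept() is legal in this toy only because Java's protected
access also covers classes in the same package; a filter living in a different
package from the helper's base class does not get that access.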

 

The only problem: accept() is not intended to be called from the outside, so
it is of course protected...
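
One possible way around the modifier, offered purely as an assumption and not
something the filters are designed for, is to invoke the protected method
through reflection. The sketch below uses hypothetical stand-in classes
(BaseFilteringFilter, LengthFilter, AcceptInvoker) rather than Lucene's own,
so it demonstrates only the plain Java mechanism; whether this is acceptable
in a real analysis chain is a separate judgment call.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a library base class whose accept() is protected.
abstract class BaseFilteringFilter {
    protected abstract boolean accept();
}

// Hypothetical concrete filter: accepts terms longer than two characters.
class LengthFilter extends BaseFilteringFilter {
    String term = "";
    @Override protected boolean accept() { return term.length() > 2; }
}

// Looks up accept() once and calls it reflectively on any subclass instance.
class AcceptInvoker {
    private final Method accept;

    AcceptInvoker() {
        try {
            accept = BaseFilteringFilter.class.getDeclaredMethod("accept");
            accept.setAccessible(true); // bypass the protected modifier
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    boolean accepts(BaseFilteringFilter filter) {
        try {
            // Dispatches virtually to the subclass's implementation.
            return (Boolean) accept.invoke(filter);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The cost is the usual one for reflection: the method name is checked only at
runtime, so a rename in a later release would surface as an exception rather
than a compile error.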

 

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

 

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> Dawid Weiss
> Sent: Wednesday, August 22, 2012 8:51 AM
> To: [email protected]
> Subject: Re: Looking for a code pattern to pass stop words as an attribute
>
> Thanks for replies Steve, Uwe.
>
> > if you don't want to create your own "marker filter", you can use
> > KeywordMarkerFilter (http://goo.gl/OOgf4) instead
>
> This is pretty much what I had come up with, although I used a custom filter
> class (with a similar attribute). The thing I have trouble with is, however,
> that stop words may be based not only on images but also on other
> attributes. In particular, the Japanese pipeline uses _two_ term suppression
> classes:
>
>     stream = new JapanesePartOfSpeechStopFilter(true, stream, stoptags);
>     ...
>     stream = new StopFilter(matchVersion, stream, stopwords);
>
> Of course I can just copy/paste the source of these and build my own keyword
> marker, this is clear to me. But I'd rather build a filter that delegates to
> these original classes and aggregates their output so that I don't have to
> rebuild things on every upgrade and this is where I'm kind of stuck.
> Something like:
>
> if (!japanesePOS.accept() || !stopfilter.accept()) {
>   // mark the current token as a stopword.
> }
>
> I'm just not sure if I can create such a non-linear filter pipeline -- if
> this isn't going to confuse the attribute management code? Note that the
> above filters (japanesePOS, blah) would _not_ be part of the token stream,
> they would be attached to one of the filters. Don't know if I'm clear.
>
> Dawid

