Hi

I've read the analysis package.html and found three issues:

1) The code sample under Invoking the Analyzer is broken. It calls
incrementToken(), but inside the while loop it prints 'ts' (which is the
TokenStream) and then does "t = ts.next()", which no longer works. That's an
easy fix, so I don't think a JIRA issue is needed.
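
For reference, here is roughly the loop shape I would expect the sample to have: fetch the attribute once, then read it after every incrementToken(). This is only a sketch. TermAttr and WhitespaceStream are simplified stand-ins I made up so the snippet compiles on its own; they are not the real Lucene classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified stand-ins for the Lucene classes (assumption, not the real API).
class ConsumerLoopSketch {
    /** Stand-in for TermAttribute: holds the text of the current token. */
    static class TermAttr {
        private String term = "";
        String term() { return term; }
        void setTerm(String t) { term = t; }
    }

    /** Stand-in for a TokenStream that splits its input on whitespace. */
    static class WhitespaceStream {
        private final TermAttr termAtt = new TermAttr();
        private final Iterator<String> words;

        WhitespaceStream(String text) {
            words = Arrays.asList(text.trim().split("\\s+")).iterator();
        }

        TermAttr termAttribute() { return termAtt; }

        /** Advances to the next token; returns false at the end of the stream. */
        boolean incrementToken() {
            if (!words.hasNext()) return false;
            termAtt.setTerm(words.next());
            return true;
        }
    }

    /** Consumer loop: get the attribute once, read it after each incrementToken(). */
    static List<String> consume(WhitespaceStream ts) {
        TermAttr termAtt = ts.termAttribute();   // obtained once, before the loop
        List<String> out = new ArrayList<>();
        while (ts.incrementToken()) {
            out.add(termAtt.term());             // read the attribute, not the stream
        }
        return out;
    }

    public static void main(String[] args) {
        for (String t : consume(new WhitespaceStream("some text goes here"))) {
            System.out.println("token: " + t);
        }
    }
}
```

The point is that the loop body never touches 'ts' itself and never calls next(); it only reads the attribute instance that the stream updates in place.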

2) The documentation specifies that "Even consumers of TokenStreams should
normally call addAttribute() instead of getAttribute(), because it would not
fail if the TokenStream does not have this Attribute". IMO this is wrong and
will give the wrong impression about how this API should be used. What if
the TokenStream does not care about this attribute? It will not fill it with
any information. The LengthFilter example, which calls addAttribute
instead of has/getAttribute, is a good illustration of why you shouldn't just
call addAttribute. LengthFilter relies on the given TokenStream to fill
TermAttribute with some information, so that it can later filter out terms
of length < threshold. But what if I create a LengthFilter and give it a
TokenStream which creates just PartOfSpeechAttribute? Or one that outputs
terms that are not TermAttribute? Obviously it would be silly for me to do
so, but nothing prevents me from doing it. LengthFilter should either
document that it expects TermAttribute to be returned from the input
TokenStream, or better yet, enforce it in the constructor: if you pass a
TokenStream that does not return TermAttribute, throw an
IllegalArgumentException.

But anyway, the current documentation is, IMO, wrong and may give the wrong
impression. I don't know if this warrants a larger issue to investigate all
the current TokenFilters and validate the input TokenStream. In my filters,
I enforce the existence of a certain attribute. If I've misunderstood
something, please correct me.
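
To make the constructor check concrete, here is a minimal sketch of what I mean. Stream, TermAttr, PartOfSpeechAttr and this LengthFilter are hypothetical stand-ins written only for illustration, not the real Lucene classes. The sketch also assumes that a producer registers its attributes in its own constructor, so they are already visible when the filter is built.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of an attribute-based stream (assumption, not the real API).
class AttributeCheckSketch {
    static class TermAttr { String term = ""; }
    static class PartOfSpeechAttr { String tag = ""; }

    /** Stand-in for a TokenStream: a map from attribute class to instance. */
    static class Stream {
        private final Map<Class<?>, Object> attrs = new HashMap<>();

        boolean hasAttribute(Class<?> c) { return attrs.containsKey(c); }

        /** Returns the existing instance, or silently registers a new empty one. */
        <T> T addAttribute(Class<T> c) {
            Object existing = attrs.get(c);
            if (existing == null) {
                try {
                    existing = c.getDeclaredConstructor().newInstance();
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
                attrs.put(c, existing);
            }
            return c.cast(existing);
        }
    }

    /** Filter that refuses an input which does not already carry TermAttr. */
    static class LengthFilter {
        final TermAttr termAtt;
        final int minLength;

        LengthFilter(Stream input, int minLength) {
            if (!input.hasAttribute(TermAttr.class)) {
                throw new IllegalArgumentException(
                        "input stream does not provide TermAttr");
            }
            // Safe now: this returns the producer's instance, not a fresh empty one.
            this.termAtt = input.addAttribute(TermAttr.class);
            this.minLength = minLength;
        }
    }

    public static void main(String[] args) {
        Stream posOnly = new Stream();
        posOnly.addAttribute(PartOfSpeechAttr.class);
        try {
            new LengthFilter(posOnly, 3);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Without the hasAttribute check, addAttribute would succeed on the part-of-speech-only stream and the filter would quietly work on an attribute that nobody ever fills.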

3) I think it would help if there were some documentation or an example of
how TokenFilters are expected to process an Attribute before they return it.
For example, if I have a TokenFilter which processes a certain TermAttribute
by returning two other TermAttributes, then according to my understanding,
upon calling incrementToken() it should:
3.1) On the first call, clone the TokenStream's TermAttribute into an
instance variable. Then process it and store in the TokenStream's
TermAttribute the first TA it should return.
3.2) On the second call, process the saved copy again and store the second
TA in the TokenStream's TermAttribute.
That's because the consumer will call incrementToken and then getAttribute.
That getAttribute will return the TokenStream's attribute and not the
filter's. I think I've read it somewhere, but it doesn't appear in this
package.html.
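
Here is a small sketch of the behaviour described in 3.1 and 3.2, as I understand it. All the classes are made-up stand-ins (not the real Lucene API), and splitting a term on a hyphen is just an arbitrary example of one input token turning into two output tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified model of a filter chain sharing one attribute (assumption, not the real API).
class TwoTokenFilterSketch {
    static class TermAttr { String term = ""; }

    interface Stream { boolean incrementToken(); }

    /** Producer: writes each word into the shared attribute. */
    static class WordStream implements Stream {
        private final TermAttr termAtt;
        private final Iterator<String> words;

        WordStream(TermAttr termAtt, String... words) {
            this.termAtt = termAtt;
            this.words = Arrays.asList(words).iterator();
        }

        public boolean incrementToken() {
            if (!words.hasNext()) return false;
            termAtt.term = words.next();
            return true;
        }
    }

    /** Filter: turns "a-b" into two tokens "a" and "b"; other terms pass through. */
    static class HyphenSplitFilter implements Stream {
        private final Stream input;
        private final TermAttr termAtt;   // the same instance the producer writes to
        private String pending;           // saved second part, or null if none

        HyphenSplitFilter(Stream input, TermAttr termAtt) {
            this.input = input;
            this.termAtt = termAtt;
        }

        public boolean incrementToken() {
            if (pending != null) {
                // Second call: emit the saved part without advancing the input.
                termAtt.term = pending;
                pending = null;
                return true;
            }
            if (!input.incrementToken()) return false;
            int dash = termAtt.term.indexOf('-');
            if (dash >= 0) {
                // First call: save what is needed later, then overwrite the
                // shared attribute with the first token to return.
                pending = termAtt.term.substring(dash + 1);
                termAtt.term = termAtt.term.substring(0, dash);
            }
            return true;
        }
    }

    /** Consumer: reads the shared attribute after every incrementToken(). */
    static List<String> run(String... words) {
        TermAttr att = new TermAttr();
        Stream ts = new HyphenSplitFilter(new WordStream(att, words), att);
        List<String> out = new ArrayList<>();
        while (ts.incrementToken()) out.add(att.term);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run("wi-fi", "router"));
    }
}
```

The filter has to keep its own copy of whatever it still needs, because the consumer only ever sees the single shared attribute instance, and that instance is overwritten on every call.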

Shai
