> A question before I dive into coding a fix: can I assume (for
> all analyzers) that the tokens produced by the tokenStream
> have the following property:
>     currentToken.startOffset() >= lastToken.startOffset()
>
> The analyzers I have tested the highlighter with so far have
> the property:
>     currentToken.startOffset() > lastToken.endOffset()
> so aren't overlapping but I understand this isn't the case for
> others (all demonstrable examples of such "problem" analyzers
> would be appreciated for testing purposes).

There is such an analyzer here: http://savannah.nongnu.org/projects/aramorph

> If I can assume that tokenstreams always produce a zero or more
> increment in token.startOffset I think I can
> design a solution that still works using a single pass of the
> token stream.
> I suspect an additional "flushText" method will be required on
> the Formatter interface to allow implementations
> to use a buffer. This buffer would be required to accumulate
> overlapping token scores when trying to decide if a
> section of the original text merited any highlight markup.

I am not familiar with your most recent highlighter package, but I have
implemented this myself with some older rudimentary highlighting code
that just uses a Vector to keep track of all tokens for the same offset
positions. Highlighting based on the tokens accumulated in the Vector is
triggered when
    currentToken.startOffset() > lastToken.startOffset()
is satisfied, after which the token Vector is simply cleared and the new
token position tracking begins.

Don't forget to make sure that the same input/term text isn't
output/highlighted more than once for multiple output tokens.

Regards,
RBP
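
A rough sketch of the buffering described above, for illustration only.
It does not use the Lucene classes: OffsetGrouper, Tok, Span and the
scorer argument are hypothetical stand-ins, it assumes start offsets
never decrease, and summing the scores is just one possible way to
accumulate them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical helper, not part of Lucene or the highlighter package.
class OffsetGrouper {

    /** Stand-in for a token: term text plus character offsets. */
    static final class Tok {
        final String text;
        final int start;
        final int end;
        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** One stretch of the original text with its accumulated score. */
    static final class Span {
        final int start;
        final int end;
        final double score;
        Span(int start, int end, double score) {
            this.start = start;
            this.end = end;
            this.score = score;
        }
    }

    /**
     * Single pass over the tokens. Tokens sharing a start offset are
     * buffered; the buffer is flushed as soon as a token with a larger
     * start offset arrives. Assumes start offsets are non-decreasing.
     */
    static List<Span> group(List<Tok> tokens, ToDoubleFunction<String> scorer) {
        List<Span> out = new ArrayList<>();
        List<Tok> buffer = new ArrayList<>();
        for (Tok t : tokens) {
            if (!buffer.isEmpty() && t.start > buffer.get(0).start) {
                out.add(flush(buffer, scorer));
                buffer.clear();
            }
            buffer.add(t);
        }
        if (!buffer.isEmpty()) {
            out.add(flush(buffer, scorer)); // last group has no successor
        }
        return out;
    }

    /** Collapses the buffered tokens into one span, so the text is emitted once. */
    private static Span flush(List<Tok> buffer, ToDoubleFunction<String> scorer) {
        int start = buffer.get(0).start;
        int end = buffer.get(0).end;
        double score = 0.0;
        for (Tok t : buffer) {
            end = Math.max(end, t.end);
            score += scorer.applyAsDouble(t.text); // sum is an assumption
        }
        return new Span(start, end, score);
    }
}
```

Each Span covers the source text exactly once however many tokens the
analyzer produced for it, so a formatter that wraps any span with
score > 0 in markup will not highlight the same text twice.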