> A question before I dive into coding a fix: can I assume (for
> all analyzers) that the tokens produced by the tokenStream
> have the following property:
>    currentToken.startOffset() >= lastToken.startOffset()
> 
> The analyzers I have tested the highlighter with so far have
> the property:
>    currentToken.startOffset() > lastToken.endOffset()
> so aren't overlapping but I understand this isn't the case for
> others (all demonstrable examples of such "problem" analyzers
> would be appreciated for testing purposes).

One such analyzer is available here:
http://savannah.nongnu.org/projects/aramorph

> If I can assume that tokenstreams always produce a zero or more
> increment in token.startOffset I think I can
> design a solution that still works using a single pass of the
> token stream.
> I suspect an additional "flushText" method will be required on
> the Formatter interface to allow implementations
> to use a buffer. This buffer would be required to accumulate
> overlapping token scores when trying to decide if a
> section of the original text merited any highlight markup.

I am not familiar with your most recent highlighter package, but I have implemented
this myself in some older, rudimentary highlighting code. It simply uses a Vector to
keep track of all tokens at the same offset position. Highlighting based on the
tokens accumulated in the Vector is triggered as soon as
currentToken.startOffset() > lastToken.startOffset() is satisfied, after which the
Vector is cleared and tracking begins again for the new token position. Don't forget
to make sure that the same stretch of input text isn't output/highlighted more than
once when several output tokens map to it.
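
For illustration, here is a rough sketch of that buffering idea in plain Java. It
is not taken from my old code or from the highlighter package: the Tok class is a
hypothetical stand-in for an analyzer token (term text plus start/end offsets), the
<B> markup is just an assumption, and it relies on start offsets never decreasing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class OffsetGroupHighlighter {

    /** Hypothetical stand-in for an analyzer token; not Lucene's Token class. */
    public static final class Tok {
        final String term;
        final int start;
        final int end;

        public Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Single pass over the tokens. Assumes start offsets never decrease.
     * Tokens sharing a start offset are buffered and handled together.
     */
    public static String highlight(String text, List<Tok> tokens, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        List<Tok> group = new ArrayList<>();
        int written = 0; // how much of the original text has been emitted so far

        for (Tok t : tokens) {
            // A larger start offset means the buffered group is complete.
            if (!group.isEmpty() && t.start > group.get(0).start) {
                written = flush(text, group, queryTerms, out, written);
                group.clear();
            }
            group.add(t);
        }
        if (!group.isEmpty()) {
            written = flush(text, group, queryTerms, out, written);
        }
        out.append(text, written, text.length()); // trailing text after the last token
        return out.toString();
    }

    /** Emits the text covered by one group exactly once; returns the new "written" mark. */
    private static int flush(String text, List<Tok> group, Set<String> queryTerms,
                             StringBuilder out, int written) {
        int start = group.get(0).start;
        int end = start;
        boolean hit = false;
        for (Tok t : group) {
            end = Math.max(end, t.end);
            hit |= queryTerms.contains(t.term);
        }
        if (end <= written) {
            return written; // this span was already emitted by an earlier group
        }
        start = Math.max(start, written); // never re-emit text that overlaps a previous group
        out.append(text, written, start); // untokenized text between groups
        if (hit) {
            out.append("<B>").append(text, start, end).append("</B>");
        } else {
            out.append(text, start, end);
        }
        return end;
    }
}
```

With the text "the books here", the tokens the(0,3), books(4,9), book(4,9),
here(10,14) and the query term "book", this returns "the <B>books</B> here": the
two tokens at offset 4 are scored together and the word is written only once.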

Regards,
RBP 




