Re: Multiword Highlighting

Mark Miller Thu, 15 Feb 2007 15:56:46 -0800

Good catch Erick! I'll have to tackle this as well. Mark H is theoriginator of that code so maybe he will chime in, but what I am thinkis this:

In the getSpansFromBooleanquery, keep track of which clauses arerequired. Then based on if any Spans are actually returned fromgetSpansFromTerm for each required clause, add only the correct spans tothe returned spans. If you get what I mean <g>. I am sure there are somemore cases than that to consider, but I think the direction might work.


If you don't tackle it or can't share I'll be doing it myself.

- Mark

Erick Erickson wrote:

I hope you're all following this old thread, because I've just run into

something I don't quite know what to do about with the SpansExtractorcode

that I shamelessly stole.

Let's say my text is "a b c d e f g h" and my query is "a AND z". The

implementation I stole for SpansExtractor (mentioned several times inthis

thread) returns a span for "a" which doesn't preserve the sense of the

query. The root of the problem is that when it gets down to assemblingthegetSpansFromTermQuery, the sense of "AND" is lost and I get span forthe "a"

in the query.

The rest of the kinds of spans don't seem to have the same issue. ORshould

return the "a" in the example above. Any phrase queries that come through
work fine. In fact, our application requires that we have an implied

proximity, mostly anyway, so I haven't had to deal with this untilnow.....


One way, it seems to me, to handle this would be to transform the query
above into a span query with a limit of 10,000, where 10,000 is a magic
number that I'm confident is OK in my application because of the
PositionIncrementGaps I set up during indexing.

Is there a more elegant way of doing this? Or am I missing the boat
entirely? Or did I mess up when I stole the code?

Or, and this would be the easiest for me at least, has this work already

been done and all I really need to do is get a differentimplementation of

SpansExtractor <G>?

Thanks
Erick


On 2/2/07, Mark Miller <[EMAIL PROTECTED]> wrote:




mark harwood wrote:
> Hi Mark,
> Have you looked at the returned spans from any other potential problem
scenarios (other than the 3 word one you suggest) e.g. complex nested
"SpanOr" or "SpanNot" logic?
>
Nothing super intense, but I haved look at some semi complex nesting and
it all looks great if you use the full span highlighting...highlighting
the first and last word of the span only works great if your limited to
word to word proximity searching (like in my parser <G> works great for
my sentence and paragraph proximity searching, though i had to add the
option of hiding my index marker tokens from the output)

Perhaps you know of something that I haven't run into that may not
highlight correctly ?
> Can you attach your code to a new Jira entry so I can have a play?
>
>
I certainly will.

- Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Multiword Highlighting

Reply via email to