Re: Multiword Highlighting

Erick Erickson Thu, 15 Feb 2007 16:55:40 -0800

Excellent! I'll give it a whirl in the morning. This may keep me from having
to rebuild my index as well, oh joy!


Thanks
Erick

On 2/15/07, Mark Miller <[EMAIL PROTECTED]> wrote:


Here is my initial attempt...I believe it might be sufficient:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

import java.io.IOException;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;


public class QuerySpansExtractor {
    public Spans[] extractSpans(Query query, IndexReader reader)
        throws IOException {
        List spans = getSpans(query, reader);

        return (Spans[]) spans.toArray(new Spans[spans.size()]);
    }

    private List getSpans(Query query, IndexReader reader)
        throws IOException {
        Spans spans = null;

        if (query instanceof BooleanQuery) {
            return getSpansFromBooleanQuery((BooleanQuery) query, reader);
        } else if (query instanceof PhraseQuery) {
            spans = getSpansFromPhraseQuery((PhraseQuery) query, reader);
        } else if (query instanceof TermQuery) {
            spans = getSpansFromTermQuery((TermQuery) query, reader);
        } else if (query instanceof SpanQuery) {
            spans = getSpansFromSpanQuery((SpanQuery) query, reader);
        }

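        List spanList = new ArrayList(1);

        // Unsupported query types leave spans null; return an empty list
        // rather than a list holding a null entry.
        if (spans != null) {
            spanList.add(spans);
        }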

        return spanList;
    }

    private List getSpansFromBooleanQuery(BooleanQuery query, IndexReader reader)
        throws IOException {
        BooleanClause[] queryClauses = query.getClauses();
        boolean useQuery = true;
        List possibleSpans = new ArrayList();

        for (int i = 0; i < queryClauses.length; i++) {
            Query clauseQuery = queryClauses[i].getQuery();

            if (queryClauses[i].isProhibited()) {
                // A prohibited clause that matches rules out the whole query.
                if (hasMatch(getSpans(clauseQuery, reader))) {
                    useQuery = false;
                }
            } else if (queryClauses[i].isRequired()) {
                // A required clause that does not match rules out the whole
                // query. Probe a throwaway copy, since next() advances the
                // Spans it is called on.
                if (hasMatch(getSpans(clauseQuery, reader))) {
                    possibleSpans.addAll(getSpans(clauseQuery, reader));
                } else {
                    useQuery = false;
                }
            } else {
                possibleSpans.addAll(getSpans(clauseQuery, reader));
            }
        }

        if (!useQuery) {
            possibleSpans = Collections.EMPTY_LIST;
        }

        return possibleSpans;
    }

    // True if any Spans in the list has at least one match.
    private boolean hasMatch(List spansList) throws IOException {
        for (int i = 0; i < spansList.size(); i++) {
            Spans candidate = (Spans) spansList.get(i);

            if ((candidate != null) && candidate.next()) {
                return true;
            }
        }

        return false;
    }

    private Spans getSpansFromPhraseQuery(PhraseQuery query, IndexReader reader)
        throws IOException {
        Term[] queryTerms = query.getTerms();
        int i;
        SpanQuery[] clauses = new SpanQuery[queryTerms.length];

        for (i = 0; i < queryTerms.length; i++) {
            clauses[i] = new SpanTermQuery(queryTerms[i]);
        }

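        // A zero-slop PhraseQuery matches its terms in order, so the span
        // query is ordered only in that case.
        SpanNearQuery sp = new SpanNearQuery(clauses, query.getSlop(),
                query.getSlop() == 0);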
        sp.setBoost(query.getBoost());

        return sp.getSpans(reader);
    }

    private Spans getSpansFromTermQuery(TermQuery query, IndexReader reader)
        throws IOException {
        SpanTermQuery stq = new SpanTermQuery(query.getTerm());
        stq.setBoost(query.getBoost());

        return stq.getSpans(reader);
    }

    private Spans getSpansFromSpanQuery(SpanQuery query, IndexReader reader)
        throws IOException {
        return query.getSpans(reader);
    }
}
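
For what it's worth, the rule the boolean handling above is aiming at can be modeled without Lucene: a missing required term or a present prohibited term means nothing gets highlighted. This is only a rough sketch over an in-memory token list, limited to single-term clauses; the class name, the OPTIONAL/REQUIRED/PROHIBITED constants, and both methods are made up for illustration and are not part of Lucene or of the class above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Lucene-free model of the clause rule; every name here is hypothetical.
class BooleanClauseSketch {
    static final int OPTIONAL = 0;
    static final int REQUIRED = 1;
    static final int PROHIBITED = 2;

    // Token positions at which the term occurs.
    static List<Integer> positions(List<String> tokens, String term) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equals(term)) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Positions to highlight, or an empty list when a required term is
    // missing or a prohibited term is present.
    static List<Integer> highlight(List<String> tokens, String[] terms, int[] occurs) {
        List<Integer> result = new ArrayList<Integer>();
        for (int i = 0; i < terms.length; i++) {
            List<Integer> hits = positions(tokens, terms[i]);
            if (occurs[i] == PROHIBITED) {
                if (!hits.isEmpty()) {
                    return Collections.emptyList();
                }
            } else {
                if (occurs[i] == REQUIRED && hits.isEmpty()) {
                    return Collections.emptyList();
                }
                result.addAll(hits);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("a b c d e f g h".split(" "));
        // "a AND z": z is required but absent, so nothing is highlighted.
        System.out.println(highlight(text, new String[] {"a", "z"},
                new int[] {REQUIRED, REQUIRED}));
        // "a OR z": the "a" at position 0 is still highlighted.
        System.out.println(highlight(text, new String[] {"a", "z"},
                new int[] {OPTIONAL, OPTIONAL}));
    }
}
```

The model treats the token list as a single document, which is the highlighting case; it says nothing about how matches spread across documents in a real index.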



Erick Erickson wrote:
> Mark:
>
> Thanks, that reassures me that I'm not hallucinating. If it gets on my
> priority list I can certainly share the code, since I stole it in the
> first place <G>. I have a semi-solution for now that gets me out from
> under the immediate problem, but it really wants a more robust solution
> than the one I'm using.
>
> Thanks
> Erick
>
> On 2/15/07, Mark Miller <[EMAIL PROTECTED]> wrote:
>>
>> Good catch Erick! I'll have to tackle this as well. Mark H is the
>> originator of that code so maybe he will chime in, but what I am
>> thinking is this:
>>
>> In getSpansFromBooleanQuery, keep track of which clauses are
>> required. Then, based on whether any Spans are actually returned from
>> getSpansFromTermQuery for each required clause, add only the correct
>> spans to the returned spans. If you get what I mean <g>. I am sure
>> there are more cases than that to consider, but I think the direction
>> might work.
>>
>> If you don't tackle it or can't share I'll be doing it myself.
>>
>> - Mark
>>
>> Erick Erickson wrote:
>> > I hope you're all following this old thread, because I've just run
>> > into something I don't quite know what to do about with the
>> > SpansExtractor code that I shamelessly stole.
>> >
>> > Let's say my text is "a b c d e f g h" and my query is "a AND z". The
>> > implementation I stole for SpansExtractor (mentioned several times in
>> > this thread) returns a span for "a", which doesn't preserve the sense
>> > of the query. The root of the problem is that by the time it gets down
>> > to getSpansFromTermQuery, the sense of "AND" is lost and I get a span
>> > for the "a" in the query.
>> >
>> > The rest of the kinds of spans don't seem to have the same issue. OR
>> > should return the "a" in the example above. Any phrase queries that
>> > come through work fine. In fact, our application requires that we have
>> > an implied proximity, mostly anyway, so I haven't had to deal with
>> > this until now.
>> >
>> > One way, it seems to me, to handle this would be to transform the
>> > query above into a span query with a limit of 10,000, where 10,000 is
>> > a magic number that I'm confident is OK in my application because of
>> > the PositionIncrementGaps I set up during indexing.
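
A rough, Lucene-free sketch of that idea, assuming distinct terms and a single token list: treat "a AND z" as an unordered proximity match and accept it only if every term falls inside one window whose extra width stays within the limit. The class and method names are hypothetical, and the gap calculation (window length minus number of terms) only approximates how a span query counts slop.

```java
import java.util.Arrays;
import java.util.List;

// Models "AND as a wide proximity query"; every name here is hypothetical.
class LargeSlopSketch {
    // Smallest [start, end) token window containing every term, or null.
    static int[] smallestWindow(List<String> tokens, String[] terms) {
        int[] best = null;
        for (int start = 0; start < tokens.size(); start++) {
            boolean[] seen = new boolean[terms.length];
            int remaining = terms.length;
            for (int end = start; end < tokens.size() && remaining > 0; end++) {
                for (int t = 0; t < terms.length; t++) {
                    if (!seen[t] && tokens.get(end).equals(terms[t])) {
                        seen[t] = true;
                        remaining--;
                    }
                }
                if (remaining == 0
                        && (best == null || end + 1 - start < best[1] - best[0])) {
                    best = new int[] {start, end + 1};
                }
            }
        }
        return best;
    }

    // The window if all terms occur within the given slop, otherwise null.
    static int[] andAsNear(List<String> tokens, String[] terms, int slop) {
        int[] window = smallestWindow(tokens, terms);
        if (window == null) {
            return null; // some term is missing, so "AND" fails
        }
        int gap = (window[1] - window[0]) - terms.length;
        return gap <= slop ? window : null;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("a b c d e f g h".split(" "));
        System.out.println(andAsNear(text, new String[] {"a", "z"}, 10000) == null);
        System.out.println(Arrays.toString(andAsNear(text, new String[] {"a", "h"}, 10000)));
    }
}
```

With the text above, "a AND z" yields no window because "z" never occurs, while "a AND h" yields the window covering all eight tokens.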
>> >
>> > Is there a more elegant way of doing this? Or am I missing the boat
>> > entirely? Or did I mess up when I stole the code?
>> >
>> > Or, and this would be the easiest for me at least, has this work
>> > already been done and all I really need to do is get a different
>> > implementation of SpansExtractor <G>?
>> >
>> > Thanks
>> > Erick
>> >
>> >
>> > On 2/2/07, Mark Miller <[EMAIL PROTECTED]> wrote:
>> >>
>> >>
>> >>
>> >> mark harwood wrote:
>> >> > Hi Mark,
>> >> > Have you looked at the returned spans from any other potential
>> >> > problem scenarios (other than the 3 word one you suggest) e.g.
>> >> > complex nested "SpanOr" or "SpanNot" logic?
>> >> >
>> >> Nothing super intense, but I have looked at some semi-complex nesting
>> >> and it all looks great if you use the full span highlighting.
>> >> Highlighting the first and last word of the span only works great if
>> >> you're limited to word-to-word proximity searching (like in my parser
>> >> <G>; it works great for my sentence and paragraph proximity searching,
>> >> though I had to add the option of hiding my index marker tokens from
>> >> the output).
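
To make the two highlighting styles concrete, here is a small sketch over a plain token list. The class name, the render method, and the <b> markup are all assumptions for illustration and are not taken from the highlighter code discussed in this thread.

```java
import java.util.Arrays;
import java.util.List;

// Contrasts full-span and endpoint-only highlighting; names are hypothetical.
class SpanHighlightSketch {
    // Marks tokens in [start, end): all of them, or only the first and last.
    static String render(List<String> tokens, int start, int end,
            boolean endpointsOnly) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            boolean inSpan = i >= start && i < end;
            boolean endpoint = i == start || i == end - 1;
            if (inSpan && (!endpointsOnly || endpoint)) {
                out.append("<b>").append(tokens.get(i)).append("</b>");
            } else {
                out.append(tokens.get(i));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(render(tokens, 1, 4, false)); // full span
        System.out.println(render(tokens, 1, 4, true));  // endpoints only
    }
}
```

For a span covering "b c d", the full-span style marks all three tokens while the endpoint style marks only "b" and "d".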
>> >>
>> >> Perhaps you know of something that I haven't run into that may not
>> >> highlight correctly ?
>> >> > Can you attach your code to a new Jira entry so I can have a play?
>> >> >
>> >> >
>> >> I certainly will.
>> >>
>> >> - Mark
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> >> For additional commands, e-mail: [EMAIL PROTECTED]
>> >>
>> >>
>> >
>