Adding Support for occurrences in SpanQueries

Daniel Shane Wed, 09 Mar 2011 13:52:46 -0800

I'm currently working on a project that involves highlighting all the words in 
a document that match a given Query.


Right now, there is a highlighter in Lucene, but all it does, I think, is to 
take the query, extract the terms out of it, and highlight every term.

I presume this is what everyone usually wants, but in my case, what I want is 
to highlight only the words that actually take part in the query's internal 
evaluation.

For example, let's say I used a SpanNearNot query: I would not want to highlight 
the terms in the spans that were excluded.

I was thinking of adding this feature to the SpanQueries, since they share an 
API that regular Queries do not have: getSpans().

Regular queries, I think, do not allow us to get the positions of the matched 
elements in the query (if any matched) so I would not touch these.

Considering SpanQueries have the getSpans() method, I wanted to add this API to 
the Spans class it returns:

*****************************

public abstract class Spans {
  public abstract boolean next() throws IOException;
  public abstract boolean skipTo(int target) throws IOException;
  public abstract int doc();
  public abstract int start();
  public abstract int end();
  
  public abstract Collection/*<byte[]>*/ getPayload() throws IOException;
  public abstract boolean isPayloadAvailable();

  // NEW STUFF HERE
  public abstract Collection/*<SpanMatchedTerm>*/ getSpanMatchedTerms();
}

public class SpanMatchedTerm {
    public Term term;
    public String displayName;
    public int position;
    
    /**
     * Creates a SpanMatchedTerm. The displayName is an optional name that
     * refers to this query. Used when term.text() is not enough.
     * A good example would be when you stem terms.
     * You could use the displayName as the non-stemmed text, which
     * you would use afterwards to display this match.
     */
    public SpanMatchedTerm(Term term, String displayName, int position) {
        this.term = term;
        this.position = position;
        this.displayName = displayName;
    }        
}

******************************

So basically, I can create a SpanQuery, then call getSpans() on it, cycle 
through the spans, each time calling getSpanMatchedTerms() to get the 
individual terms that allowed this span to match. 

The getSpanMatchedTerms() method would work just like getPayload(), except it 
would return the positions of the matched terms along with whatever optional 
displayName you attached to each term.
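
As a rough sketch of that loop, the following uses small stand-in types rather 
than the real Lucene classes, so it compiles on its own. SpansSketch, 
FixedSpans, MatchedTermSketch and HighlightSketch are hypothetical names made 
up for this illustration; only next(), doc() and the proposed 
getSpanMatchedTerms() mirror the API above, and the term is a plain String 
instead of a Lucene Term.

```java
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Stand-in for the proposed SpanMatchedTerm (a String replaces the Lucene Term).
final class MatchedTermSketch {
    final String text;
    final String displayName; // optional, may be null
    final int position;

    MatchedTermSketch(String text, String displayName, int position) {
        this.text = text;
        this.displayName = displayName;
        this.position = position;
    }
}

// Stand-in for the part of Spans used here, plus the proposed method.
interface SpansSketch {
    boolean next();
    int doc();
    Collection<MatchedTermSketch> getSpanMatchedTerms();
}

// Hypothetical in-memory Spans: entry i is one span found in document docs[i].
final class FixedSpans implements SpansSketch {
    private final int[] docs;
    private final List<List<MatchedTermSketch>> terms;
    private int current = -1;

    FixedSpans(int[] docs, List<List<MatchedTermSketch>> terms) {
        this.docs = docs;
        this.terms = terms;
    }

    public boolean next() { return ++current < docs.length; }
    public int doc() { return docs[current]; }
    public Collection<MatchedTermSketch> getSpanMatchedTerms() { return terms.get(current); }
}

final class HighlightSketch {
    // Collects the positions to highlight in one document: only terms that
    // actually took part in a matching span are included.
    static SortedSet<Integer> positionsToHighlight(SpansSketch spans, int docId) {
        SortedSet<Integer> positions = new TreeSet<Integer>();
        while (spans.next()) {
            if (spans.doc() != docId) {
                continue;
            }
            for (MatchedTermSketch matched : spans.getSpanMatchedTerms()) {
                positions.add(matched.position);
            }
        }
        return positions;
    }
}
```

In this sketch, a Spans implementation for an excluding query would simply 
never report the terms of the excluded spans, so they would not end up in the 
highlighted positions.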

The displayName is useful if you want to write a SpanWildcardQuery() that 
mimics the WildcardQuery. In that case, you would like to highlight every term, 
but if you want to show a navigation bar to cycle through hits, you want to 
show the original term with the wildcard in it, not every different term that 
matched.
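
To illustrate the navigation bar idea, here is a minimal sketch that groups hit 
positions under the displayName when one is set, and under the term text 
otherwise. WildcardHit and NavigationSketch are hypothetical names used only 
for this example, and the term is again a plain String rather than a Lucene 
Term.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a matched term produced by a wildcard-style span query.
final class WildcardHit {
    final String text;        // the concrete term that matched, e.g. "term"
    final String displayName; // the original pattern, e.g. "te*"; may be null
    final int position;

    WildcardHit(String text, String displayName, int position) {
        this.text = text;
        this.displayName = displayName;
        this.position = position;
    }
}

final class NavigationSketch {
    // Groups hit positions by the label to show in the navigation bar,
    // keeping labels in the order they were first seen.
    static Map<String, List<Integer>> groupByLabel(List<WildcardHit> hits) {
        Map<String, List<Integer>> groups = new LinkedHashMap<String, List<Integer>>();
        for (WildcardHit hit : hits) {
            String label = hit.displayName != null ? hit.displayName : hit.text;
            List<Integer> positions = groups.get(label);
            if (positions == null) {
                positions = new ArrayList<Integer>();
                groups.put(label, positions);
            }
            positions.add(hit.position);
        }
        return groups;
    }
}
```

With this grouping, every concrete term can still be highlighted by position, 
while the navigation bar lists one entry per original pattern.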

Do you think this is a good way of going about this problem?

Would it stand a chance of getting included if this implementation was submitted 
as a patch along with the fixes to the various Spans*** classes to make it work?

Thanks!
Daniel Shane
