[jira] Commented: (LUCENE-1522) another highlighter

Koji Sekiguchi (JIRA) Sat, 14 Mar 2009 17:59:19 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682111#action_12682111 ]


Koji Sekiguchi commented on LUCENE-1522:
----------------------------------------

Mike, sorry for the late reply.

bq. Is this approach guaranteed to only highlight term occurrences that 
actually contribute to the document match?

I'm not sure I understand the question, but if you mean something like Solr's 
hl.requireFieldMatch feature, then yes. Highlighter2 supports it through the 
fieldMatch constructor argument:

{code:java}
/**
 * A constructor. A FragListBuilder and a FragmentsBuilder can be specified (plugins).
 * 
 * @param phraseHighlight true or false for phrase highlighting
 * @param fieldMatch true or false for field matching
 * @param fragListBuilder an instance of FragListBuilder
 * @param fragmentsBuilder an instance of FragmentsBuilder
 */
public Highlighter( boolean phraseHighlight, boolean fieldMatch,
    FragListBuilder fragListBuilder, FragmentsBuilder fragmentsBuilder ){
  this.phraseHighlight = phraseHighlight;
  this.fieldMatch = fieldMatch;
  this.fragListBuilder = fragListBuilder;
  this.fragmentsBuilder = fragmentsBuilder;
}
{code}

bq. Can it handle all / arbitrary Query subclasses?

Currently, no. Highlighter2 first calls flatten() to try to flatten the 
sourceQuery. flatten() recognizes TermQuery, PhraseQuery, and BooleanQuery 
clauses that contain TermQuery or PhraseQuery (prohibited clauses are skipped); 
any other Query type is discarded:

{code:title=FieldQuery.java}
void flatten( Query sourceQuery, Collection<Query> flatQueries ){
  if( sourceQuery instanceof BooleanQuery ){
    BooleanQuery bq = (BooleanQuery)sourceQuery;
    for( BooleanClause clause : bq.getClauses() ){
      if( !clause.isProhibited() )
        flatten( clause.getQuery(), flatQueries );
    }
  }
  else if( sourceQuery instanceof TermQuery ){
    if( !flatQueries.contains( sourceQuery ) )
      flatQueries.add( sourceQuery );
  }
  else if( sourceQuery instanceof PhraseQuery ){
    if( !flatQueries.contains( sourceQuery ) ){
      PhraseQuery pq = (PhraseQuery)sourceQuery;
      if( pq.getTerms().length > 1 )
        flatQueries.add( pq );
      else if( pq.getTerms().length == 1 ){
        flatQueries.add( new TermQuery( pq.getTerms()[0] ) );
      }
    }
  }
  // else discard queries
}
{code}

That said, I'm open to supporting all / arbitrary Query subclasses in H2. :)

bq. How does it score fragments?

Currently, H2 takes into account the query-time boost and the tf within the 
fragment. For example, given q="a OR b^3" and two fragment candidates 
f1="a a a" and f2="a b", f1 scores 3 (1+1+1) and f2 scores 4 (1+3). When 
ScoreOrderFragmentsBuilder (the default) is used, getBestFragments() returns 
f2 first, then f1.
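
As a rough sketch of the arithmetic above (this is not the actual H2 code): the class FragmentScoreSketch and its methods score() and orderByScore() are made up for illustration, and the sketch assumes whitespace tokenization and a boost of 1 for unboosted terms. Each matching term occurrence in a fragment adds that term's query-time boost to the fragment's score, and fragments are then ordered by descending score.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; these names do not exist in the patch.
public class FragmentScoreSketch {

    // Sum the query-time boost of every matching term occurrence (boost x tf).
    static float score(String fragment, Map<String, Float> boosts) {
        float total = 0f;
        for (String token : fragment.split("\\s+")) {
            Float boost = boosts.get(token);
            if (boost != null) {
                total += boost;
            }
        }
        return total;
    }

    // Return a copy of the fragments ordered by descending score.
    static String[] orderByScore(String[] fragments, Map<String, Float> boosts) {
        String[] sorted = fragments.clone();
        Arrays.sort(sorted,
            Comparator.comparingDouble((String f) -> score(f, boosts)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // q = "a OR b^3": "a" has the assumed default boost 1, "b" has boost 3
        Map<String, Float> boosts = new HashMap<>();
        boosts.put("a", 1.0f);
        boosts.put("b", 3.0f);

        String[] candidates = { "a a a", "a b" };
        for (String fragment : orderByScore(candidates, boosts)) {
            System.out.println(fragment + " -> " + score(fragment, boosts));
        }
    }
}
```

With these inputs, "a b" (score 4) is ordered ahead of "a a a" (score 3), matching the example above.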


> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
> LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token streams 
> (general token streams, e.g. WhitespaceTokenizer, are also supported; see the 
> test code in the patch). The idea was inherited from my previous project with 
> my colleague and from LUCENE-644. This approach requires highlighted fields to 
> be TermVector.WITH_POSITIONS_OFFSETS, but it is fast and can support N-grams. 
> It depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" 
> N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - uses query boost to score fragments (currently ignores idf, but supporting 
> it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers

