[jira] [Issue Comment Edited] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

sebastian L. (Issue Comment Edited) (JIRA) Sat, 01 Oct 2011 05:58:00 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118784#comment-13118784
 ]


sebastian L. edited comment on LUCENE-3440 at 10/1/11 12:56 PM:
----------------------------------------------------------------

Here's the patch for 4.0. I forgot to update my Solr-plugin-lib to 
4.0-SNAPSHOT.  

Another patch, another idea! :)

Some thoughts: 
- With the last patch, sum-of-distinct-weights is always calculated, even 
if ScoreOrderFragmentsBuilder is used. 
- Likewise, FieldTermsStack always retrieves the document frequency for each 
term from the IndexReader, whether or not it is needed later.
- Solr developers cannot implement a FragmentsBuilder-plugin with their own 
custom scoring for fragments, because the weighting formula is 
"hard-coded" in WeightedFragInfo. BTW, that's the reason I started to work on 
this patch in the first place.

Possible Solution:

1. Collect and pass all needed information to the 
BaseFragmentsBuilder-implementation 
- Introduction of TermInfo.fieldName
- Introduction of WeightedFragInfo.phraseInfos
- Passing an instance of IndexReader as argument to 
BaseFragmentsBuilder.getWeightedFragInfoList() in order to get the needed 
statistical data from the index

2. Move the calculation of sum-of-boosts to 
ScoreOrderFragmentsBuilder.calculateScore()

{code}
  /**
   * Compute WeightedFragInfo.score based on query-boosts
   * @throws IOException 
   */
  public List<WeightedFragInfo> calculateScore( List<WeightedFragInfo> weightedFragInfos,
      IndexReader reader ) throws IOException{
    for( WeightedFragInfo wfi : weightedFragInfos ){
      // sum-of-boosts: add the boost of every phrase found in the fragment
      for( WeightedPhraseInfo wpi : wfi.phraseInfos ){
        wfi.score += wpi.boost;
      }
    }
    return weightedFragInfos;
  }
{code}

3. Calculation of sum-of-distinct-weights with 
WeightOrderFragmentsBuilder.calculateScore()

- In this patch WeightOrderFragmentsBuilder is a subclass of 
ScoreOrderFragmentsBuilder.
- But I think introducing an abstract class OrderedFragmentsBuilder as the 
superclass of ScoreOrderFragmentsBuilder and WeightOrderFragmentsBuilder would 
be a better strategy.
- Moving calculateScore() into BaseFragmentsBuilder and making it abstract 
would be another option.
- The _sum-of-distinct-weights_ approach is the same as presented in the last 
patch.

{code}
  /**
   * Compute WeightedFragInfo.score based on IDF-weighted terms
   * @throws IOException 
   */
  @Override
  public List<WeightedFragInfo> calculateScore( List<WeightedFragInfo> weightedFragInfos,
      IndexReader reader ) throws IOException{

    // cache of IDF weights per term text, shared by all fragments
    Map<String, Float> lookup = new HashMap<String, Float>();
    // terms already counted in the current fragment
    HashSet<String> distinctTerms = new HashSet<String>();

    // numDocs() already excludes deleted documents, so nothing is subtracted
    int numDocs = reader.numDocs();

    int docFreq;
    int length;
    float boost;
    float weight;

    for( WeightedFragInfo wfi : weightedFragInfos ){
      distinctTerms.clear();
      length = 0;
      boost = 0;
      for( WeightedPhraseInfo wpi : wfi.phraseInfos ){
        for( TermInfo ti : wpi.termInfos ) {
          length++;
          // each distinct term contributes its weight only once per fragment
          if( !distinctTerms.add( ti.text ) )
            continue;
          if ( lookup.containsKey( ti.text ) )
            weight = lookup.get( ti.text ).floatValue();
          else {
            docFreq = reader.docFreq( new Term( ti.fieldName, ti.text ) );
            weight = ( float ) ( Math.log( numDocs / ( double ) ( docFreq + 1 ) ) + 1.0 );
            lookup.put( ti.text, Float.valueOf( weight ) );
          }
          boost += Math.pow( weight, 2 ) * wpi.boost;
        }
      }
      // same as boost * length * ( 1 / sqrt( length ) ), but gives 0 instead of NaN for length == 0
      wfi.score = ( float ) ( boost * Math.sqrt( length ) );
    }

    return weightedFragInfos;
  }
{code}
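
To make the effect of the weighting easier to see, here is a minimal stand-alone sketch of the same formula that runs without Lucene. The class IdfWeightSketch, its nested Phrase class and the docFreqs map are assumptions made up for this illustration; the map only stands in for the IndexReader.docFreq() lookup and none of this is part of the patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the patch and not a Lucene class.
final class IdfWeightSketch {

  // Stand-in for a WeightedPhraseInfo: a query boost plus the matched term texts.
  static final class Phrase {
    final float boost;
    final List<String> terms;
    Phrase( float boost, String... terms ){
      this.boost = boost;
      this.terms = Arrays.asList( terms );
    }
  }

  // Same weight expression as in calculateScore(): ln( numDocs / ( docFreq + 1 ) ) + 1
  static float weight( int numDocs, int docFreq ){
    return ( float ) ( Math.log( numDocs / ( double ) ( docFreq + 1 ) ) + 1.0 );
  }

  // Scores one fragment; docFreqs replaces the index lookup (assumption).
  static float score( List<Phrase> phrases, Map<String, Integer> docFreqs, int numDocs ){
    Set<String> distinctTerms = new HashSet<String>();
    int length = 0;
    float boost = 0;
    for( Phrase p : phrases ){
      for( String term : p.terms ){
        length++;
        if( !distinctTerms.add( term ) )
          continue; // a repeated term adds to the length, but not to the weight
        Integer df = docFreqs.get( term );
        float w = weight( numDocs, df == null ? 0 : df.intValue() );
        boost += Math.pow( w, 2 ) * p.boost;
      }
    }
    return ( float ) ( boost * Math.sqrt( length ) );
  }
}
```

With such a helper, a fragment that only repeats a term found in almost every document gets a weight close to 1 for that term, while a single rare term contributes a much larger squared weight, so the fragment with the rare term is ranked first.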

With this approach programmers can easily implement their own fragment 
weighting, simply by overriding calculateScore(). 
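
A rough sketch of how the abstract OrderedFragmentsBuilder from point 3 and a custom calculateScore() could fit together. All class names ending in "Sketch" are hypothetical, simplified stand-ins invented for this example, not the classes from the patch or from Lucene, and the IndexReader argument is left out to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for WeightedPhraseInfo.
final class PhraseInfoSketch {
  final float boost;
  PhraseInfoSketch( float boost ){ this.boost = boost; }
}

// Hypothetical stand-in for WeightedFragInfo.
final class FragInfoSketch {
  final List<PhraseInfoSketch> phraseInfos;
  float score;
  FragInfoSketch( PhraseInfoSketch... phraseInfos ){
    this.phraseInfos = Arrays.asList( phraseInfos );
  }
}

// Shared superclass: ordering lives here, scoring is left to the subclasses.
abstract class OrderedFragmentsBuilderSketch {
  abstract List<FragInfoSketch> calculateScore( List<FragInfoSketch> fragInfos );

  // Scores the fragments, then returns a copy sorted by descending score.
  final List<FragInfoSketch> order( List<FragInfoSketch> fragInfos ){
    List<FragInfoSketch> scored = new ArrayList<FragInfoSketch>( calculateScore( fragInfos ) );
    Collections.sort( scored, new Comparator<FragInfoSketch>(){
      public int compare( FragInfoSketch a, FragInfoSketch b ){
        return Float.compare( b.score, a.score );
      }
    } );
    return scored;
  }
}

// Sum-of-boosts scoring, as in point 2.
final class BoostOrderSketch extends OrderedFragmentsBuilderSketch {
  @Override
  List<FragInfoSketch> calculateScore( List<FragInfoSketch> fragInfos ){
    for( FragInfoSketch fi : fragInfos ){
      fi.score = 0;
      for( PhraseInfoSketch pi : fi.phraseInfos ){
        fi.score += pi.boost;
      }
    }
    return fragInfos;
  }
}

// A custom weighting: only the number of matched phrases counts.
final class PhraseCountOrderSketch extends OrderedFragmentsBuilderSketch {
  @Override
  List<FragInfoSketch> calculateScore( List<FragInfoSketch> fragInfos ){
    for( FragInfoSketch fi : fragInfos ){
      fi.score = fi.phraseInfos.size();
    }
    return fragInfos;
  }
}
```

A caller would pick one subclass and call order(); switching the weighting then means switching the subclass, without touching the ordering code.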

I think the major drawback of this idea is that the FragmentsBuilder must 
traverse the whole stack of WeightedFragInfo once again. Since we have tomes 
with more than 3000 pages of OCR, this _could_ be a problem, but I can't 
confirm that yet. One way to avoid this would be to make FieldFragList 
"pluggable" through an interface "FragList", so that the FragmentsBuilder-plugin 
could be parameterized with the intended implementation of FragList:

{code:xml}
<highlighter>
 <fragmentsBuilder name="weight-ordered" 
                   class="org.apache.solr.highlight.OrderedFragmentsBuilder">
  <fragList class="org.apache.lucene.search.vectorhighlight.WeightedFragList" />
 </fragmentsBuilder>
 <fragmentsBuilder name="boost-ordered" 
                   class="org.apache.solr.highlight.OrderedFragmentsBuilder">
  <fragList class="org.apache.lucene.search.vectorhighlight.BoostedFragList" />
 </fragmentsBuilder>
</highlighter>
{code}
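
On the Java side, a pluggable FragList might look roughly like the sketch below. The interface and class names are assumptions made up for this illustration (no such interface exists in the patch), and the fragment is reduced to its offsets and phrase boosts. The point is only that the score is computed while the fragment is added, so the builder would not need a second pass over the list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: the list implementation decides how a fragment is scored.
interface FragListSketch {
  void add( int startOffset, int endOffset, float[] phraseBoosts );
  List<ScoredFragSketch> fragments();
}

// Hypothetical, immutable fragment with its final score.
final class ScoredFragSketch {
  final int startOffset;
  final int endOffset;
  final float score;
  ScoredFragSketch( int startOffset, int endOffset, float score ){
    this.startOffset = startOffset;
    this.endOffset = endOffset;
    this.score = score;
  }
}

// Sum-of-boosts variant: the score is complete as soon as add() returns.
final class BoostedFragListSketch implements FragListSketch {
  private final List<ScoredFragSketch> frags = new ArrayList<ScoredFragSketch>();

  public void add( int startOffset, int endOffset, float[] phraseBoosts ){
    float score = 0;
    for( float boost : phraseBoosts ){
      score += boost;
    }
    frags.add( new ScoredFragSketch( startOffset, endOffset, score ) );
  }

  public List<ScoredFragSketch> fragments(){
    return frags;
  }
}
```

An IDF-weighted variant would implement the same interface but would additionally need the term texts and the index statistics when add() is called.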

Further notes:
- As shown in this patch, "WeightedFragInfo.totalBoost" should be renamed to 
"WeightedFragInfo.score".
- "ScoreOrderFragmentsBuilder" should be renamed to 
"BoostOrderFragmentsBuilder".
                
> FastVectorHighlighter: IDF-weighted terms for ordered fragments 
> ----------------------------------------------------------------
>
>                 Key: LUCENE-3440
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3440
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 3.5, 4.0
>            Reporter: sebastian L.
>            Priority: Minor
>              Labels: FastVectorHighlighter
>             Fix For: 3.5, 4.0
>
>         Attachments: LUCENE-3.5-SNAPSHOT-3440-6-ProofOfConcept.java, 
> LUCENE-3.5-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-6.patch, 
> WeightOrderFragmentsBuilder_table01.html, 
> WeightOrderFragmentsBuilder_table02.html
>
>
> The FastVectorHighlighter gives every term found in a fragment an equal 
> weight. As a result, fragments with a high number of words or, in the worst 
> case, a high number of very common words are ranked higher than fragments 
> that contain *all* of the terms used in the original query. 
> This patch provides ordered fragments with IDF-weighted terms: 
> total weight = total weight + IDF for unique term per fragment * boost of 
> query; 
> The ranking formula should be the same as, or at least similar to, the one 
> used in org.apache.lucene.search.highlight.QueryTermScorer.
> The patch is simple, but it works for us. 
> Some ideas:
> - A better approach would be moving the whole fragments-scoring into a 
> separate class.
> - Switch scoring via parameter 
> - Exact phrases should be given an even better score, regardless of whether 
> a phrase-query was executed or not
> - edismax/dismax-parameters pf, ps and pf^boost should be observed and 
> corresponding fragments should be ranked higher 

