[jira] [Commented] (LUCENE-3174) Similarity.Stats class for term & collection statistics

David Mark Nemeskey (JIRA) Tue, 07 Jun 2011 12:25:44 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13045598#comment-13045598
 ]


David Mark Nemeskey commented on LUCENE-3174:
---------------------------------------------

Here's what the patch does:
- it introduces the Similarity.Stats class and its subclasses
- renames computeWeight() to computeStats()
- fixes methods that call computeStats()

What remains to be done:
- rewrite the javadoc
- Stats will be used inside other Similarity methods: its availability should 
be ensured somehow. The current solution in MockBM25Similarity is not 
satisfactory because there is only one Similarity object at a time.
- MultiPhraseWeight, PhraseWeight, SpanWeight and TermWeight call 
computeStats() and extract the IDFExplain object. This level of coupling is 
not desirable and should be eliminated, all the more so because not all 
Similarity subclasses will have an idf.
- It might not even make sense to expose computeStats()?

To consider:
- it might be better if the Stats classes were static, because then they 
could inherit fields from each other
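
To illustrate the last point, here is a rough sketch of what a hierarchy of 
static Stats classes could look like. It is not the code in the attached 
patch: the *Sketch class names, the constructors and the avgFieldLength field 
are made up for this example, and only the idea of an idf field plus an 
average field length for BM25 comes from the issue description.

```java
// Illustrative sketch only; these are NOT the classes from LUCENE-3174.patch.
// All class names ending in "Sketch" are hypothetical.

abstract class SimilaritySketch {
    /** Base holder for statistics; static, so no enclosing instance is needed. */
    static class Stats {
    }
}

class DefaultSimilaritySketch extends SimilaritySketch {
    /** TF-IDF style ranking needs a single statistic: idf. */
    static class Stats extends SimilaritySketch.Stats {
        final float idf;

        Stats(float idf) {
            this.idf = idf;
        }
    }
}

class BM25SimilaritySketch extends SimilaritySketch {
    /** Inherits idf and adds the average field length (assumed field name). */
    static class Stats extends DefaultSimilaritySketch.Stats {
        final float avgFieldLength;

        Stats(float idf, float avgFieldLength) {
            super(idf);
            this.avgFieldLength = avgFieldLength;
        }
    }
}

class StatsSketch {
    public static void main(String[] args) {
        // A caller holding only the base type can still be handed BM25 statistics.
        SimilaritySketch.Stats stats = new BM25SimilaritySketch.Stats(1.5f, 120.0f);
        BM25SimilaritySketch.Stats bm25 = (BM25SimilaritySketch.Stats) stats;
        System.out.println("idf=" + bm25.idf + ", avgFieldLength=" + bm25.avgFieldLength);
    }
}
```

Because the nested classes are static, BM25SimilaritySketch.Stats can extend 
DefaultSimilaritySketch.Stats directly. With non-static inner classes, every 
Stats instance would also need an enclosing Similarity instance of the right 
type, which makes this kind of inheritance awkward.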

> Similarity.Stats class for term & collection statistics
> -------------------------------------------------------
>
>                 Key: LUCENE-3174
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3174
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/search
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>            Priority: Minor
>             Fix For: flexscoring branch
>
>         Attachments: LUCENE-3174.patch
>
>
> In order to support ranking methods besides TF-IDF, we need to make the 
> statistics they need available. These statistics could be computed in 
> computeWeight (soon to become computeStats) and stored in a separate object 
> for easy access. Since this object will be used solely by subclasses of 
> Similarity, it should be implemented as a static inner class, i.e. 
> Similarity.Stats.
> There are two ways this could be implemented:
> - as a single Similarity.Stats class, reused by all ranking algorithms. In 
> this case, this class would have a member field for all statistics;
> - as a hierarchy of Stats classes, one for each ranking algorithm. Each 
> subclass would define only the statistics needed for the ranking algorithm.
> In the second case, the Stats class in DefaultSimilarity would have a single 
> field, idf, while the one in e.g. BM25Similarity would have idf and average 
> field/document length.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
