[jira] [Commented] (LUCENE-3174) Similarity.Stats class for term & collection statistics

David Mark Nemeskey (JIRA) Sat, 11 Jun 2011 08:37:55 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047928#comment-13047928
 ]


David Mark Nemeskey commented on LUCENE-3174:
---------------------------------------------

bq. I would disagree in this case, I think a composite similarity that has N 
sub-similarities would just return a MultiStats that keeps these in an array, 
as this composite doesn't care at all what's in them, it just needs to be able to 
delegate them back to the subs' docscorers later.

I didn't think of that. I really like this idea.
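
To make the quoted idea concrete, here is a rough sketch of how such a composite could look. Every name in it (Stats, Sim, IdfStats, IdfSim, MultiStats, SumSim) is a hypothetical stand-in and not the flexscoring branch API, and scoring is collapsed into a single score() method rather than real docscorers:

```java
// Hypothetical stand-ins for illustration only; not the actual Lucene classes.
abstract class Stats {}

abstract class Sim {
    abstract Stats computeStats(long docFreq, long maxDoc);
    abstract float score(Stats stats, float tf);
}

// A simple sub-similarity with its own Stats subclass.
final class IdfStats extends Stats {
    final float idf;
    IdfStats(float idf) { this.idf = idf; }
}

final class IdfSim extends Sim {
    @Override
    Stats computeStats(long docFreq, long maxDoc) {
        return new IdfStats((float) (Math.log((double) maxDoc / (docFreq + 1)) + 1.0));
    }

    @Override
    float score(Stats stats, float tf) {
        IdfStats s = (IdfStats) stats; // each sub-similarity casts its own stats
        return (float) Math.sqrt(tf) * s.idf;
    }
}

// The composite's stats: an opaque array, one entry per sub-similarity.
final class MultiStats extends Stats {
    final Stats[] subStats;
    MultiStats(Stats[] subStats) { this.subStats = subStats; }
}

// A composite that sums its sub-similarities' scores.
final class SumSim extends Sim {
    private final Sim[] subs;
    SumSim(Sim... subs) { this.subs = subs; }

    @Override
    Stats computeStats(long docFreq, long maxDoc) {
        Stats[] all = new Stats[subs.length];
        for (int i = 0; i < subs.length; i++) {
            all[i] = subs[i].computeStats(docFreq, maxDoc);
        }
        return new MultiStats(all);
    }

    @Override
    float score(Stats stats, float tf) {
        MultiStats multi = (MultiStats) stats;
        float sum = 0f;
        for (int i = 0; i < subs.length; i++) {
            // Hand each entry back to the sub-similarity that produced it.
            sum += subs[i].score(multi.subStats[i], tf);
        }
        return sum;
    }
}
```

The composite never looks inside the entries; it only needs to keep them in the same order as its sub-similarities.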

As for Stats, I see several advantages of a single class:
- no need for casting. It may be just me, but having to cast everywhere doesn't 
feel right to me;
- we show in one place what statistics the ranking algorithms use, the user of 
the library doesn't need to "hunt" for this information;
- I think there will be Similarities that use the same Stats subclass, e.g. 
MockLMSimilarity uses TFIDFSimilarity.IDFStats. Or it could define its own 
Stats that looks exactly the same. Either solution seems a bit strange to me;
- one less class to write if you want to add a new Similarity (provided you 
don't need a new statistic, in which case you have to write your own and cast 
it).
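
To illustrate the casting point, here is a small sketch contrasting the two options using BM25 as the example. The class names (AllStats, BaseStats, IdfOnlyStats, Bm25Stats, StatsDemo) and the k1/b values are assumptions made for this example, not code from the patch:

```java
// Option 1 (hypothetical): one class with a field for every statistic.
final class AllStats {
    float idf;
    float avgFieldLength;
}

// Option 2 (hypothetical): a hierarchy, one Stats subclass per ranking algorithm.
abstract class BaseStats {}

class IdfOnlyStats extends BaseStats {
    final float idf;
    IdfOnlyStats(float idf) { this.idf = idf; }
}

final class Bm25Stats extends IdfOnlyStats {
    final float avgFieldLength;
    Bm25Stats(float idf, float avgFieldLength) {
        super(idf);
        this.avgFieldLength = avgFieldLength;
    }
}

final class StatsDemo {
    // Commonly used BM25 defaults, assumed here for illustration.
    static final float K1 = 1.2f;
    static final float B = 0.75f;

    // Standard BM25 term score given the statistics it needs.
    static float bm25(float idf, float avgLen, float tf, float len) {
        float norm = K1 * (1 - B + B * len / avgLen);
        return idf * tf * (K1 + 1) / (tf + norm);
    }

    // Single class: the fields are read directly, no cast.
    static float scoreSingle(AllStats s, float tf, float len) {
        return bm25(s.idf, s.avgFieldLength, tf, len);
    }

    // Hierarchy: the scorer receives the base type and must downcast,
    // which fails at runtime if it is handed the wrong subclass.
    static float scoreHierarchy(BaseStats s, float tf, float len) {
        Bm25Stats b = (Bm25Stats) s;
        return bm25(b.idf, b.avgFieldLength, tf, len);
    }
}
```

Both variants compute the same score; the difference is that the hierarchy moves a type check from compile time to a runtime cast, while the single class carries fields that some algorithms never use.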

> Similarity.Stats class for term & collection statistics
> -------------------------------------------------------
>
>                 Key: LUCENE-3174
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3174
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/search
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>            Priority: Minor
>             Fix For: flexscoring branch
>
>         Attachments: LUCENE-3174.patch, LUCENE-3174.patch
>
>
> In order to support ranking methods besides TF-IDF, we need to make the 
> statistics they need available. These statistics could be computed in 
> computeWeight (soon to become computeStats) and stored in a separate object 
> for easy access. Since this object will be used solely by subclasses of 
> Similarity, it should be implemented as a static inner class, i.e. 
> Similarity.Stats.
> There are two ways this could be implemented:
> - as a single Similarity.Stats class, reused by all ranking algorithms. In 
> this case, this class would have a member field for all statistics;
> - as a hierarchy of Stats classes, one for each ranking algorithm. Each 
> subclass would define only the statistics needed for the ranking algorithm.
> In the second case, the Stats class in DefaultSimilarity would have a single 
> field, idf, while the one in e.g. BM25Similarity would have idf and average 
> field/document length.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

