[
https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir updated LUCENE-2392:
--------------------------------
Attachment: LUCENE-2392.patch
Updated patch: I brought the patch up to trunk, cleaned it up, and enabled some
more of the stats in scoring (e.g. totalTermFreq/sumOfTotalTermFreq).
In src/test I added a MockLMSimilarity that implements "Bayesian smoothing
using Dirichlet priors" from
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.136.8113
This one is interesting, as it's faster than Lucene's scoring formula today :)
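
For anyone following along, here is a minimal sketch of the Dirichlet-prior
smoothing formula itself, to show where totalTermFreq and sumOfTotalTermFreq
come in. It is not the MockLMSimilarity code from the patch: the class name,
method name, and the choice of mu are all assumptions made for illustration.

```java
/** Hypothetical sketch of Dirichlet-prior smoothing; not the code from the patch. */
final class DirichletSmoothingSketch {

    /**
     * Log of the smoothed probability of one query term in one document:
     * log((tf + mu * p(term|collection)) / (docLength + mu)).
     *
     * @param tf                 occurrences of the term in this document
     * @param docLength          total number of terms in this document's field
     * @param totalTermFreq      occurrences of the term across all documents
     * @param sumOfTotalTermFreq total number of term occurrences in the field
     * @param mu                 smoothing parameter (assumed; tune per collection)
     */
    static double logProb(long tf, long docLength, long totalTermFreq,
                          long sumOfTotalTermFreq, double mu) {
        // Collection language model: how likely the term is anywhere in the field.
        double collectionProb = (double) totalTermFreq / sumOfTotalTermFreq;
        // Document model, pulled toward the collection model by mu pseudo-counts.
        double smoothed = (tf + mu * collectionProb) / (docLength + mu);
        return Math.log(smoothed);
    }

    public static void main(String[] args) {
        // Illustrative numbers only.
        double score = logProb(5, 100, 10, 1000, 2000.0);
        System.out.println("log p(term|doc) = " + score);
    }
}
```

A document that does not contain the term still gets a non-zero probability
from the collection model, which is the point of the smoothing.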
I want to get some of this stuff in shape for David (or any other GSoC
students) to be able to implement their algorithms, but there is a lot of
refactoring (e.g. explains) to do.
I'll create a branch under
https://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring with this
infrastructure in a bit.
Tonight I'll see if I can get the average doc length stuff into the branch too.
> Enable flexible scoring
> -----------------------
>
> Key: LUCENE-2392
> URL: https://issues.apache.org/jira/browse/LUCENE-2392
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2392.patch, LUCENE-2392.patch, LUCENE-2392.patch,
> LUCENE-2392_take2.patch
>
>
> This is a first step (nowhere near committable!), implementing the
> design we converged on in the recent "Baby steps towards making Lucene's
> scoring more flexible" java-dev thread.
> The idea is (if you turn it on for your Field; it's off by default) to
> store full stats in the index, in a new _X.sts file, for each doc (and
> each field) in the index.
> And then have FieldSimilarityProvider impls that compute doc's boost
> bytes (norms) from these stats.
> The patch can index the stats, merge them when segments are merged,
> and expose them through an iterator-only API. It also has a starting
> point for per-field Sims that use the stats iterator API to compute
> boost bytes. But it's not at all tied into actual searching! There's
> still tons left to do, e.g., how does one configure via Field/FieldType
> which stats one wants indexed.
> All tests pass, and I added one new TestStats unit test.
> The stats I record now are:
> - field's boost
> - field's unique term count (a b c a a b --> 3)
> - field's total term count (a b c a a b --> 6)
> - total term count per-term (sum of total term count for all docs
> that have this term)
> Still need at least the total term count for each field.
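
The two per-field counts in the list above can be illustrated with a small
sketch using the same "a b c a a b" example. This is only an illustration,
assuming plain whitespace tokenization; the class and method names are
hypothetical and it does not use the patch's stats API.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the per-field counts; not the stats code from the patch. */
final class FieldStatsSketch {

    /** Splits on whitespace (an assumption; a real analyzer may tokenize differently). */
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Number of distinct terms in the field. */
    static int uniqueTermCount(String text) {
        Set<String> unique = new HashSet<>();
        for (String token : tokenize(text)) {
            unique.add(token);
        }
        return unique.size();
    }

    /** Number of term occurrences in the field, duplicates included. */
    static int totalTermCount(String text) {
        return tokenize(text).length;
    }

    public static void main(String[] args) {
        String field = "a b c a a b";
        System.out.println("unique term count = " + uniqueTermCount(field)); // prints 3
        System.out.println("total term count  = " + totalTermCount(field));  // prints 6
    }
}
```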