tf() is used, just not with the term freq: the length of the matching
Spans is used instead.
The terms from nested Spans will still affect the score (you still get
IDF), but term freq is substituted with the matching Span length.
Also, boosts of nested Spans are ignored; only the top-level boost is used.
Finally, SpanQueries match non-overlapping Spans, but by the SpanQuery
definition of overlap: if the second Span starts one position after the
start of the first Span, that's not considered overlap. If it starts before
or at the same position, that's overlap, and you won't see a match.
- Mark
Rads2029 wrote:
Thanks, that helped clear up quite a few things.
A few questions though:
1) Regarding tf not making a difference: I do believe that overriding tf to
return 1 makes a difference.
When I did not override tf, the score on doc (AB BC BC CD) was higher than
on doc (AB BC CD).
When I did not override tf, the score on doc (AB BC xx xx CD) was lower than
the score on doc (AB BC CD).
When I overrode tf to return 1, both doc (AB BC BC CD) and doc (AB BC CD)
had the same score.
When I overrode tf to return 1, both doc (AB BC xx xx CD) and doc (AB BC CD)
had the same score.
I do want doc (AB BC BC CD) and doc (AB BC CD) to show the same score, but I
also want the score of doc (AB BC xx xx CD) to be less than the score of
doc (AB BC CD).
Also, in the score() method of the SpanScorer class, this is the code:

    public float score() throws IOException {
      float raw = getSimilarity().tf(freq) * value; // raw score
      return norms == null ? raw : raw * Similarity.decodeNorm(norms[doc]); // normalize
    }

As you can see, tf is used here.
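To make the observations in question 1 concrete, here is a minimal standalone sketch of how a span-based frequency could feed tf(). It is not Lucene code: the class and method names are made up for illustration, and the two formulas are assumptions modeled on what DefaultSimilarity is understood to do (sloppyFreq as 1 / (length + 1), tf as a square root). The span positions, including the doubled span for "AB BC BC CD", are hypothetical inputs rather than output captured from a real index.

```java
// Standalone model of span-based scoring; hypothetical names, not Lucene classes.
class SpanScoreSketch {
    // Assumption: modeled on DefaultSimilarity.sloppyFreq, so shorter spans contribute more.
    static float sloppyFreq(int matchLength) {
        return 1.0f / (matchLength + 1);
    }

    // Assumption: modeled on DefaultSimilarity.tf(float), the square root of the frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Each row is one matching span as {start, end}, with end exclusive.
    // The per-span contributions are summed first, then passed through tf() once.
    static float spanTf(int[][] spans) {
        float freq = 0.0f;
        for (int[] span : spans) {
            freq += sloppyFreq(span[1] - span[0]);
        }
        return tf(freq);
    }

    public static void main(String[] args) {
        float tight = spanTf(new int[][] {{0, 3}});          // "AB BC CD"
        float gappy = spanTf(new int[][] {{0, 5}});          // "AB BC xx xx CD"
        float twice = spanTf(new int[][] {{0, 4}, {0, 4}});  // hypothetical: two spans reported
        System.out.println("tight=" + tight + " gappy=" + gappy + " twice=" + twice);
    }
}
```

Under these assumptions a longer span lowers the value and an extra reported span raises it, which is the ordering described above. Overriding tf to return a constant would flatten both effects at once, which is why it cannot give equal scores for the repeated-BC case while still penalizing the gap.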
2) "if you really just want to know about the lengths of instances of Spans
in your index, you can call the getSpans method directly on your
SpanNearQuery and iterate over them yourself, ignoring the ones you want to
ignore"
Could you shed more light on this? How exactly would I know which spans I
need to ignore?
3) "subclass SpanQuery so it returns a new NearSpansNoOverlapping that you
would have wrap the NearSpansOrdered and only return the "shortest" span
from each document."
Could you please give me some more details on how to go about this?
Thanks again a lot for your help.
hossman wrote:
(Disclaimer: i'm not currently looking at the code, this email is entirely
a guess based on what i remember about SpanQueries)
: II ) Using default implementation of tf in Similarity class:
:
: Case 1 - Doc : "AB BC BC CD"
: Result : 4 - Actual score
: % match : ( actual score / max possible score) = ( 4/3) > 100% - This is
: wrong, as I don't want the score to be affected by the number of times BC
: occurs
I suspect you are misunderstanding why you are getting the scores you are
getting.
if i remember correctly, SpanNearQuery ignores all score information
coming from the sub-queries it contains and only scores documents based on
the distances of the matching Spans (this is true for all of the
"container" span queries i believe, because they all use SpanScorer,
and it *only* looks at the Spans)
So i don't think anything in your SpanNearQuery is actually rewarding a
doc for matching one of the individual terms more than once, because
nothing ever looks at the tf() of the individual terms. (if you use a
custom Similarity, and override the tf(int) method to include some
logging, i'm 90% certain you'll see that that method never gets called with
any SpanQuery)
SpanScorer *does* look at every matching Span in a document however -- and
assuming you are allowing slop (and it appears you are, since other
examples you list depend on it) the sequence "AB BC CD" exists twice in
your example document above -- once using the BC at position 2, and once
using the BC at position 3 -- hence the higher than (you) expected score.
(if you use a custom Similarity, and override the tf(float) method to
include some logging, i'm 90% certain you'll see that that method gets
called twice for that span query against an index with only that document
-- once per instance of the span.)
I'm fairly certain that finding overlapping spans is considered a
"feature" of SpanQuery. I suspect if you look through the test cases for
SpanNearQuery you'll even find some examples just like yours where it
requires that there be multiple matches.
looking at the online javadocs, i don't see any simple option to prevent
overlapping spans when constructing the SpanNearQuery, but i think it
would be fairly easy for you to subclass SpanQuery so it returns a new
NearSpansNoOverlapping that you would have wrap the NearSpansOrdered and
only return the "shortest" span from each document.
Incidentally: if you find subclassing SpanNearQuery tedious to do what you
want, keep in mind that you don't have to use IndexSearcher and deal with
the normal scoring (using sloppyFreq & tf) if you don't want to -- if you
really just want to know about the lengths of instances of Spans in your
index, you can call the getSpans method directly on your SpanNearQuery and
iterate over them yourself, ignoring the ones you want to ignore.
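Both suggestions above (a wrapping NearSpansNoOverlapping, or iterating the Spans by hand) come down to the same selection step: for each document, keep the shortest span and ignore the rest. A rough sketch of just that step follows. It is deliberately not tied to Lucene: the Span class and shortestPerDoc method are hypothetical names, and the (doc, start, end) triples are assumed to have been collected already, for example from the doc(), start() and end() values seen while iterating a Spans object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Lucene class.
class ShortestSpanSketch {
    // One span occurrence: document id plus [start, end) positions.
    static final class Span {
        final int doc;
        final int start;
        final int end;

        Span(int doc, int start, int end) {
            this.doc = doc;
            this.start = start;
            this.end = end;
        }

        int length() {
            return end - start;
        }
    }

    // Keeps only the shortest span per document; on ties the earliest one seen wins.
    // Documents come back in the order they were first encountered.
    static List<Span> shortestPerDoc(List<Span> spans) {
        Map<Integer, Span> best = new LinkedHashMap<>();
        for (Span s : spans) {
            Span current = best.get(s.doc);
            if (current == null || s.length() < current.length()) {
                best.put(s.doc, s);
            }
        }
        return new ArrayList<>(best.values());
    }
}
```

With one span left per document, the repeated BC no longer adds to the result, while the remaining span's length is still available to rank "AB BC xx xx CD" below "AB BC CD".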
-Hoss
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
--
- Mark
http://www.lucidimagination.com