Hi,

I would like to customize the lucene scoring to remove the effect of the
"coord()" parameter in Similarity for the number of fields in which the
whole query is found.

First, here's what I'm trying to achieve:

For example, if I have the query "New York" and that the documents have 2
fields: name and description. What I want is that it doesn't make a
difference if "New York" is found in both the name and the description or
just one of them. Similarity.coord() seems to be the parameter responsible
for that...

My first attempt to solve that was to supply my own implementation of
Similarity that is a subclass of DefaultSimilarity and simply override
coord(). To my surprise, I found out that that parameter is not only used to
take into consideration the number of fields in which the query was found
but also the number of terms in the query found in a given field. So, the
effect is that the number of fields in which the query is found is cancelled
but it introduces the problem that now the query "New York" doesn't favor
"New York" over "York" anymore, which is really problematic...

Then, I saw this:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/package-summary.html#scoring

It's a very vague description. I then started to look into the lucene code
and now I'm at the point where I don't really know where to start and if I'm
looking at the right thing.

By the way, I'm using Solr... but the scoring is a lucene thing so I think
I'm at the right place.

So, what I'm looking for is a hint on where and how I should start.

Thanks in advance,
Sebastien

Reply via email to