Just FYI - you can put Solr plugins in <solr-home>/lib as JAR files
rather than messing with solr.war
Erik
On Sep 16, 2009, at 10:15 AM, Alexey Serba wrote:
Hi Aaron,
You can override the default Lucene Similarity and disable the tf and
lengthNorm factors in the scoring formula (see
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Similarity.html
and http://lucene.apache.org/java/2_4_1/api/index.html )
You need to:
1. Compile the following class and put it into Solr's WEB-INF/classes
-------------------------------------------------------------------------------------------------------------------
package my.pkg; // placeholder; "package" is a reserved word, so my.package would not compile

import org.apache.lucene.search.DefaultSimilarity;

/** Similarity that ignores field length and term frequency when scoring. */
public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {
    // Every non-empty field gets the same norm, whatever its length.
    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        return numTerms > 0 ? 1.0f : 0.0f;
    }

    // A term counts once, however many times it occurs in the field.
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }
}
-------------------------------------------------------------------------------------------------------------------
2. Add "<similarity class="my.pkg.NoLengthNormAndTfSimilarity"/>"
into your schema.xml
http://wiki.apache.org/solr/SchemaXml#head-e343cad75d2caa52ac6ec53d4cee8296946d70ca
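To see what the tf override changes for the "foo bar" example in the
question, here is a minimal standalone sketch. It does not use any
Lucene classes; the class name TfInflationDemo and its method names are
made up for illustration. It only assumes that DefaultSimilarity
computes tf as sqrt(freq), and compares that against the flat version
from the class above.

```java
// Standalone illustration; no Lucene classes are used.
// TfInflationDemo, defaultTf and flatTf are hypothetical names.
final class TfInflationDemo {

    // Mirrors DefaultSimilarity.tf(float), which returns sqrt(freq).
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Mirrors the flat tf() override in NoLengthNormAndTfSimilarity.
    static float flatTf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    public static void main(String[] args) {
        // freq = number of times a term such as "foo" occurs in nameNorm,
        // i.e. roughly how many providers contributed the same name.
        for (int freq = 1; freq <= 3; freq++) {
            System.out.printf("freq=%d default=%.3f flat=%.3f%n",
                    freq, defaultTf(freq), flatTf(freq));
        }
    }
}
```

With the default formula a name contributed by two providers gets a tf
factor of sqrt(2), about 1.41, against 1.0 for a single provider; with
the flat version both get 1.0. Note that this also flattens tf inside a
single value, which may or may not matter for your data.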
HIH,
Alex
On Mon, Sep 14, 2009 at 9:50 PM, Aaron McKee <ucbmc...@gmail.com> wrote:
Hello,
Let me preface this by admitting that I'm still fairly new to Lucene and
Solr, so I apologize if any of this sounds naive and I'm open to
thinking about my problem differently.
I'm currently responsible for a rather large dataset of business records
that I'm trying to build a Lucene/Solr infrastructure around, to replace
an in-house solution that we've been using for a few years. These
records are sourced from multiple providers and there's often a fair bit
of overlap in the business coverage. I have a set of fuzzy correlation
libraries that I use to identify these documents and I ultimately create
a super-record that includes metadata from each of the providers. Given
the nature of things, these providers often have slight variations in
wording or spelling in the overlapping fields (it's amazing how many
ways people find to refer to the same business or address). I'd like to
capture these variations, as they facilitate searching, but TF
considerations are currently borking field scoring here.
For example, taking business names into consideration, I have a Solr
schema similar to:
<field name="name_provider1" type="string" indexed="false"
       stored="false" multiValued="true"/>
...
<field name="name_providerN" type="string" indexed="false"
       stored="false" multiValued="true"/>
<field name="nameNorm" type="text" indexed="true" stored="false"
       multiValued="true" omitNorms="true"/>
<copyField source="name_provider1" dest="nameNorm"/>
...
<copyField source="name_providerN" dest="nameNorm"/>
For any given business record, there may be 1..N business names present
in the nameNorm field (some with naming variations, some identical).
With TF enabled, however, I'm getting different match scores on this
field simply based on how many providers contributed to the record,
which is not meaningful to me. For example, a record containing
<nameNorm>foo bar<positionIncrementGap>foo bar</nameNorm>
is necessarily scoring higher than a record just containing
<nameNorm>foo bar</nameNorm>.
Although I wouldn't mind TF data being considered within each discrete
field value, I need to find a way to prevent score inflation based
simply on the number of contributing providers.
Looking at the mailing list archive and searching around, it sounds like
the omitTf boolean in Lucene used to function somewhat in this manner,
but has since taken on a broader interpretation (and name) that now also
disables positional and payload data. Unfortunately, phrase support for
fields like this is absolutely essential. So what's the best way to
address a need like this? I guess I don't mind whether this is handled
at index time or search time, but I'm not sure what I may need to
override or if there's some existing provision I should take advantage
of.
Thank you for any help you may have.
Best regards,
Aaron