Just FYI - you can put Solr plugins in <solr-home>/lib as JAR files rather than messing with solr.war

        Erik

On Sep 16, 2009, at 10:15 AM, Alexey Serba wrote:

Hi Aaron,

You can override the default Lucene Similarity and disable the tf and
lengthNorm factors in the scoring formula (see
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Similarity.html
and http://lucene.apache.org/java/2_4_1/api/index.html).

You need to

1) Compile the following class and put it into Solr's WEB-INF/classes
-------------------------------------------------------------------------------------------------------------------
package mypackage;

import org.apache.lucene.search.DefaultSimilarity;

// Similarity that ignores field length and term frequency when scoring.
public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {

        // Same norm for every non-empty field, regardless of its length.
        @Override
        public float lengthNorm(String fieldName, int numTerms) {
                return numTerms > 0 ? 1.0f : 0.0f;
        }

        // A term counts once, however many times it occurs in the field.
        @Override
        public float tf(float freq) {
                return freq > 0 ? 1.0f : 0.0f;
        }
}
-------------------------------------------------------------------------------------------------------------------

2) Add "<similarity class="mypackage.NoLengthNormAndTfSimilarity"/>"
into your schema.xml
http://wiki.apache.org/solr/SchemaXml#head-e343cad75d2caa52ac6ec53d4cee8296946d70ca
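
To see why flattening tf removes the inflation, here is a small standalone
sketch. It does not use Lucene at all: the class and method names are
made up for illustration, defaultTf copies the sqrt(freq) formula that
DefaultSimilarity uses for tf, and every other scoring factor (idf, norms,
coord, boosts) is left out, so treat the numbers as relative only.

```java
// Standalone illustration; no Lucene dependency. All names are hypothetical.
public class TfInflationDemo {

    // Same formula as the tf factor in Lucene's DefaultSimilarity: sqrt(freq).
    public static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Same formula as the overridden tf: any match counts exactly once.
    public static float flatTf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    public static void main(String[] args) {
        int oneProvider = 1;   // term occurs once in nameNorm
        int twoProviders = 2;  // term occurs twice (two copied values)

        // With the default formula the two-provider record gets a larger factor.
        System.out.println("default tf: " + defaultTf(oneProvider)
                + " vs " + defaultTf(twoProviders));

        // With the flat formula both records get the same factor.
        System.out.println("flat tf:    " + flatTf(oneProvider)
                + " vs " + flatTf(twoProviders));
    }
}
```

Note that this flattens tf for every field that uses the similarity, so
repeats within a single value are ignored as well.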

HIH,
Alex

On Mon, Sep 14, 2009 at 9:50 PM, Aaron McKee <ucbmc...@gmail.com> wrote:
Hello,

Let me preface this by admitting that I'm still fairly new to Lucene and Solr, so I apologize if any of this sounds naive and I'm open to thinking
about my problem differently.

I'm currently responsible for a rather large dataset of business records that I'm trying to build a Lucene/Solr infrastructure around, to replace an in-house solution that we've been using for a few years. These records are sourced from multiple providers and there's often a fair bit of overlap in the business coverage. I have a set of fuzzy correlation libraries that I use to identify these documents and I ultimately create a super-record that includes metadata from each of the providers. Given the nature of things, these providers often have slight variations in wording or spelling in the overlapping fields (it's amazing how many ways people find to refer to the same business or address). I'd like to capture these variations, as they facilitate searching, but TF considerations are currently borking field
scoring here.

For example, taking business names into consideration, I have a Solr schema
similar to:

<field name="name_provider1" type="string" indexed="false" stored="false"
multiValued="true"/>
...
<field name="name_providerN" type="string" indexed="false" stored="false"
multiValued="true"/>
<field name="nameNorm" type="text" indexed="true" stored="false"
multiValued="true" omitNorms="true"/>

<copyField source="name_provider1" dest="nameNorm"/>
...
<copyField source="name_providerN" dest="nameNorm"/>

For any given business record, there may be 1..N business names present in the nameNorm field (some with naming variations, some identical). With TF enabled, however, I'm getting different match scores on this field simply
based on how many providers contributed to the record, which is not
meaningful to me. For example, a record containing <nameNorm>foo
bar<positionIncrementGap>foo bar</nameNorm> is necessarily scoring higher than a record just containing <nameNorm>foo bar</nameNorm>. Although I wouldn't mind TF data being considered within each discrete field value, I need to find a way to prevent score inflation based simply on the number of
contributing providers.

Looking at the mailing list archive and searching around, it sounds like the omitTf boolean in Lucene used to function somewhat in this manner, but has since taken on a broader interpretation (and name) that now also disables positional and payload data. Unfortunately, phrase support for fields like this is absolutely essential. So what's the best way to address a need like this? I guess I don't mind whether this is handled at index time or search
time, but I'm not sure what I may need to override or if there's some
existing provision I should take advantage of.

Thank you for any help you may have.

Best regards,
Aaron

