Re: Disabling tf (term frequency) during indexing and/or scoring

Erik Hatcher Wed, 16 Sep 2009 07:36:06 -0700

Just FYI - you can put Solr plugins in <solr-home>/lib as JAR files rather than messing with solr.war.


        Erik


On Sep 16, 2009, at 10:15 AM, Alexey Serba wrote:

Hi Aaron,

You can override the default Lucene Similarity and disable the tf and
lengthNorm factors in the scoring formula (see
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Similarity.html
and http://lucene.apache.org/java/2_4_1/api/index.html )

You need to

1) Compile the following class and put it into Solr's WEB-INF/classes
-------------------------------------------------------------------------------------------------------------------
package my.pkg;

import org.apache.lucene.search.DefaultSimilarity;

/** Similarity that ignores field length and how often a term repeats. */
public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {

        // Every non-empty field gets the same norm, regardless of its length.
        @Override
        public float lengthNorm(String fieldName, int numTerms) {
                return numTerms > 0 ? 1.0f : 0.0f;
        }

        // A matching term counts once, however many times it occurs.
        @Override
        public float tf(float freq) {
                return freq > 0 ? 1.0f : 0.0f;
        }
}
-------------------------------------------------------------------------------------------------------------------

2) Add <similarity class="my.pkg.NoLengthNormAndTfSimilarity"/>
to your schema.xml
http://wiki.apache.org/solr/SchemaXml#head-e343cad75d2caa52ac6ec53d4cee8296946d70ca
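
To see why flattening tf removes the inflation Aaron describes, here is a
minimal standalone sketch. It does not use Lucene at all: defaultTf is my
assumption of what DefaultSimilarity does (square root of the term
frequency), and the class and method names are made up for illustration.

```java
import java.util.Locale;

// Standalone illustration only; it mirrors the tf factor without Lucene.
final class TfFlatteningSketch {

    // Assumed stock behaviour: tf grows as sqrt(freq).
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Flattened tf: any match counts exactly once.
    static float flatTf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    public static void main(String[] args) {
        // Hypothetical counts: how many providers contributed the same name.
        int[] providerCounts = {1, 2, 4};
        for (int n : providerCounts) {
            System.out.println(String.format(Locale.ROOT,
                    "providers=%d defaultTf=%.3f flatTf=%.3f",
                    n, defaultTf(n), flatTf(n)));
        }
    }
}
```

Under that assumption, a record carrying the same name from two providers
gets a tf factor of sqrt(2), roughly 1.41 times that of a single-provider
record, while the flattened version gives both records 1.0. As far as I
know, lengthNorm is baked into the index when documents are added, so a
change to it should only show up after reindexing, whereas tf is applied
at search time.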

HIH,
Alex
On Mon, Sep 14, 2009 at 9:50 PM, Aaron McKee <ucbmc...@gmail.com> wrote:
Hello,

Let me preface this by admitting that I'm still fairly new to Lucene and
Solr, so I apologize if any of this sounds naive and I'm open to thinking
about my problem differently.

I'm currently responsible for a rather large dataset of business records
that I'm trying to build a Lucene/Solr infrastructure around, to replace an
in-house solution that we've been using for a few years. These records are
sourced from multiple providers and there's often a fair bit of overlap in
the business coverage. I have a set of fuzzy correlation libraries that I
use to identify these documents and I ultimately create a super-record that
includes metadata from each of the providers. Given the nature of things,
these providers often have slight variations in wording or spelling in the
overlapping fields (it's amazing how many ways people find to refer to the
same business or address). I'd like to capture these variations, as they
facilitate searching, but TF considerations are currently borking field
scoring here.

For example, taking business names into consideration, I have a Solr schema
similar to:

<field name="name_provider1" type="string" indexed="false" stored="false"
       multiValued="true"/>
...
<field name="name_providerN" type="string" indexed="false" stored="false"
       multiValued="true"/>
<field name="nameNorm" type="text" indexed="true" stored="false"
       multiValued="true" omitNorms="true"/>

<copyField source="name_provider1" dest="nameNorm"/>
...
<copyField source="name_providerN" dest="nameNorm"/>

For any given business record, there may be 1..N business names present in
the nameNorm field (some with naming variations, some identical). With TF
enabled, however, I'm getting different match scores on this field simply
based on how many providers contributed to the record, which is not
meaningful to me. For example, a record containing
<nameNorm>foo bar<positionIncrementGap>foo bar</nameNorm>
is necessarily scoring higher than a record just containing
<nameNorm>foo bar</nameNorm>. Although I wouldn't mind TF data being
considered within each discrete field value, I need to find a way to
prevent score inflation based simply on the number of contributing
providers.

Looking at the mailing list archive and searching around, it sounds like the
omitTf boolean in Lucene used to function somewhat in this manner, but has
since taken on a broader interpretation (and name) that now also disables
positional and payload data. Unfortunately, phrase support for fields like
this is absolutely essential. So what's the best way to address a need like
this? I guess I don't mind whether this is handled at index time or search
time, but I'm not sure what I may need to override or if there's some
existing provision I should take advantage of.

Thank you for any help you may have.

Best regards,
Aaron
