Re: Please help me with a basic question...

Paul Libbrecht Wed, 18 May 2011 12:35:39 -0700

Richard,

in SOLR at least there's an analyzer that avoids duplicates.
I think that would solve it.
There's also somewhere the option to ignore IDF (in similarity? in solrconfig?).


paul


Le 18 mai 2011 à 21:30, Rich Heimann a écrit :

> Hello all,
> 
> This is my first time on the list and my first question...forgive me it this
> has been hacked out in the past.
> 
> We have set up Lucene/Solr and are getting somewhat spurious results. It
> appears to be a result of heterogeneous document sizes. In other words, the
> top results are sometimes (at least when the user is using typical search
> terms) monopolized by a distinct type of document, which is otherwise small
> (in number of terms). It appears that TF/IDF even with the cosine similarity
> is sensitive to document size. I have run some tests and it in fact does
> appear to be the case.
> 
> (Number of times the term appears in a document)/(Total Number of terms in
> that document) * Log10(Number of total documents/Number of times search term
> appears in all documents)
> 
> Are there any suggestions or best practices to deal with the intrinsic
> heterogeneity in a corpus.
> 
> Thank you,
> Rich


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Please help me with a basic question...

Reply via email to