Richard, in SOLR at least there's an analyzer that avoids duplicates. I think that would solve it. There's also somewhere the option to ignore IDF (in similarity? in solrconfig?).
paul Le 18 mai 2011 à 21:30, Rich Heimann a écrit : > Hello all, > > This is my first time on the list and my first question...forgive me it this > has been hacked out in the past. > > We have set up Lucene/Solr and are getting somewhat spurious results. It > appears to be a result of heterogeneous document sizes. In other words, the > top results are sometimes (at least when the user is using typical search > terms) monopolized by a distinct type of document, which is otherwise small > (in number of terms). It appears that TF/IDF even with the cosine similarity > is sensitive to document size. I have run some tests and it in fact does > appear to be the case. > > (Number of times the term appears in a document)/(Total Number of terms in > that document) * Log10(Number of total documents/Number of times search term > appears in all documents) > > Are there any suggestions or best practices to deal with the intrinsic > heterogeneity in a corpus. > > Thank you, > Rich --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org