Hello all, This is my first time on the list and my first question...forgive me it this has been hacked out in the past.
We have set up Lucene/Solr and are getting somewhat spurious results. It appears to be a result of heterogeneous document sizes. In other words, the top results are sometimes (at least when the user is using typical search terms) monopolized by a distinct type of document, which is otherwise small (in number of terms). It appears that TF/IDF even with the cosine similarity is sensitive to document size. I have run some tests and it in fact does appear to be the case. (Number of times the term appears in a document)/(Total Number of terms in that document) * Log10(Number of total documents/Number of times search term appears in all documents) Are there any suggestions or best practices to deal with the intrinsic heterogeneity in a corpus. Thank you, Rich