... I've read more forum discussions on this issue, and some people point out (as LIA 2nd ed., p. 183, does) that using a filter reduces the number of documents under consideration, which impacts IDF and therefore the overall score. Moreover, the recommendation in those discussions is that, unless a large performance gain can be obtained via CachingWrapperFilter, MUST BooleanClauses are preferred to Filters.
This doesn't quite make sense to me: the number of documents in the collection, the size of the vocabulary, the size of each posting list and the IDF of each term are known after indexing and should not be affected by filtering. To test this, I further modified the same LIA example and compared the use of a BooleanClause and the use of a Filter:

Q = category:/technology/computers/programming/methodology category:/philosophy/eastern +pubmonth:[200501 TO 201012]
----------
Tao Te Ching ???
1.4739084 = (MATCH) product of:
  2.2108626 = (MATCH) sum of:
    1.9717792 = (MATCH) weight(category:/philosophy/eastern in 4), product of:
      0.68659997 = queryWeight(category:/philosophy/eastern), product of:
        2.871802 = idf(docFreq=1, maxDocs=13)
        0.23908332 = queryNorm
      2.871802 = (MATCH) fieldWeight(category:/philosophy/eastern in 4), product of:
        1.0 = tf(termFreq(category:/philosophy/eastern)=1)
        2.871802 = idf(docFreq=1, maxDocs=13)
        1.0 = fieldNorm(field=category, doc=4)
    0.23908332 = (MATCH) ConstantScoreQuery(pubmonth:[200501 TO 201012]), product of:
      1.0 = boost
      0.23908332 = queryNorm
  0.6666667 = coord(2/3)

Q = +(category:/technology/computers/programming/methodology category:/philosophy/eastern) +pubmonth:[200501 TO 201012]
----------
Tao Te Ching ???
1.224973 = (MATCH) sum of:
  0.9858896 = (MATCH) product of:
    1.9717792 = (MATCH) sum of:
      1.9717792 = (MATCH) weight(category:/philosophy/eastern in 4), product of:
        0.68659997 = queryWeight(category:/philosophy/eastern), product of:
          2.871802 = idf(docFreq=1, maxDocs=13)
          0.23908332 = queryNorm
        2.871802 = (MATCH) fieldWeight(category:/philosophy/eastern in 4), product of:
          1.0 = tf(termFreq(category:/philosophy/eastern)=1)
          2.871802 = idf(docFreq=1, maxDocs=13)
          1.0 = fieldNorm(field=category, doc=4)
    0.5 = coord(1/2)
  0.23908332 = (MATCH) ConstantScoreQuery(pubmonth:[200501 TO 201012]), product of:
    1.0 = boost
    0.23908332 = queryNorm

Q = category:/technology/computers/programming/methodology category:/philosophy/eastern
Date = pubmonth:[200501 TO 201112]
----------
Tao Te Ching ???
1.0153353 = (MATCH) product of:
  2.0306706 = (MATCH) sum of:
    2.0306706 = (MATCH) weight(category:/philosophy/eastern in 4), product of:
      0.70710677 = queryWeight(category:/philosophy/eastern), product of:
        2.871802 = idf(docFreq=1, maxDocs=13)
        0.24622406 = queryNorm
      2.871802 = (MATCH) fieldWeight(category:/philosophy/eastern in 4), product of:
        1.0 = tf(termFreq(category:/philosophy/eastern)=1)
        2.871802 = idf(docFreq=1, maxDocs=13)
        1.0 = fieldNorm(field=category, doc=4)
  0.5 = coord(1/2)

Comparing the results, I see that:
- maxDocs and IDF are the same;
- queryNorm and coord can be different. The correct values are the ones obtained when using Filter; BooleanClauses introduce artificial query terms that affect these metrics;
- the BooleanClause also introduces a ConstantScoreQuery that further impacts the "true" score.

I would conclude that from the perspective of obtaining "true" scores, using Filter is preferred to using MUST BooleanClause in a BooleanQuery.

The TF-IDF model (as well as other IR models) was developed for text-like features.
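As a sanity check on the explain output above, the three scores can be reproduced by hand. The sketch below is plain Java with no Lucene dependency. It assumes the DefaultSimilarity formulas as I understand them for the Lucene 3.x line (idf = 1 + ln(numDocs / (docFreq + 1)), queryNorm = 1 / sqrt(sumOfSquaredWeights), coord = overlap / maxOverlap), and it assumes docFreq=1 for the methodology category, which the explain output does not show but which is the value that reproduces the reported queryNorm. The class and method names are made up for illustration.

```java
public class ExplainCheck {
    static final int NUM_DOCS = 13; // maxDocs from the explain output
    static final int DOC_FREQ = 1;  // shown for /philosophy/eastern; ASSUMED for the other category

    // idf as assumed for DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1))
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // queryNorm as assumed for DefaultSimilarity: 1 / sqrt(sumOfSquaredWeights)
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    // Term clause with tf = 1, fieldNorm = 1, boost = 1:
    // queryWeight (idf * queryNorm) times fieldWeight (idf).
    static double termScore(double termIdf, double norm) {
        return (termIdf * norm) * termIdf;
    }

    // Two optional category clauses plus a MUST date clause in one BooleanQuery.
    static double flatBoolean() {
        double termIdf = idf(DOC_FREQ, NUM_DOCS);
        // The constant-score date clause adds boost^2 = 1.0 to the sum.
        double norm = queryNorm(2 * termIdf * termIdf + 1.0);
        return (termScore(termIdf, norm) + norm) * (2.0 / 3.0); // coord(2/3)
    }

    // +(category category) +date: coord(1/2) applies to the inner query only.
    static double nestedBoolean() {
        double termIdf = idf(DOC_FREQ, NUM_DOCS);
        double norm = queryNorm(2 * termIdf * termIdf + 1.0);
        return termScore(termIdf, norm) * (1.0 / 2.0) + norm;
    }

    // Category query with the date range applied as a Filter:
    // the filter contributes nothing to sumOfSquaredWeights or coord.
    static double filtered() {
        double termIdf = idf(DOC_FREQ, NUM_DOCS);
        double norm = queryNorm(2 * termIdf * termIdf);
        return termScore(termIdf, norm) * (1.0 / 2.0); // coord(1/2)
    }

    public static void main(String[] args) {
        System.out.printf("idf           = %.6f (reported 2.871802)%n", idf(DOC_FREQ, NUM_DOCS));
        System.out.printf("flat boolean  = %.6f (reported 1.4739084)%n", flatBoolean());
        System.out.printf("nested        = %.6f (reported 1.224973)%n", nestedBoolean());
        System.out.printf("with filter   = %.6f (reported 1.0153353)%n", filtered());
    }
}
```

If these assumptions hold, the only things that differ between the three variants are the extra 1.0 that the ConstantScoreQuery adds to sumOfSquaredWeights (and hence queryNorm), the constant score itself, and the coord denominator. The idf term is computed from index-wide statistics and is identical in all three.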
The assumptions made in that model do not apply to numeric fields such as date or longitude/latitude, which are the fields typically used for faceted filtering, so the two models should not be mixed in a common query.

Q3. Considering that all the expert opinions I've read in forums speak against filtering, is there something I'm missing?

--
View this message in context: http://lucene.472066.n3.nabble.com/Score-combination-Filtering-vs-Querying-tp3070425p3070439.html
Sent from the Lucene - General mailing list archive at Nabble.com.