I'm hoping that the coord method of the Similarity API can be changed from:

float coord(int overlap, int maxOverlap)

to:

float coord(int overlap, int maxOverlap, int docSize)

where docSize is the number of terms in the document (hit) being evaluated for similarity to the query.

The reason is that many people use Lucene to match documents that are not web pages, and in these cases the query and the document MUST be of similar size.

For example, say your documents are cars, and there are 3 styles of a Volvo wagon:

- "Volvo V70 Wagon" (just the "normal" edition)
- "Volvo V70 Wagon Luxury Edition"
- "Volvo V70 Wagon Luxury Edition Sports Package AWD"

If somebody searches for a longer name, like "Volvo V70 Wagon Luxury Edition Sports Package AWD", then the normal edition "Volvo V70 Wagon" will most likely be excluded, because the coord factor sees only 3 of the 8 query terms matched (3/8).

**However**, in the reverse situation, if somebody wants to search for the normal wagon, "Volvo V70 Wagon", the query will match all 3 of these with the same score.

Nothing available today helps here. Changing lengthNorm to intentionally lower the score of car names as they get longer doesn't make sense: the "Volvo V70 Wagon Luxury Edition Sports Package AWD" is just as much of a car as the "Volvo V70 Wagon". That is why the lengthNorm uses the "SweetSpot" or "Plateau" approach, where anything between 2 and about 10 words is a legitimate length.

So, back to my original request. If coord also received the length of the matching document, it could lower the scores of docs whose length is not similar to that of the original query. Again, when searching for "Volvo V70 Wagon" and the hit "Volvo V70 Wagon Luxury Edition Sports Package AWD" is analyzed, coord would tell me that the hit has 8 terms versus the 3 that I'm looking for, and then I could apply any algorithm I want to reduce the hit score (in this case, most likely returning 3/8). However, if your application does consider those hits all the same, then you could leave the current implementation as is and return 1. Hopefully this makes sense.
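To make the request concrete, here is a rough sketch of the kind of scoring I have in mind. It is standalone Java and does not use Lucene: the class name ProposedCoord and the three-argument coord are hypothetical (this overload does not exist in Lucene today), the two-argument version is only my assumption of the overlap/maxOverlap ratio described above, and multiplying the two ratios together is just one possible penalty.

```java
public final class ProposedCoord {

    // Assumed current behavior: fraction of the query terms that matched.
    public static float coord(int overlap, int maxOverlap) {
        return maxOverlap == 0 ? 0f : (float) overlap / maxOverlap;
    }

    // Hypothetical overload: also penalize documents that contain many
    // terms the query did not ask for. docSize is the number of terms
    // in the document being scored.
    public static float coord(int overlap, int maxOverlap, int docSize) {
        float queryCoverage = coord(overlap, maxOverlap);
        float docCoverage = docSize == 0
                ? 0f
                : Math.min(1f, (float) overlap / docSize);
        return queryCoverage * docCoverage;
    }

    public static void main(String[] args) {
        // Query "Volvo V70 Wagon" (3 terms) against the three example cars.
        System.out.println(coord(3, 3, 3)); // "Volvo V70 Wagon"
        System.out.println(coord(3, 3, 5)); // "... Luxury Edition"
        System.out.println(coord(3, 3, 8)); // "... Luxury Edition Sports Package AWD"
    }
}
```

With this sketch, the 3-term query keeps a full score on the 3-term car and drops to 3/8 on the 8-term car, and the reverse case (8-term query, 3-term car) also comes out at 3/8. An application that treats all of those hits as equal would simply ignore docSize.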
I'm (sort of) aware that I could code this up myself with a custom query and scorer class, but I think it warrants being added to the abstract Similarity class. I'm not a pro on Lucene, so I could be missing something. Thank you for reading.

Sincerely, John