pminkov commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1145379881
I created a branch with some analysis of what happens; it's [here](https://github.com/pminkov/lucene/commit/25c5ea4c12d92b8f534d40e449509a327ab6eea9). The code is a bit hacky, sorry.

**Dataset**

I used one of the MongoDB Atlas sample datasets, [mflix](https://www.mongodb.com/docs/atlas/sample-data/sample-mflix/). It has a collection of ~20k movies, and I dumped their plot descriptions into the `plots.txt` file (it's in the branch).

```commandline
$ cat ./plots.txt | wc -l
23531
```

A sample of the file is [here](https://gist.github.com/pminkov/c040e96835501bb2bfa34d029c5fa0d9).

**Test**

I sorted the documents by length and cleaned up punctuation, then indexed them. The documents with the lowest document ids are the longest. Next, I picked 15 documents and created an MLT query from each one. Here are the terms selected for each document: https://gist.github.com/pminkov/1432b04f794b97d1fc042ffc1ac0dce2

As you can see, without the fix the code selects many more stopword-like words, and this is more visible with longer documents. I believe this happens because stop words appear many times in a long document, and if the frequency is not damped with a square root (`similarity.tf()`), they tend to bubble up to the top of the priority queue. On shorter documents there's not much visible difference.

Let me know if I should elaborate on any of this or look into something additional.
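
To make the square-root damping argument concrete, here is a minimal Python sketch. It is not the Lucene code or the code in the branch: the term counts are made up for illustration, the helper names (`idf`, `score`, `top_terms`) are hypothetical, and the idf formula `1 + ln((N + 1) / (df + 1))` is an assumed classic TF-IDF shape. Only the corpus size (23531) comes from the comment.

```python
import heapq
import math

NUM_DOCS = 23531  # number of plot descriptions reported in the comment

# Hypothetical statistics for one long document:
# term -> (frequency in the document, number of documents containing the term)
TERM_STATS = {
    "the": (40, 20000),
    "his": (12, 9000),
    "detective": (3, 300),
    "heist": (2, 60),
}


def idf(doc_freq, num_docs=NUM_DOCS):
    """Assumed classic-style inverse document frequency."""
    return 1.0 + math.log((num_docs + 1) / (doc_freq + 1))


def score(term_freq, doc_freq, damp):
    """Score a term as tf * idf, optionally damping tf with a square root."""
    tf = math.sqrt(term_freq) if damp else float(term_freq)
    return tf * idf(doc_freq)


def top_terms(stats, k, damp):
    """Return the k highest-scoring terms, best first (a stand-in for the priority queue)."""
    return heapq.nlargest(k, stats, key=lambda term: score(*stats[term], damp))
```

With these made-up counts, `top_terms(TERM_STATS, 2, damp=False)` picks the two stopword-like terms, while `damp=True` picks the two rare content terms. The raw frequency of 40 outweighs a low idf, but its square root (about 6.3) does not.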