pminkov commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1145379881

   I created a branch with some analysis of what happens. It's 
[here](https://github.com/pminkov/lucene/commit/25c5ea4c12d92b8f534d40e449509a327ab6eea9).
The code is a bit hacky, sorry.
   
   **Dataset**
   
   I used one of the MongoDB Atlas datasets - 
[mflix](https://www.mongodb.com/docs/atlas/sample-data/sample-mflix/). This 
dataset has a collection with ~20k movies and I dumped their plot descriptions 
into the plots.txt file (it's in the branch).
   
   ```commandline
   $ cat ./plots.txt | wc -l
      23531
   ```   
   
   A sample of the file is 
[here](https://gist.github.com/pminkov/c040e96835501bb2bfa34d029c5fa0d9).
    
   **Test**
   
   I sorted the documents by length and cleaned up punctuation, then indexed 
them. Documents with lower document IDs are the longest.
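
As a rough illustration of that preparation step, here is a minimal Python sketch. It is not the code from the branch. It assumes one plot per line, measures length in characters, and replaces punctuation with spaces; the `prepare` helper is a hypothetical name.

```python
import string

# Map every ASCII punctuation character to a space (an assumption about
# what "cleaned up punctuation" means; the branch may do it differently).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def prepare(lines):
    """Strip punctuation, collapse whitespace, and sort longest-first."""
    cleaned = []
    for line in lines:
        text = " ".join(line.translate(_PUNCT_TO_SPACE).split())
        if text:  # skip lines that were empty or punctuation only
            cleaned.append(text)
    # Longest documents first, so they would get the lowest document IDs
    # when indexed in this order.
    return sorted(cleaned, key=len, reverse=True)


docs = prepare(["A cop, a robber.", "Hi!", "A long, long time ago..."])
for doc_id, text in enumerate(docs):
    print(doc_id, len(text), text)
```

Indexing the returned list in order would give the longest plots the lowest IDs, matching the setup described above.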
   
   Next, I picked 15 documents and created an MLT (`MoreLikeThis`) query from each one.
   
   Here are the terms that are selected for each document: 
https://gist.github.com/pminkov/1432b04f794b97d1fc042ffc1ac0dce2
   
   As you can see, without the fix the code selects many more stopword-like 
words, and the effect is more visible on longer documents. I believe this 
happens because stop words appear many times in a long document, and if the 
term frequency is not damped with a square root (`similarity.tf()`), they tend 
to bubble up to the top of the priority queue. On shorter documents there's not 
much visible difference.
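
To make the damping effect concrete, here is a toy Python sketch with made-up term statistics. It is not Lucene code. It assumes a classic TF-IDF style score, `tf_part * idf` with `idf = ln((N + 1) / (df + 1)) + 1`, and compares raw term frequency against its square root. Only the document count (23531) comes from the line count above; the per-term numbers and the `score` and `top_terms` helpers are hypothetical.

```python
import math

NUM_DOCS = 23531  # line count of plots.txt reported above

# term -> (frequency in the source document, number of documents containing it)
# These values are invented purely for illustration.
TOY_STATS = {
    "the": (30, 20000),
    "with": (12, 12000),
    "heist": (3, 150),
    "detective": (2, 400),
}


def score(tf, doc_freq, num_docs, damp):
    """TF-IDF style score; `damp` switches between sqrt(tf) and raw tf."""
    idf = math.log((num_docs + 1) / (doc_freq + 1)) + 1
    tf_part = math.sqrt(tf) if damp else tf
    return tf_part * idf


def top_terms(stats, num_docs, k, damp):
    """Return the k highest-scoring terms, best first."""
    ranked = sorted(
        stats,
        key=lambda term: score(stats[term][0], stats[term][1], num_docs, damp),
        reverse=True,
    )
    return ranked[:k]


print("raw tf:   ", top_terms(TOY_STATS, NUM_DOCS, 2, damp=False))
print("sqrt(tf): ", top_terms(TOY_STATS, NUM_DOCS, 2, damp=True))
```

With these invented numbers, the raw-frequency ranking is led by the two stopword-like terms, while the damped ranking is led by the two rarer terms. On a short document the frequencies are all small, so the square root changes little, which fits the observation above.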
   
   Let me know if I should elaborate more on any of this or look into something 
additional.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

