[GitHub] [lucene] pminkov commented on pull request #940: Use similarity.tf() in MoreLikeThis

GitBox Thu, 02 Jun 2022 14:53:06 -0700


pminkov commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1145379881

I created a branch with some analysis of what happens; it's
[here](https://github.com/pminkov/lucene/commit/25c5ea4c12d92b8f534d40e449509a327ab6eea9).
The code is a bit hacky, sorry.

**Dataset**

I used one of the MongoDB Atlas datasets -
[mflix](https://www.mongodb.com/docs/atlas/sample-data/sample-mflix/). This
dataset has a collection with ~20k movies and I dumped their plot descriptions
into the plots.txt file (it's in the branch).

```commandline
$ cat ./plots.txt | wc -l
23531
```

A sample of the file is
[here](https://gist.github.com/pminkov/c040e96835501bb2bfa34d029c5fa0d9).

**Test**

I sorted the documents by length and cleaned up punctuation, then indexed
them. As a result, the documents with the lowest document ids are the longest.
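
For reference, the preparation step can be sketched roughly like this. This is not the code from the branch: the class and method names (`PlotPrep`, `stripPunctuation`, `prepare`) are hypothetical, "length" is assumed to mean character count, and only ASCII punctuation is stripped.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the preprocessing described above; not the branch code.
public class PlotPrep {

    // Replace runs of ASCII punctuation with a space, then collapse whitespace.
    static String stripPunctuation(String plot) {
        return plot.replaceAll("\\p{Punct}+", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    // Clean every plot, then sort longest first (assumption: length = characters).
    // Indexing in this order is what would give the longest plots the lowest ids.
    static List<String> prepare(List<String> plots) {
        return plots.stream()
                .map(PlotPrep::stripPunctuation)
                .sorted(Comparator.<String>comparingInt(String::length).reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sample = List.of("A b.", "A much longer plot, here!");
        prepare(sample).forEach(System.out::println);
    }
}
```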

Next, I picked 15 documents and created an MLT query from each one.

Here are the terms that are selected for each document:
https://gist.github.com/pminkov/1432b04f794b97d1fc042ffc1ac0dce2

As you can see, without the fix the code selects many more stopword-like
words, and the effect is more visible on longer documents. I believe this
happens because stop words appear many times in a long document, and if the
frequency is not damped with a square root (`similarity.tf()`), they tend to
bubble up to the top of the priority queue. On shorter documents there's not
much visible difference.
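
To make the effect concrete, here is a small standalone sketch. It does not call Lucene; it assumes a `tf * idf` term score with ClassicSimilarity-style formulas (`tf = sqrt(freq)`, `idf = 1 + ln((docCount + 1) / (docFreq + 1))`). The term frequencies and document frequencies are made-up numbers for illustration, not values measured on the dataset; only the document count comes from `plots.txt`.

```java
// Hypothetical illustration of raw vs. square-root-damped term frequency.
public class TfDampingSketch {

    // ClassicSimilarity-style idf: 1 + ln((docCount + 1) / (docFreq + 1)).
    static double idf(long docFreq, long docCount) {
        return 1.0 + Math.log((docCount + 1.0) / (docFreq + 1.0));
    }

    // Score with the raw in-document frequency (the behaviour without the fix).
    static double rawScore(int freq, long docFreq, long docCount) {
        return freq * idf(docFreq, docCount);
    }

    // Score with the frequency damped by a square root (the behaviour with the fix).
    static double dampedScore(int freq, long docFreq, long docCount) {
        return Math.sqrt(freq) * idf(docFreq, docCount);
    }

    public static void main(String[] args) {
        long docCount = 23531; // line count of plots.txt

        // Made-up numbers: a stopword-like term that occurs 40 times in a long
        // document and appears in 20000 documents, versus a distinctive term
        // that occurs 3 times and appears in 50 documents.
        int stopFreq = 40;
        long stopDocFreq = 20000;
        int rareFreq = 3;
        long rareDocFreq = 50;

        System.out.printf("raw:    stopword-like=%.2f distinctive=%.2f%n",
                rawScore(stopFreq, stopDocFreq, docCount),
                rawScore(rareFreq, rareDocFreq, docCount));
        System.out.printf("damped: stopword-like=%.2f distinctive=%.2f%n",
                dampedScore(stopFreq, stopDocFreq, docCount),
                dampedScore(rareFreq, rareDocFreq, docCount));
    }
}
```

With these illustrative inputs the stopword-like term outscores the distinctive term when the raw frequency is used, and the order flips once the frequency is damped. In a short document no term reaches a high count, so the two scores stay close.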

Let me know if I should elaborate more on any of this or look into something
additional.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
