# Re: How to pick out the same titles.

```
On 16/10/16 16:16, Seymore4Head wrote:
> How to pick out the same titles.
>
> I have a  long text file that has movie titles in it and I would like
> to find dupes.
>
> The thing is that sometimes I have one called "The Killing Fields" and
> it also could be listed as "Killing Fields"  Sometimes the title will
> have the date a year off.
>
> What I would like to do is output to another file that shows those two
> as a match.
>
> I don't know the best way to tackle this.  I would think you would
> have to pair the titles with the most consecutive letters in a row.
>
> Anyone want this as a practice exercise?  I don't really use
> programming enough to remember how.
> ```
```
Tokenize the titles, compute a set similarity score for each pair of
token sets, and cluster on those scores.

>>> import tokenization
>>> bigrams1 = tokenization.n_grams("The Killing Fields".lower(), 2, pad=True)
>>> bigrams1
['_t', 'th', 'he', 'e ', ' k', 'ki', 'il', 'll', 'li', 'in', 'ng', 'g ',
' f', 'fi', 'ie', 'el', 'ld', 'ds', 's_']
>>> bigrams2 = tokenization.n_grams("Killing Fields".lower(), 2, pad=True)
>>> import pseudo
>>> pseudo.Jaccard(bigrams1, bigrams2)
0.7

You could probably just generate token sets, then iterate through all
title pairs and manually review those with similarity scores above a
suitable threshold. The code I used above is very simple (and pasted below).

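def n_grams(s, n, pad=False):
    # n >= 1
    # returns a list of n-grams
    # or an empty list if n > len(s)
    # pad=True adds n-1 '_' characters to each end (default value assumed)
    if pad:
        s = '_' * (n-1) + s + '_' * (n-1)
    return [s[i:i+n] for i in range(len(s)-n+1)]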

def Jaccard(tokens1, tokens2):
    # returns exact Jaccard
    # similarity measure for
    # two token sets
    tokens1 = set(tokens1)
    tokens2 = set(tokens2)
    return len(tokens1&tokens2) / len(tokens1|tokens2)
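
The pairwise pass suggested above (score every pair of titles, keep the
ones above a threshold for manual review) might look roughly like the
sketch below. It repeats the two functions so that it runs on its own.
The helper name similar_pairs, the 0.6 threshold and the sample titles
are assumptions made for illustration, not values from the original
code; the threshold would need tuning against real data.

```python
from itertools import combinations


def n_grams(s, n, pad=True):
    # Character n-grams; padding marks the start and end of the string.
    if pad:
        s = '_' * (n - 1) + s + '_' * (n - 1)
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def jaccard(tokens1, tokens2):
    # Jaccard similarity of two token collections, treated as sets.
    # Returns 0.0 when both are empty, to avoid dividing by zero.
    tokens1, tokens2 = set(tokens1), set(tokens2)
    union = tokens1 | tokens2
    return len(tokens1 & tokens2) / len(union) if union else 0.0


def similar_pairs(titles, threshold=0.6, n=2):
    # Hypothetical helper: yield (score, title_a, title_b) for every
    # pair of titles whose n-gram Jaccard score reaches the threshold.
    token_sets = [(t, set(n_grams(t.lower(), n))) for t in titles]
    for (a, set_a), (b, set_b) in combinations(token_sets, 2):
        score = jaccard(set_a, set_b)
        if score >= threshold:
            yield score, a, b


# Made-up sample data; in practice the titles would be read from the file.
titles = ["The Killing Fields", "Killing Fields", "Alien"]
for score, a, b in similar_pairs(titles):
    print(f"{score:.2f}\t{a}\t{b}")
```

Comparing every pair is quadratic in the number of titles, which should
be fine for a few thousand lines. Writing the printed lines to a second
file instead gives the list of candidate matches to review by hand.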

Duncan

--
https://mail.python.org/mailman/listinfo/python-list
```