Re: Duplicate matching

Greg Keogh Fri, 28 Nov 2014 17:50:34 -0800

Hi Stephen, I wrote a utility in Framework 1.0 that finds duplicate files
by content (builds a dictionary of checksums). In this case the files with
"similar" names might be the same recording at different bitrates, making
them binary different. So it's a bit fuzzy what I'm looking for. Off the
cuff I thought this might be a non-trivial algorithm, like a soundex for
whole file names and there wouldn't be an existing tool for the exact
purpose (I hope I'm wrong!) -- *GK*
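
For reference, the checksum approach mentioned above might look roughly like the sketch below. It is not the original utility: the class name `ContentDuplicates`, the `readBytes` delegate and the choice of SHA-256 are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

public static class ContentDuplicates
{
    // Hypothetical helper: hex-encoded SHA-256 checksum of a file's bytes.
    public static string Checksum(byte[] content)
    {
        using (SHA256 sha = SHA256.Create())
        {
            return BitConverter.ToString(sha.ComputeHash(content));
        }
    }

    // Builds a dictionary of checksum -> paths and returns only the groups
    // that contain more than one path (files with identical content).
    public static List<List<string>> FindDuplicates(
        IEnumerable<string> paths, Func<string, byte[]> readBytes)
    {
        var byChecksum = new Dictionary<string, List<string>>();
        foreach (string path in paths)
        {
            string key = Checksum(readBytes(path));
            if (!byChecksum.TryGetValue(key, out List<string> group))
            {
                group = new List<string>();
                byChecksum[key] = group;
            }
            group.Add(path);
        }
        return byChecksum.Values.Where(g => g.Count > 1).ToList();
    }
}
```

On a real folder this could be called as `ContentDuplicates.FindDuplicates(Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories), File.ReadAllBytes)`, where `root` is a placeholder path. As noted above, two encodings of the same recording at different bitrates produce different checksums, so this only catches byte-identical copies.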


On 29 November 2014 at 12:29, Stephen Price <[email protected]>
wrote:

> Am curious, is the idea of the exercise to write your own code to solve
> the problem, or to solve the problem? I've used Treesize pro to find file
> duplicates in the past. Also have used Directory Opus to find duplicates.
> Great for finding identical files with different names. Probably won't help
> if the songs are the same song but from a different source. Your file name
> pattern matching code would be the way to go. (Which is also the case if
> this is a programming exercise :)
>
> Maybe I'm a lazy coder, I usually look for someone else's product/code
> before writing my own. I can see the benefit of writing your own too.
>
> On Sat, Nov 29, 2014 at 8:55 AM, Greg Keogh <[email protected]> wrote:
>
>> Thanks Greg H, the "weighting" is a very interesting idea. I'm running
>> some simple experiments now with a word list and an inverted list of file
>> names, just to help me picture the problem in my head. The problem with a
>> weighting comparison is that I don't know what to compare with what:
>> comparing each of 20,000 file names with every other one might run into the
>> next ice age. However, I like the weighting idea, so I might finish up with a
>> hybrid algorithm. I'll let you know if anything interesting arises out of
>> this -- *Greg K*
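
One way the word list and inverted list described above could cut down the number of comparisons is sketched here. This is an assumption about how such an index might be used, not the experiment actually run: `CandidatePairs`, `Words`, `Find` and the `maxNamesPerWord` cut-off are hypothetical names. Each name is only paired with names that share at least one whole word, and words that occur in too many names are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class CandidatePairs
{
    // Distinct upper-case alphanumeric words of a file name, extension removed.
    public static IEnumerable<string> Words(string fileName)
    {
        string stem = Path.GetFileNameWithoutExtension(fileName).ToUpperInvariant();
        return Regex.Matches(stem, "[A-Z0-9]+").Cast<Match>().Select(m => m.Value).Distinct();
    }

    // Returns index pairs (lower, higher) of names that share at least one word.
    public static HashSet<(int, int)> Find(IReadOnlyList<string> names, int maxNamesPerWord)
    {
        // Inverted list: word -> indexes of the names containing it (ascending).
        var index = new Dictionary<string, List<int>>();
        for (int i = 0; i < names.Count; i++)
        {
            foreach (string word in Words(names[i]))
            {
                if (!index.TryGetValue(word, out List<int> posting))
                {
                    posting = new List<int>();
                    index[word] = posting;
                }
                posting.Add(i);
            }
        }

        var pairs = new HashSet<(int, int)>();
        foreach (List<int> posting in index.Values)
        {
            // Very common words would pair almost everything, so skip them.
            if (posting.Count > maxNamesPerWord) continue;
            for (int a = 0; a < posting.Count; a++)
                for (int b = a + 1; b < posting.Count; b++)
                    pairs.Add((posting[a], posting[b]));
        }
        return pairs;
    }
}
```

Only the pairs returned would then need the expensive weighted comparison. A limitation of this sketch is that it matches whole words exactly, so two names whose only common word is misspelt in one of them would not be paired.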
>>
>> On 29 November 2014 at 11:17, Greg Harris <[email protected]> wrote:
>>
>>> Hi Greg,
>>>
>>>
>>> I should look at my code before I write comments from memory...
>>>
>>> The result is a *double* value being the sum of:
>>>
>>>    - the number of times the same letter appears in both strings
>>>    - 10 times the number of times the same two letters appear in both
>>>    strings
>>>    - 100 times the number of times the same three letters appear in
>>>    both strings
>>>
>>> *Which is then divided by the sum of the lengths of the two strings to
>>> sort of “normalise” the result.*
>>>
>>> Mixed case is ignored, only compares letters A-Z and 0-9, everything
>>> else is excluded.
>>>
>>> I added a unit test for Greg to better show the results; it follows…
>>>
>>>
>>> Regards
>>>
>>> Greg Harris
>>>
>>>
>>>
>>>     [TestMethod] public void Test_10_Compare3_ForGregKeogh()
>>>
>>>     {
>>>
>>>       //                       123456789-123456789-123456789-123456789-12345
>>>
>>>       string lTestLine1     = "Lovelock - Trumpet Concerto (SSO Concert).mp3";
>>>
>>>       string lTestLine2     = "Trumpet Concerto (William Lovelock).mp3";
>>>
>>>       double lExpected      = 3033/(36.0 + 33.0); // = 43.96
>>>
>>>       double lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );
>>>
>>>       Assert.AreEqual<double>( lExpected, lResult );
>>>
>>>
>>>
>>>       // This is an example of exactly the same string, so will get the
>>>       // best possible match
>>>
>>>       lTestLine1     = "Lovelock - Trumpet Concerto (SSO Concert).mp3";
>>>
>>>       lTestLine2     = "Lovelock - Trumpet Concerto (SSO Concert).mp3";
>>>
>>>       lExpected      = 5256/(36.0 + 36.0); // = 73.0
>>>
>>>       lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );
>>>
>>>       Assert.AreEqual<double>( lExpected, lResult );
>>>
>>>
>>>
>>>       // This is an example of exactly the same string, with a case
>>>       // difference, which is ignored,
>>>       // so will also get the best possible match
>>>
>>>       lTestLine1     = "Lovelock - Trumpet Concerto (SSO Concert).mp3";
>>>
>>>       lTestLine2     = "LOVELOCK - TRUMPET CONCERTO (SSO CONCERT).mp3";
>>>
>>>       lExpected      = 5256/(36.0 + 36.0); // = 73.0
>>>
>>>       lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );
>>>
>>>       Assert.AreEqual<double>( lExpected, lResult );
>>>
>>>
>>>
>>>       // This is an example of a spelling/typing mistake, so will get a
>>>       // very good match
>>>
>>>       lTestLine1     = "Lovelock - Trumpet Concerto (SSO Concert).mp3";
>>>
>>>       lTestLine2     = "Lovelock - Trumpet Concerto (SoSo Concert).mp3";
>>>
>>>       lExpected      = 5272/(36.0 + 37.0); // = 72.2
>>>
>>>       lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );
>>>
>>>       Assert.AreEqual<double>( lExpected, lResult );
>>>
>>>
>>>
>>>       // This is an example of a truncation, so will get a poor match
>>>
>>>       lTestLine1     = "Lovelock - Trumpet Concerto (SSO Concert).mp3";
>>>
>>>       lTestLine2     = "Lovelock - Trumpet Concerto.mp3";
>>>
>>>       lExpected      = 3237/(36.0 + 26.0); // = 52.2
>>>
>>>       lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );
>>>
>>>       Assert.AreEqual<double>( lExpected, lResult );
>>>
>>>
>>>
>>>       // This will get a match on William and a little else...
>>>
>>>       lTestLine1     = "Trumpet Concerto (William Lovelock).mp3";
>>>
>>>       lTestLine2     = "The Complete Works of William Shakespeare.txt";
>>>
>>>       lExpected      = 1202/(33.0 + 39.0); // = 16.69
>>>
>>>       lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );
>>>
>>>       Assert.AreEqual<double>( lExpected, lResult );
>>>
>>>
>>>
>>>       // This will get a match on each of the letters, but no double
>>>       // letters
>>>
>>>       lTestLine1     = "QWERTY";
>>>
>>>       lTestLine2     = "ytrewq";
>>>
>>>       lExpected      = 6/(6.0 + 6.0); // = 0.5
>>>
>>>       lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );
>>>
>>>       Assert.AreEqual<double>( lExpected, lResult );
>>>
>>>
>>>
>>>       // This will get a match on nothing
>>>
>>>       lTestLine1     = "QWERTY";
>>>
>>>       lTestLine2     = "ASDFGHJKL";
>>>
>>>       lExpected      = 0/(6.0 + 9.0); // = 0.0
>>>
>>>       lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );
>>>
>>>       Assert.AreEqual<double>( lExpected, lResult );
>>>
>>>     }
>>>
>>>
>>>
>>> On Sat, Nov 29, 2014 at 10:16 AM, Greg Harris <
>>> [email protected]> wrote:
>>>
>>>> Hi Greg,
>>>>
>>>> Please find following what I have used in the past.
>>>> It is very expensive, but I can not see a better way of doing it.
>>>> It returns an integer which is the sum of:
>>>>
>>>>    - number of times the same letter appears in both strings
>>>>    - 10 times the number of times the same two letters appears in both
>>>>    strings
>>>>    - 100 times the number of times the same three letters appears in
>>>>    both strings
>>>>
>>>> Once you get your results, sort them; the most similar strings will
>>>> have the highest results.
>>>> I used this many years ago and have not used it since.
>>>> There may be (far) better ways to do this.
>>>>
>>>> Regards
>>>> Greg Harris
>>>>
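>>>>     // Extension methods: place these inside a static class, with
>>>>     // "using System.Text;" for StringBuilder.
>>>>     public static string   CleanStr               ( this string aText )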
>>>>
>>>>     {
>>>>
>>>>       int           diff   = 'A' - 'a';
>>>>
>>>>       StringBuilder result = new StringBuilder();
>>>>
>>>>       foreach ( char ch in aText )
>>>>
>>>>       {
>>>>
>>>>         if (    ( ch >= '0' && ch <= '9' )
>>>>
>>>>              || ( ch >= 'A' && ch <= 'Z' ) )
>>>>
>>>>         {
>>>>
>>>>           result.Append(ch);
>>>>
>>>>         }
>>>>
>>>>         else
>>>>
>>>>         {
>>>>
>>>>           if ( ch >= 'a' && ch <= 'z' )
>>>>
>>>>           {
>>>>
>>>>             result.Append((char)(ch+diff));
>>>>
>>>>           }
>>>>
>>>>
>>>>
>>>>         }
>>>>
>>>>       }
>>>>
>>>>       return result.ToString();
>>>>
>>>>     }
>>>>
>>>>     /// <summary>
>>>>
>>>>     /// Do a sounds like compare, the higher the result, the more the
>>>>     /// words/phrases sound the same
>>>>
>>>>     /// </summary>
>>>>
>>>>     /// <param name="aStr1">First word / phrase</param>
>>>>
>>>>     /// <param name="aStr2">Second word / phrase</param>
>>>>
>>>>     /// <returns>Score</returns>
>>>>
>>>>     public static double   CompareSoundsLike     ( this string aStr1, string aStr2 )
>>>>
>>>>     {
>>>>
>>>>       aStr1 = aStr1.CleanStr();
>>>>
>>>>       aStr2 = aStr2.CleanStr();
>>>>
>>>>       double result = 0;
>>>>
>>>>       for (int i = 0; i < aStr1.Length; i++)
>>>>
>>>>       {
>>>>
>>>>         char outerChar = aStr1[i];
>>>>
>>>>         for (int j = 0; j < aStr2.Length; j++)
>>>>
>>>>         {
>>>>
>>>>           char innerChar = aStr2[j];
>>>>
>>>>           if ( outerChar == innerChar )
>>>>
>>>>           {
>>>>
>>>>             result++;
>>>>
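>>>>             // Bonus when the next characters also match (a shared pair)
>>>>             if ( ( i < aStr1.Length-1 ) && ( j < aStr2.Length-1 ) && ( aStr1[i+1] == aStr2[j+1] ) ) result += 10 ;
>>>>
>>>>             // Bonus when the characters two positions on match; the middle character is not re-checked here
>>>>             if ( ( i < aStr1.Length-2 ) && ( j < aStr2.Length-2 ) && ( aStr1[i+2] == aStr2[j+2] ) ) result += 100;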
>>>>
>>>>           }
>>>>
>>>>         }
>>>>
>>>>       }
>>>>
>>>>       return result / ( aStr1.Length + aStr2.Length );
>>>>
>>>>     }
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>     [TestMethod] public void Test_10_Compare1()
>>>>
>>>>     {
>>>>
>>>>       //                       123456
>>>>
>>>>       string lTestLine1     = "qwerty";
>>>>
>>>>       string lTestLine2     = "QWERTY";
>>>>
>>>>       double lExpected      = 456.0/(6.0 + 6.0); // 6 + 5*10 + 4*100 = 456
>>>>
>>>>       double lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );
>>>>
>>>>       Assert.AreEqual<double>( lExpected, lResult );
>>>>
>>>>     }
>>>>
>>>>     [TestMethod] public void Test_10_Compare2()
>>>>
>>>>     {
>>>>
>>>>       //                       123456789-123456789-123456789-123456789-12xxx
>>>>
>>>>       string lTestLine1     = "The quick brown fox jumped over the !@#$ dog!";
>>>>
>>>>       string lTestLine2     = "T H E  -  Q U I C K  -  B R O W N  -  F O X  -  J U M P E D  -  O V E R  -  T H E  -  D O G";
>>>>
>>>>       double lExpected      = 3856.0/(32.0 + 32.0);
>>>>
>>>>       double lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );
>>>>
>>>>       Assert.AreEqual<double>( lExpected, lResult );
>>>>
>>>>     }
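
The "score every pair, then sort" step suggested in the message above could be sketched as follows. The names `PairRanking`, `Rank` and `minScore` are hypothetical, and the scoring function is passed in as a delegate so that the block stands on its own; with the code above it would be `(a, b) => a.CompareSoundsLike(b)`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PairRanking
{
    // Scores every unordered pair of names, keeps the pairs scoring at least
    // minScore, and returns them with the highest score first.
    public static List<(string First, string Second, double Score)> Rank(
        IReadOnlyList<string> names, Func<string, string, double> score, double minScore)
    {
        var results = new List<(string First, string Second, double Score)>();
        for (int i = 0; i < names.Count; i++)
        {
            for (int j = i + 1; j < names.Count; j++)
            {
                double s = score(names[i], names[j]);
                if (s >= minScore)
                {
                    results.Add((names[i], names[j], s));
                }
            }
        }
        return results.OrderByDescending(r => r.Score).ToList();
    }
}
```

For 20,000 names this loop visits 20,000 × 19,999 / 2 = 199,990,000 pairs, so the `minScore` filter is there to keep the result list small; a suitable threshold is a guess that would need tuning against real file names.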
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Nov 29, 2014 at 9:46 AM, Greg Keogh <[email protected]> wrote:
>>>>
>>>>> Folks, I was about to write some utility code to search through my
>>>>> 20,000 audio files looking for probable duplicates. I say "probable"
>>>>> because I found file names like these:
>>>>>
>>>>> Lovelock - Trumpet Concerto (SSO Concert).mp3
>>>>> Trumpet Concerto (William Lovelock).mp3
>>>>>
>>>>> There are many other duplicates with rearranged, abbreviated or
>>>>> misspelt words in the names. I was about to click "New Project" and start
>>>>> typing but I suddenly realised I had no idea what algorithm to use to find
>>>>> probable duplicates and rate them. Has anyone done this sort of thing
>>>>> before or know where to find a description of a suitable algorithm?
>>>>>
>>>>> *Greg K*
>>>>>
>>>>
>>>>
>>>
>>
>
