RE: Duplicate matching

ILT (O) Fri, 28 Nov 2014 17:55:57 -0800

Yes, I use Treesize (Professional) when I need to discover files on disks. I’ve 
had to do it remotely using TeamViewer – hence the Pro version – but a free 
version and also a trial of the Pro version are available as I recall. It’s 
worth a try.


But I’m interested in the algorithm and the code, since it might be useful 
within a program of mine and also in a personal scenario similar to Greg K’s. 

 

  _____  

Ian Thomas
Albert Park, Victoria

 

From: ozdotnet-boun...@ozdotnet.com [mailto:ozdotnet-boun...@ozdotnet.com] On 
Behalf Of Stephen Price
Sent: Saturday, November 29, 2014 12:30 PM
To: ozDotNet
Subject: Re: Duplicate matching

 

Am curious, is the idea of the exercise to write your own code to solve the 
problem, or to solve the problem? I've used Treesize pro to find file 
duplicates in the past. Also have used Directory Opus to find duplicates. Great 
for finding identical files with different names. Probably won't help if the 
songs are the same song but from a different source. Your file name pattern 
matching code would be the way to go. (Which is also the case if this is a 
programming exercise :)
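
For the "identical files with different names" case, a minimal sketch of one common approach is to group files by a hash of their content. This is an assumption about how such a check could be written, not a description of how TreeSize or Directory Opus work internally; the names `ExactDuplicates`, `HashOf` and `Find` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class ExactDuplicates
{
    // Returns the SHA-256 digest of a stream's content as a hex string.
    public static string HashOf(Stream content)
    {
        using (var sha = SHA256.Create())
        {
            return BitConverter.ToString(sha.ComputeHash(content));
        }
    }

    // Groups paths whose content hashes are equal.
    // Only groups with two or more files are returned.
    public static List<List<string>> Find(IEnumerable<string> paths)
    {
        return paths
            .GroupBy(p => { using (var s = File.OpenRead(p)) return HashOf(s); })
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList())
            .ToList();
    }
}
```

This only catches byte-for-byte copies. Two rips of the same song from different sources will hash differently, which is where the file name matching discussed in this thread comes in.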

 

Maybe I'm a lazy coder, but I usually look for someone else's product/code before 
writing my own. I can see the benefit of writing your own too.

  
 

 

On Sat, Nov 29, 2014 at 8:55 AM, Greg Keogh <g...@mira.net> wrote:

Thanks Greg H, the "weighting" is a very interesting idea. I'm running some 
simple experiments now with a word list and an inverted list of file names, 
just to help me picture the problem in my head. The problem with a weighting 
comparison is that I don't know what to compare with what, comparing 20,000 
file names with every other one might run into the next ice age. However, I 
like the weighting idea, so I might finish up with a hybrid algorithm. I'll let 
you know if anything interesting arises out of this -- Greg K
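
One way the word list / inverted list idea could cut down the number of comparisons is sketched below: index each file name by its words, then only compare names that share at least one word. This is a rough sketch under stated assumptions, not Greg K's actual experiment; `CandidatePairs`, `Words`, `Find` and the `minWordLength` cut-off are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class CandidatePairs
{
    // Splits a file name into upper-case words made of A-Z and 0-9;
    // every other character is treated as a separator.
    public static IEnumerable<string> Words(string fileName)
    {
        var current = new StringBuilder();
        foreach (char ch in fileName.ToUpperInvariant())
        {
            if ((ch >= 'A' && ch <= 'Z') || (ch >= '0' && ch <= '9'))
            {
                current.Append(ch);
            }
            else if (current.Length > 0)
            {
                yield return current.ToString();
                current.Clear();
            }
        }
        if (current.Length > 0) yield return current.ToString();
    }

    // Builds an inverted list (word -> indexes of the names containing it)
    // and returns every pair of indexes (i < j) that share at least one word.
    // Words shorter than minWordLength (an assumed cut-off) are skipped so
    // that "mp3", "the", "of" and similar do not pair everything together.
    public static HashSet<(int, int)> Find(IList<string> names, int minWordLength = 4)
    {
        var index = new Dictionary<string, List<int>>();
        for (int i = 0; i < names.Count; i++)
        {
            foreach (string word in Words(names[i]).Distinct())
            {
                if (word.Length < minWordLength) continue;
                if (!index.TryGetValue(word, out var list))
                {
                    list = new List<int>();
                    index[word] = list;
                }
                list.Add(i);
            }
        }

        var pairs = new HashSet<(int, int)>();
        foreach (var list in index.Values)
            for (int a = 0; a < list.Count; a++)
                for (int b = a + 1; b < list.Count; b++)
                    pairs.Add((list[a], list[b]));
        return pairs;
    }
}
```

Only the surviving pairs would then need the expensive weighted comparison. A word that appears in a large share of the names (a composer with hundreds of tracks, say) still produces a long list, so very common words may need to be excluded as well.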

 

On 29 November 2014 at 11:17, Greg Harris <g...@harrisconsultinggroup.com> 
wrote:

Hi Greg,

 

I should look at my code before I write comments from memory...

The result is a double value being the sum of:

·         number of times the same letter appears in both strings

·         10 times the number of times the same two letters appear in both strings

·         100 times the number of times the same three letters appear in both strings

Which is then divided by the length of the two strings to sort of “normalise” 
the result.

Mixed case is ignored, only compares letters A-Z and 0-9, everything else is 
excluded.
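
The arithmetic above can be sketched as a breakdown into the three counts. This mirrors the loop in the CompareSoundsLike code posted in this thread, but the names `ScoreBreakdown`, `Count` and `Score` are illustrative only, and the inputs are assumed to be already cleaned (upper-case A-Z and 0-9). As in the posted loop, the third check only compares the character two positions ahead of each matching letter.

```csharp
using System;

public static class ScoreBreakdown
{
    // Counts the three kinds of hit for inputs that are ALREADY cleaned
    // (upper-case A-Z and 0-9 only; this is an assumption of the sketch).
    public static (int Singles, int Doubles, int Triples) Count(string a, string b)
    {
        int singles = 0, doubles = 0, triples = 0;
        for (int i = 0; i < a.Length; i++)
        {
            for (int j = 0; j < b.Length; j++)
            {
                if (a[i] != b[j]) continue;
                singles++;
                // The next character also matches.
                if (i + 1 < a.Length && j + 1 < b.Length && a[i + 1] == b[j + 1]) doubles++;
                // The character two positions ahead matches.
                if (i + 2 < a.Length && j + 2 < b.Length && a[i + 2] == b[j + 2]) triples++;
            }
        }
        return (singles, doubles, triples);
    }

    // Weighted sum (1, 10, 100) divided by the combined length of both strings.
    public static double Score(string a, string b)
    {
        var (s, d, t) = Count(a, b);
        return (s + 10.0 * d + 100.0 * t) / (a.Length + b.Length);
    }
}
```

For "QWERTY" against itself the counts are 6, 5 and 4, so the sum is 6 + 50 + 400 = 456 and the score is 456 / 12 = 38. For "QWERTY" against "YTREWQ" only the six single letters match, giving 6 / 12 = 0.5.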

I added a unit test for Greg to better show the results; it follows.

 

Regards 

Greg Harris

 

    [TestMethod] public void Test_10_Compare3_ForGregKeogh()

    {

      //                       123456789-123456789-123456789-123456789-12345

      string lTestLine1     = "Lovelock - Trumpet Concerto (SSO Concert).mp3";

      string lTestLine2     = "Trumpet Concerto (William Lovelock).mp3";

      double lExpected      = 3033/(36.0 + 33.0); // = 43.9

      double lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );

      Assert.AreEqual<double>( lExpected, lResult );

 

      // This is an example of exactly the same string, so will get the best possible match

      lTestLine1     = "Lovelock - Trumpet Concerto (SSO Concert).mp3";

      lTestLine2     = "Lovelock - Trumpet Concerto (SSO Concert).mp3";

      lExpected      = 5256/(36.0 + 36.0); // = 73.0

      lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );

      Assert.AreEqual<double>( lExpected, lResult );

 

      // This is an example of exactly the same string, with case difference, which is ignored,

      // so will also get the best possible match

      lTestLine1     = "Lovelock - Trumpet Concerto (SSO Concert).mp3";

      lTestLine2     = "LOVELOCK - TRUMPET CONCERTO (SSO CONCERT).mp3";

      lExpected      = 5256/(36.0 + 36.0); // = 73.0

      lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );

      Assert.AreEqual<double>( lExpected, lResult );

 

      // This is an example of a spelling/typing mistake, so will get a very good match

      lTestLine1     = "Lovelock - Trumpet Concerto (SSO Concert).mp3";

      lTestLine2     = "Lovelock - Trumpet Concerto (SoSo Concert).mp3";

      lExpected      = 5272/(36.0 + 37.0); // = 72.2

      lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );

      Assert.AreEqual<double>( lExpected, lResult );

 

      // This is an example of a truncation, so will get a poor match

      lTestLine1     = "Lovelock - Trumpet Concerto (SSO Concert).mp3";

      lTestLine2     = "Lovelock - Trumpet Concerto.mp3";

      lExpected      = 3237/(36.0 + 26.0); // = 52.2

      lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );

      Assert.AreEqual<double>( lExpected, lResult );

 

      // This will get a match on William and a little else...

      lTestLine1     = "Trumpet Concerto (William Lovelock).mp3";

      lTestLine2     = "The Complete Works of William Shakespeare.txt";

      lExpected      = 1202/(33.0 + 39.0); // = 16.69

      lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );

      Assert.AreEqual<double>( lExpected, lResult );

 

      // This will get a match on each of the letters, but no double letters

      lTestLine1     = "QWERTY";

      lTestLine2     = "ytrewq";

      lExpected      = 6/(6.0 + 6.0); // = 0.5

      lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );

      Assert.AreEqual<double>( lExpected, lResult );

 

      // This will get a match on nothing

      lTestLine1     = "QWERTY";

      lTestLine2     = "ASDFGHJKL";

      lExpected      = 0/(6.0 + 9.0); // = 0.0

      lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );

      Assert.AreEqual<double>( lExpected, lResult );

    }

 

 

On Sat, Nov 29, 2014 at 10:16 AM, Greg Harris <g...@harrisconsultinggroup.com> 
wrote:

Hi Greg,

 

Please find following what I have used in the past.

It is very expensive, but I cannot see a better way of doing it.

It returns an integer which is the sum of:

*       number of times the same letter appears in both strings
*       10 times the number of times the same two letters appear in both strings
*       100 times the number of times the same three letters appear in both strings

Once you get your results, sort them, the most similar strings will have higher 
results.

I used this many years ago and have not used it since.

There may be (far) better ways to do this.
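
The "score, then sort" step could be sketched as follows. The scoring function is passed in as a parameter, so any comparison (for example the CompareSoundsLike method in this thread) can be plugged in; `SimilarityRanking`, `Rank` and the `top` parameter are hypothetical names for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SimilarityRanking
{
    // Scores every pair of names once (i < j), then returns the
    // highest-scoring pairs first, limited to the first 'top' results.
    public static List<(string First, string Second, double Score)> Rank(
        IList<string> names, Func<string, string, double> score, int top = 10)
    {
        var results = new List<(string First, string Second, double Score)>();
        for (int i = 0; i < names.Count; i++)
            for (int j = i + 1; j < names.Count; j++)
                results.Add((names[i], names[j], score(names[i], names[j])));

        return results.OrderByDescending(r => r.Score).Take(top).ToList();
    }
}
```

Comparing every pair is quadratic in the number of names, which is the "very expensive" part: 20,000 names give roughly 200 million pairs.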

 

Regards

Greg Harris

 

    public static string   CleanStr               ( this string aText )

    {

      int           diff   = 'A' - 'a';

      StringBuilder result = new StringBuilder();

      foreach ( char ch in aText )

      {

        if (    ( ch >= '0' && ch <= '9' ) 

             || ( ch >= 'A' && ch <= 'Z' ) )

        {

          result.Append(ch);

        }

        else

        {

          if ( ch >= 'a' && ch <= 'z' ) 

          {

            result.Append((char)(ch+diff));

          }

 

        }

      }

      return result.ToString();

    }

    /// <summary>

    /// Do a sounds like compare, the higher the result, the more the words/phrases sound the same

    /// </summary>

    /// <param name="aStr1">First word / phrase</param>

    /// <param name="aStr2">Second word / phrase</param>

    /// <returns>Score</returns>

    public static double   CompareSoundsLike     ( this string aStr1, string 
aStr2 )

    {

      aStr1 = aStr1.CleanStr();

      aStr2 = aStr2.CleanStr();

      double result = 0;

      for (int i = 0; i < aStr1.Length; i++)

      {

        char outerChar = aStr1[i];

        for (int j = 0; j < aStr2.Length; j++)

        {

          char innerChar = aStr2[j];

          if ( outerChar == innerChar )

          {

            result++;

            if ( ( i < aStr1.Length-1 ) && ( j < aStr2.Length-1 ) && ( 
aStr1[i+1] == aStr2[j+1] ) ) result += 10 ;

            if ( ( i < aStr1.Length-2 ) && ( j < aStr2.Length-2 ) && ( 
aStr1[i+2] == aStr2[j+2] ) ) result += 100;

          }

        }

      }

      return result / ( aStr1.Length + aStr2.Length );

    }

 

 

 

    [TestMethod] public void Test_10_Compare1()

    {

      //                       123456

      string lTestLine1     = "qwerty";

      string lTestLine2     = "QWERTY";

      double lExpected      = 456.0/(6.0 + 6.0); // = 38.0

      double lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );

      Assert.AreEqual<double>( lExpected, lResult );

    }

    [TestMethod] public void Test_10_Compare2()

    {

      //                       123456789-123456789-123456789-123456789-12xxx

      string lTestLine1     = "The quick brown fox jumped over the !@#$ dog!";

      string lTestLine2     = "T H E  -  Q U I C K  -  B R O W N  -  F O X  -  J U M P E D  -  O V E R  -  T H E  -  D O G";

      double lExpected      = 3856.0/(32.0 + 32.0);

      double lResult        = lTestLine1.CompareSoundsLike( lTestLine2 );

      Assert.AreEqual<double>( lExpected, lResult );

    }

 

 

 

On Sat, Nov 29, 2014 at 9:46 AM, Greg Keogh <g...@mira.net> wrote:

Folks, I was about to write some utility code to search through my 20,000 
audio files looking for probable duplicates. I say "probable" because I found 
file names like these:

 

Lovelock - Trumpet Concerto (SSO Concert).mp3

Trumpet Concerto (William Lovelock).mp3

 

There are many other duplicates with rearranged, abbreviated or misspelt words 
in the names. I was about to click "New Project" and start typing but I 
suddenly realised I had no idea what algorithm to use to find probable 
duplicates and rate them. Has anyone done this sort of thing before or know 
where to find a description of a suitable algorithm?

 

Greg K
