RE: Duplicate matching

2014-11-30 Thread Adrian Halid
Hi Greg,

Instead of using the filename to determine duplicate audio files have you 
considered using an audio fingerprint?

I have used this software in the past to automatically tag my music.
https://picard.musicbrainz.org/
“Picard uses AcoustID audio fingerprints, allowing files to be identified by 
the actual music, even if they have no metadata”

Apparently it uses http://acoustid.org/ which is an open source library.


Regards

Adrian Halid


From: ozdotnet-boun...@ozdotnet.com [mailto:ozdotnet-boun...@ozdotnet.com] On 
Behalf Of Greg Keogh
Sent: Saturday, 29 November 2014 6:46 AM
To: ozDotNet
Subject: Duplicate matching

Folks, I was about to write some utility code to search through my 20,000 
audio files looking for probable duplicates. I say probable because I found 
file names like these:

Lovelock - Trumpet Concerto (SSO Concert).mp3
Trumpet Concerto (William Lovelock).mp3

There are many other duplicates with rearranged, abbreviated or misspelt words 
in the names. I was about to click New Project and start typing but I 
suddenly realised I had no idea what algorithm to use to find probable 
duplicates and rate them. Has anyone done this sort of thing before or know 
where to find a description of a suitable algorithm?

Greg K


Re: Duplicate matching

2014-11-30 Thread Greg Keogh

> Instead of using the filename to determine duplicate audio files have you
> considered using an audio fingerprint? ... Apparently it uses
> http://acoustid.org/ which is an open source library.


This is an interesting lateral-thinking idea. That's an ambitious and
scientifically interesting project. I can't submit 90GB of music to their
web service, but if the algorithm can be run locally to generate the
fingerprints then it would be efficient to look up their database -- *Greg K*


RE: Duplicate matching

2014-11-30 Thread Adrian Halid
I am pretty sure the fingerprints are calculated on your local machine. It 
would only be the fingerprint that you send to the web service.

This is how the Picard application works.

Regards

Adrian Halid
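
A minimal sketch of the local step, assuming Chromaprint's `fpcalc` command-line tool (the fingerprinter used with AcoustID) is installed and on the PATH, and that it prints `KEY=VALUE` lines such as `DURATION=` and `FINGERPRINT=`. The class and method names are hypothetical, introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class FingerprintSketch
{
    // Parses "KEY=VALUE" lines (the output shape assumed for fpcalc) into a dictionary.
    public static Dictionary<string, string> ParseKeyValueOutput(string output)
    {
        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string[] lines = output.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            int eq = line.IndexOf('=');
            if (eq > 0)
            {
                fields[line.Substring(0, eq)] = line.Substring(eq + 1);
            }
        }
        return fields;
    }

    // Runs fpcalc locally on one file and returns its fingerprint string,
    // or null if no FINGERPRINT line was produced. Only this string, not the
    // audio itself, would need to be sent to a lookup service.
    public static string FingerprintFile(string path)
    {
        var startInfo = new ProcessStartInfo("fpcalc", "\"" + path + "\"")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        using (Process process = Process.Start(startInfo))
        {
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            string fingerprint;
            return ParseKeyValueOutput(output).TryGetValue("FINGERPRINT", out fingerprint)
                ? fingerprint
                : null;
        }
    }
}
```

Note that two encodings of the same recording usually give similar rather than identical fingerprints, so grouping files by exact fingerprint string would miss many duplicates; that fuzzy matching is the part the lookup service takes care of.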


Re: Duplicate matching

2014-11-28 Thread Greg Harris
Hi Greg,

Below is what I have used in the past.
It is very expensive, but I cannot see a better way of doing it.
It returns an integer which is the sum of:

   - number of times the same letter appears in both strings
   - 10 times the number of times the same two letters appear in both
   strings
   - 100 times the number of times the same three letters appear in both
   strings

Once you get your results, sort them; the most similar strings will have
higher results.
I used this many years ago and have not used it since.
There may be (far) better ways to do this.

Regards
Greg Harris

using System.Text;

// Extension methods must be declared in a static class; the class name is arbitrary.
public static class StringExtensions
{
  /// <summary>
  /// Keep only A-Z and 0-9, converting a-z to upper case; drop everything else
  /// </summary>
  public static string CleanStr( this string aText )
  {
    int diff = 'A' - 'a';
    StringBuilder result = new StringBuilder();
    foreach ( char ch in aText )
    {
      if (( ch >= '0' && ch <= '9' )
       || ( ch >= 'A' && ch <= 'Z' ) )
      {
        result.Append(ch);
      }
      else if ( ch >= 'a' && ch <= 'z' )
      {
        result.Append((char)(ch+diff));   // shift lower case to upper case
      }
    }
    return result.ToString();
  }

  /// <summary>
  /// Do a sounds like compare, the higher the result, the more the
  /// words/phrases sound the same
  /// </summary>
  /// <param name="aStr1">First word / phrase</param>
  /// <param name="aStr2">Second word / phrase</param>
  /// <returns>Score</returns>
  public static double CompareSoundsLike( this string aStr1, string aStr2 )
  {
    aStr1 = aStr1.CleanStr();
    aStr2 = aStr2.CleanStr();
    double result = 0;
    for (int i = 0; i < aStr1.Length; i++)
    {
      char outerChar = aStr1[i];
      for (int j = 0; j < aStr2.Length; j++)
      {
        char innerChar = aStr2[j];
        if ( outerChar == innerChar )
        {
          result++;
          // Bonus when the following character also matches
          if ( ( i < aStr1.Length-1 ) && ( j < aStr2.Length-1 ) && ( aStr1[i+1] == aStr2[j+1] ) ) result += 10;
          // Bonus when the character two positions on matches
          // (the character in between is not re-checked here)
          if ( ( i < aStr1.Length-2 ) && ( j < aStr2.Length-2 ) && ( aStr1[i+2] == aStr2[j+2] ) ) result += 100;
        }
      }
    }
    // Normalise by the combined length of the cleaned strings
    return result / ( aStr1.Length + aStr2.Length );
  }
}







[TestMethod] public void Test_10_Compare1()
{
  //                   123456
  string lTestLine1 = "qwerty";
  string lTestLine2 = "QWERTY";
  double lExpected  = 456.0/(6.0 + 6.0); // = 38.0 (6 + 5*10 + 4*100 = 456)
  double lResult    = lTestLine1.CompareSoundsLike( lTestLine2 );
  Assert.AreEqual<double>( lExpected, lResult );
}

[TestMethod] public void Test_10_Compare2()
{
  //                   123456789-123456789-123456789-123456789-12xxx
  string lTestLine1 = "The quick brown fox jumped over the !@#$ dog!";
  string lTestLine2 = "T H E  -  Q U I C K  -  B R O W N  -  F O X  -  J U M P E D  -  O V E R  -  T H E  -  D O G";
  double lExpected  = 3856.0/(32.0 + 32.0);
  double lResult    = lTestLine1.CompareSoundsLike( lTestLine2 );
  Assert.AreEqual<double>( lExpected, lResult );
}
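
To illustrate the "sort them" step, here is a small sketch that scores every pair in a short list and returns the pairs from most to least similar. `Score` is a condensed restatement of the weighting used by `CompareSoundsLike`; the `RankingSketch` class, the `Clean` helper and the `Rank` method are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RankingSketch
{
    // Keep only ASCII letters and digits, upper-casing the letters.
    static string Clean(string text)
    {
        var kept = text
            .Where(c => (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
            .Select(char.ToUpperInvariant);
        return new string(kept.ToArray());
    }

    // 1 per matching character, +10 if the next character also matches,
    // +100 if the character two positions on matches, divided by combined length.
    public static double Score(string a, string b)
    {
        a = Clean(a);
        b = Clean(b);
        double total = 0;
        for (int i = 0; i < a.Length; i++)
        {
            for (int j = 0; j < b.Length; j++)
            {
                if (a[i] != b[j]) continue;
                total += 1;
                if (i < a.Length - 1 && j < b.Length - 1 && a[i + 1] == b[j + 1]) total += 10;
                if (i < a.Length - 2 && j < b.Length - 2 && a[i + 2] == b[j + 2]) total += 100;
            }
        }
        return total / (a.Length + b.Length);
    }

    // Score every unordered pair of names and return them best match first.
    public static List<Tuple<string, string, double>> Rank(IList<string> names)
    {
        var pairs = new List<Tuple<string, string, double>>();
        for (int i = 0; i < names.Count; i++)
        {
            for (int j = i + 1; j < names.Count; j++)
            {
                pairs.Add(Tuple.Create(names[i], names[j], Score(names[i], names[j])));
            }
        }
        return pairs.OrderByDescending(p => p.Item3).ToList();
    }
}
```

Calling `RankingSketch.Rank` on a list of file names gives the likeliest duplicates at the top of the result; where to cut the list off is a judgment call that depends on the collection.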




Re: Duplicate matching

2014-11-28 Thread Greg Harris
Hi Greg,


I should look at my code before I write comments from memory...

The result is a *double* value being the sum of:

· number of times the same letter appears in both strings

· 10 times the number of times the same two letters appear in both
strings

· 100 times the number of times the same three letters appear in
both strings

*Which is then divided by the combined length of the two strings to sort of
“normalise” the result.*

Mixed case is ignored; only letters A-Z and digits 0-9 are compared, and
everything else is excluded.

I added a unit test for Greg to better show the results, which follows…


Regards

Greg Harris



[TestMethod] public void Test_10_Compare3_ForGregKeogh()
{
  //                   123456789-123456789-123456789-123456789-12345
  string lTestLine1 = "Lovelock - Trumpet Concerto (SSO Concert).mp3";
  string lTestLine2 = "Trumpet Concerto (William Lovelock).mp3";
  double lExpected  = 3033/(36.0 + 33.0); // = 43.96
  double lResult    = lTestLine1.CompareSoundsLike( lTestLine2 );
  Assert.AreEqual<double>( lExpected, lResult );

  // This is an example of exactly the same string, so will get the best possible match
  lTestLine1 = "Lovelock - Trumpet Concerto (SSO Concert).mp3";
  lTestLine2 = "Lovelock - Trumpet Concerto (SSO Concert).mp3";
  lExpected  = 5256/(36.0 + 36.0); // = 73.0
  lResult    = lTestLine1.CompareSoundsLike( lTestLine2 );
  Assert.AreEqual<double>( lExpected, lResult );

  // This is an example of exactly the same string, with case difference, which is ignored,
  // so will also get the best possible match
  lTestLine1 = "Lovelock - Trumpet Concerto (SSO Concert).mp3";
  lTestLine2 = "LOVELOCK - TRUMPET CONCERTO (SSO CONCERT).mp3";
  lExpected  = 5256/(36.0 + 36.0); // = 73.0
  lResult    = lTestLine1.CompareSoundsLike( lTestLine2 );
  Assert.AreEqual<double>( lExpected, lResult );

  // This is an example of a spelling/typing mistake, so will get a very good match
  lTestLine1 = "Lovelock - Trumpet Concerto (SSO Concert).mp3";
  lTestLine2 = "Lovelock - Trumpet Concerto (SoSo Concert).mp3";
  lExpected  = 5272/(36.0 + 37.0); // = 72.2
  lResult    = lTestLine1.CompareSoundsLike( lTestLine2 );
  Assert.AreEqual<double>( lExpected, lResult );

  // This is an example of a truncation, so will get a poor match
  lTestLine1 = "Lovelock - Trumpet Concerto (SSO Concert).mp3";
  lTestLine2 = "Lovelock - Trumpet Concerto.mp3";
  lExpected  = 3237/(36.0 + 26.0); // = 52.2
  lResult    = lTestLine1.CompareSoundsLike( lTestLine2 );
  Assert.AreEqual<double>( lExpected, lResult );

  // This will get a match on William and a little else...
  lTestLine1 = "Trumpet Concerto (William Lovelock).mp3";
  lTestLine2 = "The Complete Works of William Shakespeare.txt";
  lExpected  = 1202/(33.0 + 39.0); // = 16.69
  lResult    = lTestLine1.CompareSoundsLike( lTestLine2 );
  Assert.AreEqual<double>( lExpected, lResult );

  // This will get a match on each of the letters, but no double letters
  lTestLine1 = "QWERTY";
  lTestLine2 = "ytrewq";
  lExpected  = 6/(6.0 + 6.0); // = 0.5
  lResult    = lTestLine1.CompareSoundsLike( lTestLine2 );
  Assert.AreEqual<double>( lExpected, lResult );

  // This will get a match on nothing
  lTestLine1 = "QWERTY";
  lTestLine2 = "ASDFGHJKL";
  lExpected  = 0/(6.0 + 9.0); // = 0.0
  lResult    = lTestLine1.CompareSoundsLike( lTestLine2 );
  Assert.AreEqual<double>( lExpected, lResult );
}



Re: Duplicate matching

2014-11-28 Thread Greg Keogh
Thanks Greg H, the weighting is a very interesting idea. I'm running some
simple experiments now with a word list and an inverted list of file names,
just to help me picture the problem in my head. The problem with a
weighting comparison is that I don't know what to compare with what,
comparing 20,000 file names with every other one might run into the next
ice age. However, I like the weighting idea, so I might finish up with a
hybrid algorithm. I'll let you know if anything interesting arises out of
this -- *Greg K*
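
One way to avoid comparing every name with every other is to use the inverted list as a filter: index each file name by the words it contains, then score only the pairs that share at least one word. The sketch below is an assumption about how such a hybrid might look, not a description of anything built in this thread; `CandidateSketch`, `Tokenize`, `CandidatePairs` and the minimum word length of 3 are hypothetical choices.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class CandidateSketch
{
    // Split a file name (without its extension) into upper-cased alphanumeric
    // words, dropping words shorter than minLength.
    public static HashSet<string> Tokenize(string fileName, int minLength = 3)
    {
        string stem = Path.GetFileNameWithoutExtension(fileName).ToUpperInvariant();
        return new HashSet<string>(
            Regex.Split(stem, "[^A-Z0-9]+").Where(w => w.Length >= minLength));
    }

    // Build a word -> file index list, then emit each pair of file indexes
    // that appears together under at least one word.
    public static HashSet<Tuple<int, int>> CandidatePairs(IList<string> fileNames)
    {
        var index = new Dictionary<string, List<int>>();
        for (int i = 0; i < fileNames.Count; i++)
        {
            foreach (string word in Tokenize(fileNames[i]))
            {
                List<int> postings;
                if (!index.TryGetValue(word, out postings))
                {
                    postings = new List<int>();
                    index[word] = postings;
                }
                postings.Add(i);
            }
        }

        var pairs = new HashSet<Tuple<int, int>>();
        foreach (List<int> postings in index.Values)
        {
            for (int a = 0; a < postings.Count; a++)
            {
                for (int b = a + 1; b < postings.Count; b++)
                {
                    pairs.Add(Tuple.Create(postings[a], postings[b]));
                }
            }
        }
        return pairs;
    }
}
```

Only the returned pairs would then be passed to the weighted comparison. Very common words (for example "THE" or "CONCERTO" in a classical collection) can still produce long lists, so a real run might skip any word shared by more than some chosen number of files.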

Re: Duplicate matching

2014-11-28 Thread Stephen Price
Am curious, is the idea of the exercise to write your own code to solve the
problem, or to solve the problem? I've used TreeSize Pro to find file
duplicates in the past. Also have used Directory Opus to find duplicates.
Great for finding identical files with different names. Probably won't help
if the songs are the same song but from a different source. Your file name
pattern matching code would be the way to go. (Which is also the case if
this is a programming exercise :)

Maybe I'm a lazy coder, I usually look for someone else's product/code
before writing my own. I can see the benefit of writing your own too.

Re: Duplicate matching

2014-11-28 Thread Greg Keogh
Hi Stephen, I wrote a utility in Framework 1.0 that finds duplicate files
by content (builds a dictionary of checksums). In this case the files with
similar names might be the same recording at different bitrates, making
them binary different. So it's a bit fuzzy what I'm looking for. Off the
cuff I thought this might be a non-trivial algorithm, like a soundex for
whole file names and there wouldn't be an existing tool for the exact
purpose (I hope I'm wrong!) -- *GK*
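
The content-based approach mentioned above can be sketched as follows. This is not the Framework 1.0 utility itself, only a minimal illustration of the same idea: hash each file's bytes and group the names by hash. The choice of SHA-256 and the names `ChecksumSketch`, `HashBytes`, `FindDuplicates` and `FindDuplicatesOnDisk` are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class ChecksumSketch
{
    // Returns a hex string of the SHA-256 hash of the content.
    public static string HashBytes(byte[] content)
    {
        using (SHA256 sha = SHA256.Create())
        {
            return BitConverter.ToString(sha.ComputeHash(content));
        }
    }

    // Groups names by content hash and returns only groups with two or more
    // members, i.e. sets of byte-identical files.
    public static List<List<string>> FindDuplicates(IEnumerable<KeyValuePair<string, byte[]>> files)
    {
        var byHash = new Dictionary<string, List<string>>();
        foreach (var file in files)
        {
            string hash = HashBytes(file.Value);
            List<string> names;
            if (!byHash.TryGetValue(hash, out names))
            {
                names = new List<string>();
                byHash[hash] = names;
            }
            names.Add(file.Key);
        }
        return byHash.Values.Where(group => group.Count > 1).ToList();
    }

    // Convenience wrapper: reads every file under a folder, one at a time.
    public static List<List<string>> FindDuplicatesOnDisk(string root)
    {
        var files = Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .Select(path => new KeyValuePair<string, byte[]>(path, File.ReadAllBytes(path)));
        return FindDuplicates(files);
    }
}
```

As noted above, this only catches byte-identical copies: the same recording encoded at a different bitrate hashes differently, which is why name matching or audio fingerprints are needed for the fuzzy cases.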

RE: Duplicate matching

2014-11-28 Thread ILT (O)
Yes, I use TreeSize (Professional) when I need to discover files on disks. I’ve 
had to do it remotely using TeamViewer – hence the Pro version – but a free 
version and also a trial of the Pro version are available as I recall. It’s 
worth a try.

But I’m interested in the algorithm and the code, since it might be useful 
within a program of mine and also in a personal scenario similar to Greg K’s. 

 

Ian Thomas
Albert Park, Victoria

 

Re: Duplicate matching

2014-11-28 Thread Stephen Price
Beyond Compare has a dedicated viewer for MP3 files, but it looks like it
compares the tags, not the actual audio. I think it's awesome for comparing
files and folders, but I'm not sure if it can be used to find the duplicates
in a single folder. Also it would need the MP3 tags to be correct (there are
tools for that, and I think this list has discussed music tools
previously).

As a side note I deleted all of my music and use music subscriptions now.
Switched between several and have finally settled (for now) on Google music
pass. So I don't have the problem of duplicate music now :)

