RE: Duplicate matching
Hi Greg, Instead of using the filename to determine duplicate audio files have you considered using an audio fingerprint? I have used this software in the past to automatically tag my music. https://picard.musicbrainz.org/ “Picard uses AcoustID audio fingerprints, allowing files to be identified by the actual music, even if they have no metadata” Apparently it uses http://acoustid.org/ which is an open source library. Regards Adrian Halid From: ozdotnet-boun...@ozdotnet.com [mailto:ozdotnet-boun...@ozdotnet.com] On Behalf Of Greg Keogh Sent: Saturday, 29 November 2014 6:46 AM To: ozDotNet Subject: Duplicate matching Folks, I was about this write some utility code to search through my 20,000 audio files looking for probable duplicates. I say probable because I found file names like these: Lovelock - Trumpet Concerto (SSO Concert).mp3 Trumpet Concerto (William Lovelock).mp3 There are many other duplicates with rearranged, abbreviated or misspelt words in the names. I was about to click New Project and start typing but I suddenly realised I had no idea what algorithm to use to find probable duplicates and rate them. Has anyone done this sort of thing before or know where to find a description of a suitable algorithm? Greg K
Re: Duplicate matching
Instead of using the filename to determine duplicate audio files have you considered using an audio fingerprint? ... Apparently it uses http://acoustid.org/ which is an open source library. This is an interesting lateral-thinking idea. That's an ambitious and scientifically interesting project. I can't submit 90GB of music to their web service, but if the algorithm can be run locally to generate the fingerprints then it would efficient to lookup their database -- *Greg K*
RE: Duplicate matching
I am pretty sure the fingerprints are calculated on your local machine. It would only be the finger print that you send to the web service. This is how the Picard application works. Regards Adrian Halid From: ozdotnet-boun...@ozdotnet.com [mailto:ozdotnet-boun...@ozdotnet.com] On Behalf Of Greg Keogh Sent: Monday, 1 December 2014 8:35 AM To: ozDotNet Subject: Re: Duplicate matching Instead of using the filename to determine duplicate audio files have you considered using an audio fingerprint? ... Apparently it uses http://acoustid.org/ which is an open source library. This is an interesting lateral-thinking idea. That's an ambitious and scientifically interesting project. I can't submit 90GB of music to their web service, but if the algorithm can be run locally to generate the fingerprints then it would efficient to lookup their database -- Greg K
Re: Duplicate matching
Hi Greg, Please find following what I have used in the past. It is very expensive, but I can not see a better way of doing it. It returns an integer which is the sum of: - number of times the same letter appears in both strings - 10 times the number of times the same two letters appears in both strings - 100 times the number of times the same three letters appears in both strings Once you get your results, sort them, the most similar strings will have higher results. I used this many years ago and not used it since. There may be (far) better ways to do this. Regards Greg Harris public static string CleanStr ( this string aText ) { int diff = 'A' - 'a'; StringBuilder result = new StringBuilder(); foreach ( char ch in aText ) { if (( ch = '0' ch = '9' ) || ( ch = 'A' ch = 'Z' ) ) { result.Append(ch); } else { if ( ch = 'a' ch = 'z' ) { result.Append((char)(ch+diff)); } } } return result.ToString(); } /// summary /// Do a sounds like compare, the higher the result, the more the words/phrases sound the same /// /summary /// param name=aStr1First word / phrase/param /// param name=aStr2Second word / phrase/param /// returnsScore/returns public static double CompareSoundsLike ( this string aStr1, string aStr2 ) { aStr1 = aStr1.CleanStr(); aStr2 = aStr2.CleanStr(); double result = 0; for (int i = 0; i aStr1.Length; i++) { char outerChar = aStr1[i]; for (int j = 0; j aStr2.Length; j++) { char innerInner = aStr2[j]; if ( outerChar == innerInner ) { result++; if ( ( i aStr1.Length-1 ) ( j aStr2.Length-1 ) ( aStr1[i+1] == aStr2[j+1] ) ) result += 10 ; if ( ( i aStr1.Length-2 ) ( j aStr2.Length-2 ) ( aStr1[i+2] == aStr2[j+2] ) ) result += 100; } } } return result / ( aStr1.Length + aStr2.Length ); } [TestMethod] public void Test_10_Compare1() { // 123456 string lTestLine1 = qwerty; string lTestLine2 = QWERTY; double lExpected = 456/(6+6); double lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); } [TestMethod] public void Test_10_Compare2() { // 123456789-123456789-123456789-123456789-12xxx string lTestLine1 = The quick brown fox jumped over the !@#$ dog!; string lTestLine2 = T H E - Q U I C K - B R O W N - F O X - J U M P E D - O V E R - T H E - D O G; double lExpected = 3856.0/(32.0 + 32.0); double lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); } On Sat, Nov 29, 2014 at 9:46 AM, Greg Keogh g...@mira.net wrote: Folks, I was about this write some utility code to search through my 20,000 audio files looking for probable duplicates. I say probable because I found file names like these: Lovelock - Trumpet Concerto (SSO Concert).mp3 Trumpet Concerto (William Lovelock).mp3 There are many other duplicates with rearranged, abbreviated or misspelt words in the names. I was about to click New Project and start typing but I suddenly realised I had no idea what algorithm to use to find probable duplicates and rate them. Has anyone done this sort of thing before or know where to find a description of a suitable algorithm? *Greg K*
Re: Duplicate matching
Hi Greg, I should look at my code before I write comments from memory... The result is a *double *value being the sum of: · number of times the same letter appears in both strings · 10 times the number of times the same two letters appears in both strings · 100 times the number of times the same three letters appears in both strings *Which is then divided by the length of the two strings to sort of “normalise” the result.* Mixed case is ignored, only compares letters A-Z and 0-9, everything else is excluded. I added a Greg unit test to better show the results which is following… Regards Greg Harris [TestMethod] public void Test_10_Compare3_ForGregKeogh() { // 123456789-123456789-123456789-123456789-12456 string lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; string lTestLine2 = Trumpet Concerto (William Lovelock).mp3; double lExpected = 3033/(36.0 + 33.0); // = 43.9 double lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of exactly the same string, so will get the best posible match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lExpected = 5256/(36.0 + 36.0); // = 73.0 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of exactly the same string, with case difference, which is ignored, // so will also get the best possible match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = LOVELOCK - TRUMPET CONCERTO (SSO CONCERT).mp3; lExpected = 5256/(36.0 + 36.0); // = 73.0 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of a spelling/typing mistake, so will get a very good match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto (SoSo Concert).mp3; lExpected = 5272/(36.0 + 37.0); // = 72.2 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of a truncation, so will get a poor match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto.mp3; lExpected = 3237/(36.0 + 26.0); // = 52.2 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This will get a match on William and a little else... lTestLine1 = Trumpet Concerto (William Lovelock).mp3; lTestLine2 = The Complete Works of William Shakespeare.txt; lExpected = 1202/(33.0 + 39.0); // = 16.69 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This will get a match on each of the letters, but no double letters lTestLine1 = QWERTY; lTestLine2 = ytrewq; lExpected = 6/(6.0 + 6.0); // = 0.5 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This will get a match on nothing lTestLine1 = QWERTY; lTestLine2 = ASDFGHJKL; lExpected = 0/(6.0 + 9.0); // = 0.0 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); } On Sat, Nov 29, 2014 at 10:16 AM, Greg Harris g...@harrisconsultinggroup.com wrote: Hi Greg, Please find following what I have used in the past. It is very expensive, but I can not see a better way of doing it. It returns an integer which is the sum of: - number of times the same letter appears in both strings - 10 times the number of times the same two letters appears in both strings - 100 times the number of times the same three letters appears in both strings Once you get your results, sort them, the most similar strings will have higher results. I used this many years ago and not used it since. There may be (far) better ways to do this. Regards Greg Harris public static string CleanStr ( this string aText ) { int diff = 'A' - 'a'; StringBuilder result = new StringBuilder(); foreach ( char ch in aText ) { if (( ch = '0' ch = '9' ) || ( ch = 'A' ch = 'Z' ) ) { result.Append(ch); } else { if ( ch = 'a' ch = 'z' ) { result.Append((char)(ch+diff)); } } }
Re: Duplicate matching
Thanks Greg H, the weighting is a very interesting idea. I'm running some simple experiments now with a word list and an inverted list of file names, just to help me picture the problem in my head. The problem with a weighting comparison is that I don't know what to compare with what, comparing 20,000 file names with every other one might run into the next ice age. However, I like the weighting idea, so I might finish up with a hybrid algorithm. I'll let you know if anything interesting arises out of this -- *Greg K* On 29 November 2014 at 11:17, Greg Harris g...@harrisconsultinggroup.com wrote: Hi Greg, I should look at my code before I write comments from memory... The result is a *double *value being the sum of: · number of times the same letter appears in both strings · 10 times the number of times the same two letters appears in both strings · 100 times the number of times the same three letters appears in both strings *Which is then divided by the length of the two strings to sort of “normalise” the result.* Mixed case is ignored, only compares letters A-Z and 0-9, everything else is excluded. I added a Greg unit test to better show the results which is following… Regards Greg Harris [TestMethod] public void Test_10_Compare3_ForGregKeogh() { // 123456789-123456789-123456789-123456789-12456 string lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; string lTestLine2 = Trumpet Concerto (William Lovelock).mp3; double lExpected = 3033/(36.0 + 33.0); // = 43.9 double lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of exactly the same string, so will get the best posible match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lExpected = 5256/(36.0 + 36.0); // = 73.0 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of exactly the same string, with case difference, which is ignored, // so will also get the best possible match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = LOVELOCK - TRUMPET CONCERTO (SSO CONCERT).mp3; lExpected = 5256/(36.0 + 36.0); // = 73.0 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of a spelling/typing mistake, so will get a very good match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto (SoSo Concert).mp3; lExpected = 5272/(36.0 + 37.0); // = 72.2 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of a truncation, so will get a poor match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto.mp3; lExpected = 3237/(36.0 + 26.0); // = 52.2 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This will get a match on William and a little else... lTestLine1 = Trumpet Concerto (William Lovelock).mp3; lTestLine2 = The Complete Works of William Shakespeare.txt; lExpected = 1202/(33.0 + 39.0); // = 16.69 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This will get a match on each of the letters, but no double letters lTestLine1 = QWERTY; lTestLine2 = ytrewq; lExpected = 6/(6.0 + 6.0); // = 0.5 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This will get a match on nothing lTestLine1 = QWERTY; lTestLine2 = ASDFGHJKL; lExpected = 0/(6.0 + 9.0); // = 0.0 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); } On Sat, Nov 29, 2014 at 10:16 AM, Greg Harris g...@harrisconsultinggroup.com wrote: Hi Greg, Please find following what I have used in the past. It is very expensive, but I can not see a better way of doing it. It returns an integer which is the sum of: - number of times the same letter appears in both strings - 10 times the number of times the same two letters appears in both strings - 100 times the number of times the same three letters appears in both strings Once you get your results, sort them, the most similar strings will have
Re: Duplicate matching
Am curious, is the idea of the exercise to write your own code to solve the problem, or to solve the problem? I've used Treesize pro to find file duplicates in the past. Also have used Directory Opus to find duplicates. Great for finding identical files with different names. Probably won't help if the songs are the same song but from a different source. Your file name pattern matching code would be the way to go. (Which is also the case if this is a programming exercise :) Maybe I'm a lazy coder, I usually look for someone elses product/code before writing my own. I can see the benefit of writing your own too. On Sat, Nov 29, 2014 at 8:55 AM, Greg Keogh g...@mira.net wrote: Thanks Greg H, the weighting is a very interesting idea. I'm running some simple experiments now with a word list and an inverted list of file names, just to help me picture the problem in my head. The problem with a weighting comparison is that I don't know what to compare with what, comparing 20,000 file names with every other one might run into the next ice age. However, I like the weighting idea, so I might finish up with a hybrid algorithm. I'll let you know if anything interesting arises out of this -- *Greg K* On 29 November 2014 at 11:17, Greg Harris g...@harrisconsultinggroup.com wrote: Hi Greg, I should look at my code before I write comments from memory... The result is a *double *value being the sum of: · number of times the same letter appears in both strings · 10 times the number of times the same two letters appears in both strings · 100 times the number of times the same three letters appears in both strings *Which is then divided by the length of the two strings to sort of “normalise” the result.* Mixed case is ignored, only compares letters A-Z and 0-9, everything else is excluded. I added a Greg unit test to better show the results which is following… Regards Greg Harris [TestMethod] public void Test_10_Compare3_ForGregKeogh() { // 123456789-123456789-123456789-123456789-12456 string lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; string lTestLine2 = Trumpet Concerto (William Lovelock).mp3; double lExpected = 3033/(36.0 + 33.0); // = 43.9 double lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of exactly the same string, so will get the best posible match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lExpected = 5256/(36.0 + 36.0); // = 73.0 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of exactly the same string, with case difference, which is ignored, // so will also get the best possible match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = LOVELOCK - TRUMPET CONCERTO (SSO CONCERT).mp3; lExpected = 5256/(36.0 + 36.0); // = 73.0 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of a spelling/typing mistake, so will get a very good match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto (SoSo Concert).mp3; lExpected = 5272/(36.0 + 37.0); // = 72.2 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of a truncation, so will get a poor match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto.mp3; lExpected = 3237/(36.0 + 26.0); // = 52.2 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This will get a match on William and a little else... lTestLine1 = Trumpet Concerto (William Lovelock).mp3; lTestLine2 = The Complete Works of William Shakespeare.txt; lExpected = 1202/(33.0 + 39.0); // = 16.69 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This will get a match on each of the letters, but no double letters lTestLine1 = QWERTY; lTestLine2 = ytrewq; lExpected = 6/(6.0 + 6.0); // = 0.5 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This will get a match on nothing lTestLine1 = QWERTY; lTestLine2 = ASDFGHJKL; lExpected = 0/(6.0 + 9.0); // = 0.0
Re: Duplicate matching
Hi Stephen, I wrote a utility in Framework 1.0 that finds duplicate files by content (builds a dictionary of checksums). In this case the files with similar names might be the same recording at different bitrates, making them binary different. So it's a bit fuzzy what I'm looking for. Off the cuff I thought this might be a non-trivial algorithm, like a soundex for whole file names and there wouldn't be an existing tool for the exact purpose (I hope I'm wrong!) -- *GK* On 29 November 2014 at 12:29, Stephen Price step...@perthprojects.com wrote: Am curious, is the idea of the exercise to write your own code to solve the problem, or to solve the problem? I've used Treesize pro to find file duplicates in the past. Also have used Directory Opus to find duplicates. Great for finding identical files with different names. Probably won't help if the songs are the same song but from a different source. Your file name pattern matching code would be the way to go. (Which is also the case if this is a programming exercise :) Maybe I'm a lazy coder, I usually look for someone elses product/code before writing my own. I can see the benefit of writing your own too. On Sat, Nov 29, 2014 at 8:55 AM, Greg Keogh g...@mira.net wrote: Thanks Greg H, the weighting is a very interesting idea. I'm running some simple experiments now with a word list and an inverted list of file names, just to help me picture the problem in my head. The problem with a weighting comparison is that I don't know what to compare with what, comparing 20,000 file names with every other one might run into the next ice age. However, I like the weighting idea, so I might finish up with a hybrid algorithm. I'll let you know if anything interesting arises out of this -- *Greg K* On 29 November 2014 at 11:17, Greg Harris g...@harrisconsultinggroup.com wrote: Hi Greg, I should look at my code before I write comments from memory... The result is a *double *value being the sum of: · number of times the same letter appears in both strings · 10 times the number of times the same two letters appears in both strings · 100 times the number of times the same three letters appears in both strings *Which is then divided by the length of the two strings to sort of “normalise” the result.* Mixed case is ignored, only compares letters A-Z and 0-9, everything else is excluded. I added a Greg unit test to better show the results which is following… Regards Greg Harris [TestMethod] public void Test_10_Compare3_ForGregKeogh() { // 123456789-123456789-123456789-123456789-12456 string lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; string lTestLine2 = Trumpet Concerto (William Lovelock).mp3; double lExpected = 3033/(36.0 + 33.0); // = 43.9 double lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of exactly the same string, so will get the best posible match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lExpected = 5256/(36.0 + 36.0); // = 73.0 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of exactly the same string, with case difference, which is ignored, // so will also get the best possible match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = LOVELOCK - TRUMPET CONCERTO (SSO CONCERT).mp3; lExpected = 5256/(36.0 + 36.0); // = 73.0 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of a spelling/typing mistake, so will get a very good match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto (SoSo Concert).mp3; lExpected = 5272/(36.0 + 37.0); // = 72.2 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of a truncation, so will get a poor match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto.mp3; lExpected = 3237/(36.0 + 26.0); // = 52.2 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This will get a match on William and a little else... lTestLine1 = Trumpet Concerto (William Lovelock).mp3; lTestLine2 = The Complete Works of William Shakespeare.txt; lExpected = 1202/(33.0 + 39.0); // = 16.69 lResult=
RE: Duplicate matching
Yes, I use Treesize (Professional) when I need to discover files on disks. I’ve had to do it remotely using TeamViewer – hence the Pro version – but a free version and also a trial of the Pro version are available as I recall. It’s worth a try. But I’m interested in the algorithm and the code, since it might be useful within a program of mine and also in a personal scenario similar to Greg K’s. _ Ian Thomas Albert Park, Victoria From: ozdotnet-boun...@ozdotnet.com [mailto:ozdotnet-boun...@ozdotnet.com] On Behalf Of Stephen Price Sent: Saturday, November 29, 2014 12:30 PM To: ozDotNet Subject: Re: Duplicate matching Am curious, is the idea of the exercise to write your own code to solve the problem, or to solve the problem? I've used Treesize pro to find file duplicates in the past. Also have used Directory Opus to find duplicates. Great for finding identical files with different names. Probably won't help if the songs are the same song but from a different source. Your file name pattern matching code would be the way to go. (Which is also the case if this is a programming exercise :) Maybe I'm a lazy coder, I usually look for someone elses product/code before writing my own. I can see the benefit of writing your own too. http://t.signaledue.com/e1t/o/5/f18dQhb0S7ks8dDMPbW2n0x6l2B9gXrN7sKj6v4LGzzVdDZcj8qlRZHN5w6vp0g4p7Cf96836-01?si=6200614728499200pi=27dbf3f9-42ef-41ec-f206-9d6dc151c2c2 On Sat, Nov 29, 2014 at 8:55 AM, Greg Keogh g...@mira.net wrote: Thanks Greg H, the weighting is a very interesting idea. I'm running some simple experiments now with a word list and an inverted list of file names, just to help me picture the problem in my head. The problem with a weighting comparison is that I don't know what to compare with what, comparing 20,000 file names with every other one might run into the next ice age. However, I like the weighting idea, so I might finish up with a hybrid algorithm. I'll let you know if anything interesting arises out of this -- Greg K On 29 November 2014 at 11:17, Greg Harris g...@harrisconsultinggroup.com wrote: Hi Greg, I should look at my code before I write comments from memory... The result is a double value being the sum of: · number of times the same letter appears in both strings · 10 times the number of times the same two letters appears in both strings · 100 times the number of times the same three letters appears in both strings Which is then divided by the length of the two strings to sort of “normalise” the result. Mixed case is ignored, only compares letters A-Z and 0-9, everything else is excluded. I added a Greg unit test to better show the results which is following… Regards Greg Harris [TestMethod] public void Test_10_Compare3_ForGregKeogh() { // 123456789-123456789-123456789-123456789-12456 string lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; string lTestLine2 = Trumpet Concerto (William Lovelock).mp3; double lExpected = 3033/(36.0 + 33.0); // = 43.9 double lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of exactly the same string, so will get the best posible match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lExpected = 5256/(36.0 + 36.0); // = 73.0 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of exactly the same string, with case difference, which is ignored, // so will also get the best possible match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = LOVELOCK - TRUMPET CONCERTO (SSO CONCERT).mp3; lExpected = 5256/(36.0 + 36.0); // = 73.0 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of a spelling/typing mistake, so will get a very good match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto (SoSo Concert).mp3; lExpected = 5272/(36.0 + 37.0); // = 72.2 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of a truncation, so will get a poor match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto.mp3; lExpected = 3237/(36.0 + 26.0); // = 52.2 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This will get a match on William
Re: Duplicate matching
Beyond compare has a dedicated viewer for MP3 files but it looks like it compares the tags not the actual audio. I think for comparing files and folders it's awesome but not sure if it can be used to find the duplicates in a single folder. Also it would need the MP3 tags to be correct (which there are tools for and I think this list has discussed music tools previously) As a side note I deleted all of my music and use music subscriptions now. Switched between several and have finally settled (for now) on Google music pass. So I don't have the problem of duplicate music now :) Sent from my iPhone On 29 Nov 2014, at 9:54 am, ILT (O) il.tho...@outlook.com wrote: Yes, I use Treesize (Professional) when I need to discover files on disks. I’ve had to do it remotely using TeamViewer – hence the Pro version – but a free version and also a trial of the Pro version are available as I recall. It’s worth a try. But I’m interested in the algorithm and the code, since it might be useful within a program of mine and also in a personal scenario similar to Greg K’s. -- Ian Thomas Albert Park, Victoria *From:* ozdotnet-boun...@ozdotnet.com [mailto:ozdotnet-boun...@ozdotnet.com ozdotnet-boun...@ozdotnet.com] *On Behalf Of *Stephen Price *Sent:* Saturday, November 29, 2014 12:30 PM *To:* ozDotNet *Subject:* Re: Duplicate matching Am curious, is the idea of the exercise to write your own code to solve the problem, or to solve the problem? I've used Treesize pro to find file duplicates in the past. Also have used Directory Opus to find duplicates. Great for finding identical files with different names. Probably won't help if the songs are the same song but from a different source. Your file name pattern matching code would be the way to go. (Which is also the case if this is a programming exercise :) Maybe I'm a lazy coder, I usually look for someone elses product/code before writing my own. I can see the benefit of writing your own too. On Sat, Nov 29, 2014 at 8:55 AM, Greg Keogh g...@mira.net wrote: Thanks Greg H, the weighting is a very interesting idea. I'm running some simple experiments now with a word list and an inverted list of file names, just to help me picture the problem in my head. The problem with a weighting comparison is that I don't know what to compare with what, comparing 20,000 file names with every other one might run into the next ice age. However, I like the weighting idea, so I might finish up with a hybrid algorithm. I'll let you know if anything interesting arises out of this -- *Greg K* On 29 November 2014 at 11:17, Greg Harris g...@harrisconsultinggroup.com wrote: Hi Greg, I should look at my code before I write comments from memory... The result is a *double *value being the sum of: · number of times the same letter appears in both strings · 10 times the number of times the same two letters appears in both strings · 100 times the number of times the same three letters appears in both strings *Which is then divided by the length of the two strings to sort of “normalise” the result.* Mixed case is ignored, only compares letters A-Z and 0-9, everything else is excluded. I added a Greg unit test to better show the results which is following… Regards Greg Harris [TestMethod] public void Test_10_Compare3_ForGregKeogh() { // 123456789-123456789-123456789-123456789-12456 string lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; string lTestLine2 = Trumpet Concerto (William Lovelock).mp3; double lExpected = 3033/(36.0 + 33.0); // = 43.9 double lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of exactly the same string, so will get the best posible match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lExpected = 5256/(36.0 + 36.0); // = 73.0 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of exactly the same string, with case difference, which is ignored, // so will also get the best possible match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = LOVELOCK - TRUMPET CONCERTO (SSO CONCERT).mp3; lExpected = 5256/(36.0 + 36.0); // = 73.0 lResult= lTestLine1.CompareSoundsLike( lTestLine2 ); Assert.AreEqualdouble( lExpected, lResult ); // This is an example of a spelling/typing mistake, so will get a very good match lTestLine1 = Lovelock - Trumpet Concerto (SSO Concert).mp3; lTestLine2 = Lovelock - Trumpet Concerto (SoSo Concert).mp3; lExpected = 5272/(36.0 + 37.0); // = 72.2 lResult