> The new Audio::Scan module is only available in 7.6, so the only
> benefit compared to the previous Duplicate Detector version in your
> case is that cue sheets should now mostly be detected as "Incorrect
> Duplicates" instead of "Duplicates".

Yes, I didn't look at incorrect duplicates, only duplicates. I didn't see any .cue sheet music file issues among the actual duplicates.
> Is it correct to assume that most "Incorrect duplicates" either are:
> - MP3 files encoded with LAME
> or
> - Music that have silence during the first seconds

All pairs of duplicates were .mp3 files; I mainly use LAME, but at least one of the files was FHG (I think that's iTunes - a song from my wife's library).

> It should just be a matter of which parts of the file you include. It
> got a lot better in my setup by just appending the length of the audio
> segment of the file to the checksum, this is the files listed as
> "Incorrect duplicates".

Yes, comparing some other physical aspects of the file would also reduce false positives. Without the length check, I had 2930 duplicates using 5000 bytes in the checksum calculation. But even with the extra length check, some pairs will still be detected as duplicates when they aren't. Is the length the length of the music content (not the file length, which would include metadata blocks that could change)? And to what accuracy - nearest second, or finer? I'm not sure the premise that "the first n bytes of a song plus the length is unique" is good enough. Perhaps take n bytes at the start and n bytes at the end (at some performance cost for the extra seek in each file).

> So the wording is a bit confusing at the moment in the result shown by
> the plugin. In your setup, it means that currently you only have 24
> real duplicates and if I understand you correctly 10 of these were
> actually duplicates so it's only 14 files that are incorrectly
> identified as duplicates.

Yes, only 14. But depending on the logic this duplicate checking is used for, any false positive detections may be problematic.

> I would prefer to not use the file creation timestamp if I can avoid
> it, I'm not sure I can rely on that it's never changed. I understand
> that creation timestamp is different than modification timestamp but
> for the use case I have in mind for this a file can never change its
> identity.

Okay.
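As a side note, the sampled-checksum idea above (first n bytes, last n bytes, plus length) could be sketched roughly like this. This is a hypothetical illustration in Python, not the plugin's actual code; the function name, the MD5 choice, and the chunk size are my assumptions (5000 matching the byte count from my test):

```python
import hashlib
import os

CHUNK = 5000  # bytes sampled from each end of the file (illustrative value)

def fingerprint(path):
    """Hash the first and last CHUNK bytes of a file plus its size.

    Sampling both ends reduces collisions from files that share a
    header, at the cost of one extra seek per file.
    """
    size = os.path.getsize(path)
    h = hashlib.md5()
    with open(path, "rb") as f:
        h.update(f.read(CHUNK))              # first CHUNK bytes
        if size > 2 * CHUNK:
            f.seek(-CHUNK, os.SEEK_END)      # the extra seek mentioned above
        h.update(f.read(CHUNK))              # last CHUNK bytes (may overlap on small files)
    h.update(str(size).encode())             # fold the length in to cut false positives
    return h.hexdigest()
```

Note that this still hashes raw file bytes, so metadata edits near either end of the file would change the fingerprint; hashing only the audio segment (as the plugin apparently does) avoids that.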
On the creation timestamp: there may also be problems with it behaving the same in each OS.

> If you have the time and is willing to try the 7.6 nightly with
> SQLite (or MySQL), I would appreciate to get the results from that to
> see if it's able to identify the duplicates correctly.

I'd love to, but so far I've failed to get the 7.6 scanner to work with either SQLite or MySQL. I think a recent problematic change was backed out, I heard that SBS password protection was causing scanner problems, and pref migration seemed problematic for me last time, so I might try again later.

>> Increasing rescan time by 2 hours is a bit harsh too.
>>
> My experience is that it's A LOT faster in 7.6 with the SQLite
> database.

I can't see why the DB engine would make any difference here. If it does, something is seriously wrong. Surely the time goes into file I/O (seeking) and the processing to calculate the checksum; a lookup on a checksum column with a suitable index should be negligible in comparison?

> Also, the new Audio::Scan module will use the MD5 integrity checksum
> inside of FLAC files, so for FLAC encoded music it should be really
> fast.

Okay, that could be a bigger difference for me - a large proportion of my music library is now FLAC.

_______________________________________________
beta mailing list
[email protected]
http://lists.slimdevices.com/mailman/listinfo/beta
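P.S. For reference on why reading that FLAC checksum is so fast: the FLAC format stores an MD5 of the raw, unencoded audio in the STREAMINFO block (the last 16 of its 34 bytes), which the spec requires to be the first metadata block after the "fLaC" marker. So it's one small read, not a re-hash of the whole file. A rough Python sketch, assuming a well-formed file; the function name is mine, not from Audio::Scan:

```python
def flac_audio_md5(path):
    """Read the MD5 of the decoded audio stream from a FLAC file's STREAMINFO."""
    with open(path, "rb") as f:
        if f.read(4) != b"fLaC":
            raise ValueError("not a FLAC file")
        header = f.read(4)            # 1-byte flags/type + 3-byte big-endian length
        if header[0] & 0x7F != 0:     # block type 0 = STREAMINFO, mandated first
            raise ValueError("STREAMINFO is not the first metadata block")
        f.seek(18, 1)                 # skip block sizes, sample rate, etc.
        return f.read(16).hex()       # final 16 bytes: MD5 of the unencoded audio
```

Because this MD5 covers only the audio samples, two files with identical music but different tags would still match - exactly the property a duplicate detector wants.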
