Using version 0.3, I changed the checksum range to 5000 bytes and restarted the test.

Detected: 35862
Checksum duplicates: 2930
(Incorrect duplicates: 2906)
Duplicates: 24

This took about 2 hours!!!

2930 checksum matches, of which 2906 were ignored because the lengths didn't 
match.
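As I understand it, the scan works in two stages: group files by a checksum of the first N bytes, then discard any checksum matches whose file lengths differ. A minimal sketch of that idea (my own illustration, not the actual scanner code; the 5000-byte window and MD5 are assumptions based on this test):

```python
import hashlib
import os
from collections import defaultdict

CHECKSUM_BYTES = 5000  # assumed window size, per this test run

def partial_md5(path, nbytes=CHECKSUM_BYTES):
    # Checksum only the first nbytes of the file
    with open(path, "rb") as f:
        return hashlib.md5(f.read(nbytes)).hexdigest()

def find_candidates(paths):
    # Stage 1: group files by partial checksum
    groups = defaultdict(list)
    for p in paths:
        groups[partial_md5(p)].append(p)
    # Stage 2: within each checksum group, keep only files
    # whose lengths also match
    candidates = []
    for files in groups.values():
        if len(files) < 2:
            continue
        by_len = defaultdict(list)
        for p in files:
            by_len[os.path.getsize(p)].append(p)
        candidates.extend(v for v in by_len.values() if len(v) > 1)
    return candidates
```

The 2906 "incorrect duplicates" above would be exactly the groups that pass stage 1 but fail the length test in stage 2.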

Of the remaining 24 files flagged as duplicates:

14 were false positives:
        7 pairs of distinct files incorrectly identified as duplicates

10 were actual duplicates, which I can get rid of.

The report also includes my .mov file (no length check).


So, I am a bit concerned about the random nature of the false positives 
(cases where the checksum and length match exactly, but the songs are really 
different).  E.g. I could potentially reduce the 5000-byte range to 4999 bytes 
and perhaps no duplicates would be found, or increase the checksum byte range 
and end up with more hits.

I think if there is going to be an MD5 checksum calculation to reconnect 
rescanned files to persistent data, some additional checking needs to be 
performed to eliminate more false positives.  E.g. checking the file creation 
timestamp?
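One possible extra check, beyond a timestamp comparison: since the false positives already agree on partial checksum and length, a full-file hash would separate them definitively. A hedged sketch of that verification pass (my own suggestion, not existing scanner code; it streams the file in chunks so large files aren't read into memory at once):

```python
import hashlib

def full_md5(path, chunk_size=1 << 20):
    # Hash the entire file, reading 1 MB at a time
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def confirm_duplicates(candidates):
    # candidates: groups of files that already passed the
    # partial-checksum and length tests; keep only groups
    # whose full-file hashes also agree
    confirmed = []
    for group in candidates:
        by_hash = {}
        for p in group:
            by_hash.setdefault(full_md5(p), []).append(p)
        confirmed.extend(g for g in by_hash.values() if len(g) > 1)
    return confirmed
```

The cost is bounded: it only reads the handful of files that survive the first two filters (24 here), not the whole library, so it shouldn't add much to the 2-hour rescan.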

Increasing rescan time by 2 hours is a bit harsh too.
_______________________________________________
beta mailing list
[email protected]
http://lists.slimdevices.com/mailman/listinfo/beta