On 30 Nov 2013, at 11:51 pm, Mike Abdullah <mabdul...@karelia.com> wrote:

> Anything short of a cryptographic hash is unsuitable for you here. You’re 
> living dangerously, asking for trouble when a collision does happen.


So, I’ve written a test which scans a chosen folder, hashing each file (all 
files, not just images) and storing the hashes. If a hash collides, I store it 
in a list and log it.

Scanning my entire hard drive (excluding hidden files), which took several 
hours, sure I had plenty of collisions - but absolutely no false ones - they 
all turned out to be genuine duplicates of existing files. This is using the 
FNV-1a 64-bit hash + length approach.

I’m thinking this is good enough, really. The odds of a particular user having 
two different image files that collide, and happening to add those exact images 
at once to our app must be astronomically low. Talk me out of it :)

Here’s the code, if anyone is interested enough to want to reproduce these 
results, or spot a flaw in my test:

The actual hash, which is part of a category on NSData:

- (u_int64_t)   fnvHash64
{
        unsigned char* p = (unsigned char*)[self bytes];
        
        u_int64_t       h = 14695981039346656037U;
        NSUInteger i = 0, len = [self length];
        
        while( i < len )
                h = p[i++] ^ (h * 1099511628211U);
        
        return h;
}


The hash class is much as Kyle suggested:

@interface GCDataHash   : NSObject <NSCopying>
{
@private
        u_int64_t               mHash;
        NSUInteger      mLength;
}

@property (nonatomic, readonly) NSUInteger      length;

- (instancetype) initWithData:(NSData*) data;

- (BOOL)                isEqual:(id) object;
- (NSUInteger)  hash;

@end

@implementation GCDataHash

@synthesize length = mLength;

- (instancetype) initWithData:(NSData*) data
{
        self = [super init];
        if( self )
        {
                mHash = [data fnvHash64];
                mLength = [data length];
        }
        
        return self;
}


- (NSUInteger)  hash
{
        return mHash;
}


- (BOOL)                isEqual:(id) object
{
        if( object == self )
                return YES;
        
        if(![object isKindOfClass:[GCDataHash class]])
                return NO;
        
        GCDataHash* other = (GCDataHash*)object;
        
        return other->mLength == mLength && other->mHash == mHash;
}


- (NSString*)   description
{
        return [NSString stringWithFormat:@"%@ length=%lu, hash=%lu", [super 
description], mLength, [self hash]];
}


- (id)          copyWithZone:(NSZone *)zone
{
        return [self retain];
}

This is the test, which expects a directory URL. ivars ‘collisions' is a 
mutable array, ‘hashes' a mutable dictionary. With as large a data set as I can 
reasonably muster (which is my entire hard disk), it returns 0 collisions. It's 
set to go into packages so it would scan all the internals of my iPhoto 
library, among others.

- (void)                        runTestWithURL:(NSURL*) url
{
        [collisions removeAllObjects];
        [hashes removeAllObjects];
        
        
        NSDirectoryEnumerator* dirEnum = [[NSFileManager defaultManager] 
enumeratorAtURL:url
                                                                                
                 includingPropertiesForKeys:nil
                                                                                
                                                        
options:NSDirectoryEnumerationSkipsHiddenFiles
                                                                                
                                           errorHandler:^BOOL(NSURL *url, 
NSError *error)
                                                                                
                                                {
                                                                                
                                                        return YES;
                                                                                
                                                }];
        
        NSUInteger count = 0;
        
        for( NSURL* file in dirEnum )
        {
                @autoreleasepool
                {
                        NSError* error = nil;
                        NSData* data = [NSData dataWithContentsOfURL:file
                                                                                
                 options:NSDataReadingMappedIfSafe | NSDataReadingUncached
                                                                                
                   error:&error];
                        
                        if( data )
                        {
                                GCDataHash* hash = [[GCDataHash alloc] 
initWithData:data];
                                
                                // check for hash collision
                                
                                NSURL* existing = [hashes objectForKey:hash];
                                
                                if( existing )
                                {
                                        // the hash collided. See if the data 
is actually different, or whether it's just a copy 
                                        // of an existing file. 

                                        NSData* otherData = [NSData 
dataWithContentsOfURL:existing];
                                         if(![otherData isEqualToData:data])
                                         { 
                                                // we have a bona fide 
collision - hash matches but data doesn't. Save that. 

                                                NSDictionary* crunch = @{ hash 
: file };
                                                 [collisions addObject:crunch]; 

                                                NSLog(@"%lu: collision for path 
'%@', collided with '%@', hash = %@", collisions.count + 1, file, existing, 
hash );
                                         } 
                                }
                                else
                                        [hashes setObject:file forKey:hash];
                                
                                [hash release];
                                
                                ++count;
                        }
                }
        }
        
        NSLog(@"hashes tested: %lu, collided: %lu", count, collisions.count );
}


_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com

This email sent to arch...@mail-archive.com

Reply via email to