I'm looping through a tab-delimited file to gather statistics on fill rates, lengths, and uniqueness.
For the uniqueness check, I made a dictionary whose keys are the field names. The values were originally lists, where I would store the values found in that field. Once I detected a duplicate, I deleted the entire element from the dictionary; any fields that remain at the end are considered unique. Also, if a value was empty, the dictionary element was deleted and that field considered not unique.

A friend of mine suggested changing that dictionary of lists into a dictionary of dictionaries, for performance reasons. The speed increase was dramatic: a file which took 42 minutes to run dropped down to six seconds.

Here is the excerpt that checks for uniqueness. It's fully functional, so I'm just looking for suggestions for improving it, or any other comments. Note that fieldNames is a list containing all the column headers.

    #check for unique values
    #if we are still tracking that field (we haven't yet
    #found a duplicate value)
    if fieldNames[index] in fieldUnique:
        #if the current value is a duplicate
        if value in fieldUnique[fieldNames[index]]:
            #sys.stderr.write("Field %s is not unique. Found a duplicate "
            #                 "value after checking %d values.\n"
            #                 % (fieldNames[index], lineNum))
            #drop the whole element; the field is no longer tracked
            del fieldUnique[fieldNames[index]]
        else:
            #record the new value in the inner dictionary
            fieldUnique[fieldNames[index]][value] = 1
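For context, here is a minimal self-contained sketch of how that check sits inside the whole loop. The find_unique_fields wrapper, the csv-based file handling, and the zip over columns are illustrative assumptions, not my actual script; only fieldUnique, fieldNames, and the drop-on-duplicate logic are taken from the excerpt above.

    import csv

    def find_unique_fields(path):
        #illustrative wrapper, not the actual script: returns the names
        #of columns whose values are all non-empty and unique
        with open(path) as infile:
            reader = csv.reader(infile, delimiter='\t')
            fieldNames = next(reader)
            #fieldUnique maps field name -> dict of values seen so far;
            #a field is dropped as soon as a duplicate or empty value
            #turns up, so whatever survives the whole file is unique
            fieldUnique = dict((name, {}) for name in fieldNames)
            for row in reader:
                for name, value in zip(fieldNames, row):
                    if name not in fieldUnique:
                        continue  #already ruled out, skip
                    if not value or value in fieldUnique[name]:
                        del fieldUnique[name]  #empty or duplicate
                    else:
                        fieldUnique[name][value] = 1
        return sorted(fieldUnique)

Since the inner dictionaries only ever hold 1 as a value, a set of seen values would express the same membership test. Either way the lookup is constant time on average, whereas membership testing in a list is a linear scan over all values seen so far, which is presumably where the 42-minutes-to-six-seconds difference came from.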