On Fri, 18 Apr 2008 12:23:08 -0300, Shawn Milochik <[EMAIL PROTECTED]> wrote:
> I'm looping through a tab-delimited file to gather statistics on fill
> rates, lengths, and uniqueness.
>
> For the uniqueness, I made a dictionary with keys which correspond to
> the field names. The values were originally lists, where I would store
> values found in that field. Once I detected a duplicate, I deleted the
> entire element from the dictionary. Any which remained by the end are
> considered unique. Also, if the value was empty, the dictionary element
> was deleted and that field considered not unique.
>
> A friend of mine suggested changing that dictionary of lists into a
> dictionary of dictionaries, for performance reasons. As it turns out,
> the speed increase was ridiculous -- a file which took 42 minutes to
> run dropped down to six seconds.

A dictionary with field names as keys is perfectly reasonable. But a *list* of values has to be searched linearly for every new value: an O(n) process. As your friend suggested, searching a dictionary takes O(1) time. A set is even better in this case, because you have no use for the values stored in the inner dictionary (sets and dictionaries are very similar in their implementation).

> Here is the excerpt of the bit of code which checks for uniqueness.
> It's fully functional, so I'm just looking for any suggestions for
> improving it or any comments. Note that fieldNames is a list
> containing all column headers.
>
> #check for unique values
> #if we are still tracking that field (we haven't yet
> #found a duplicate value).
> if fieldUnique.has_key(fieldNames[index]):
>     #if the current value is a duplicate
>     if fieldUnique[fieldNames[index]].has_key(value):
>         #sys.stderr.write("Field %s is not unique. Found a
>         #duplicate value after checking %d values.\n" %
>         #(fieldNames[index], lineNum))
>         #drop the whole hash element
>         fieldUnique.__delitem__(fieldNames[index])
>     else:
>         #add the new value to the list
>         fieldUnique[fieldNames[index]][value] = 1

- Instead of using fieldNames[index] all over the place, save it in a
variable.

- Your code doesn't show it, but if you are traversing the fieldNames
vector like this:

    for index in range(len(fieldNames)):
        ... using fieldNames[index] ...

it's better to use:

    for fieldName in fieldNames:
        ... using fieldName all over the place ...

- Instead of a.has_key(b), use: b in a

- Instead of fieldUnique.__delitem__(fieldNames[index]) use:
del fieldUnique[fieldName]
(In normal code, __special__ methods are never used directly.)

    #check for unique values
    #if we are still tracking that field (we haven't yet
    #found a duplicate value).
    if fieldName in fieldUnique:
        #if the current value is a duplicate
        if value in fieldUnique[fieldName]:
            #drop the whole field as it's not unique
            del fieldUnique[fieldName]
        else:
            #add the new value to the set
            fieldUnique[fieldName].add(value)
    else:
        # use a set to store the unique values
        fieldUnique[fieldName] = set([value])

-- 
Gabriel Genellina
--
http://mail.python.org/mailman/listinfo/python-list
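To see why the change from lists to dictionaries (or sets) made such a dramatic difference, here is a minimal timing sketch of O(n) list membership versus O(1) set membership. The container size and probe value are made up for illustration; the absent probe forces a full scan in the list case.

```python
import timeit

# 100000 string values, stored once as a list and once as a set
values = [str(i) for i in range(100000)]
as_list = list(values)
as_set = set(values)

# Probing for a value that is absent makes the list scan every element;
# the set does a single hash lookup either way.
list_time = timeit.timeit(lambda: "missing" in as_list, number=100)
set_time = timeit.timeit(lambda: "missing" in as_set, number=100)

print("list lookup: %.4fs for 100 probes" % list_time)
print("set lookup:  %.4fs for 100 probes" % set_time)
```

On any machine the set lookups should come out orders of magnitude faster, which is the same effect Shawn saw going from 42 minutes to six seconds.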
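For completeness, here is the set-based version above wrapped into a small self-contained script, including the empty-value rule from the original description. The sample data and the use of the csv module are my own invention for illustration; the real script presumably reads its tab-delimited file some other way.

```python
import csv
import io

# Made-up stand-in for the tab-delimited input file
sample = (
    "id\tname\tcity\n"
    "1\tAlice\tNYC\n"
    "2\tBob\tNYC\n"
    "3\tCarol\tBoston\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")
fieldNames = next(reader)

# Start by tracking every field with an empty set of seen values
fieldUnique = dict((name, set()) for name in fieldNames)

for row in reader:
    for fieldName, value in zip(fieldNames, row):
        if fieldName not in fieldUnique:
            continue  # already dropped: a duplicate was found earlier
        if not value or value in fieldUnique[fieldName]:
            # empty value or duplicate: the field is not unique
            del fieldUnique[fieldName]
        else:
            # remember this value for later duplicate checks
            fieldUnique[fieldName].add(value)

# Whatever fields remain held only unique, non-empty values
print(sorted(fieldUnique))
```

Here "city" is dropped after the second "NYC", so only "id" and "name" survive as unique fields.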