Hi, I am doing a series of very simple string operations on lines I am reading from a large file (~15 million lines). I store the result of these operations in a simple class instance, and then use that instance as a key in a hash table. I found that this is unusually slow. For example:
from collections import defaultdict
import time

class myclass(object):
    __slots__ = ("a", "b", "c", "d")

    def __init__(self, a, b, c, d):
        self.a = a
        self.b = b
        self.c = c
        self.d = d

    def __str__(self):
        return "%s_%s_%s_%s" % (self.a, self.b, self.c, self.d)

    def __hash__(self):
        return hash((self.a, self.b, self.c, self.d))

    def __eq__(self, other):
        return (self.a == other.a and
                self.b == other.b and
                self.c == other.c and
                self.d == other.d)

    __repr__ = __str__

n = 15000000
table = defaultdict(int)
t1 = time.time()
for k in range(1, n):
    myobj = myclass('a' + str(k), 'b', 'c', 'd')
    table[myobj] = 1
t2 = time.time()
print "time: ", float((t2 - t1) / 60.0)

This takes a very long time to run: 11 minutes! For the sake of the example I am not reading anything from a file here, but in my real code I do. Also, I use 'a' + str(k) here, but in my real code this is some simple string operation on the line I read from the file. However, I found that the above code exhibits the real bottleneck, since reading my file into memory (using readlines()) takes only about 4 seconds. I then have to iterate over those lines, but I still think that is more efficient than the 'for line in file' approach, which is even slower.

In the above code, is there a way to optimize the creation of the class instances? I am using defaultdicts instead of ordinary dicts, so I don't know how else to optimize that part of the code. Is there perhaps a way to optimize the way the class is written? If it takes only about 4 seconds to read 15 million lines into memory, it doesn't make sense to me that turning them into simple objects along the way would take that much more.
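In case it helps as a point of comparison, one variant I am considering is dropping the class entirely and using a plain tuple as the dict key, since tuples hash and compare in C with no Python-level __init__/__hash__/__eq__ calls per object. A minimal sketch, with the same made-up 'a' + str(k) keys standing in for my real string operations:

from collections import defaultdict
import time

n = 15000000
table = defaultdict(int)
t1 = time.time()
for k in range(1, n):
    # plain tuple key: hashing and equality run entirely in C,
    # avoiding the per-object Python method-call overhead
    key = ('a' + str(k), 'b', 'c', 'd')
    table[key] = 1
t2 = time.time()
print "time: ", float((t2 - t1) / 60.0)

If I still needed attribute access (key.a rather than key[0]), my understanding is that collections.namedtuple would give named fields while keeping the fast C-level tuple hashing, since it subclasses tuple.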