On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 07:19 PM, larry.mart...@gmail.com wrote:
> > I have a list of tuples that contains a tool_id, a time, and a message.
> > I want to select from this list all the elements where the message
> > matches some string, and all the other elements where the time is
> > within some diff of any matching message for that tool.
> >
> > Here is how I am currently doing this:
>
> No, it's not. This is a fragment of code, without enough clues as to
> what else is going on. We can guess, but that's likely to make a mess.

Of course it's a fragment - it's part of a large program and I was just
showing the relevant parts.

> First question is whether this code works exactly correctly?

Yes, the code works. I end up with just the rows I want.

> Are you only concerned about speed, not fixing features?

Don't know what you mean by 'fixing features'. The code does what I want,
it just takes too long.

> As far as I can tell, the logic that includes the time comparison is
> bogus.

Not at all.

> You don't do anything there to worry about the value of tup[2], just
> whether some item has a nearby time. Of course, I could misunderstand
> the spec.

The data comes from a database. tup[1] is a datetime column, and tdiff
comes from a datetime.timedelta().

> Are you making a global called 'self'? That name is by convention only
> used in methods to designate the instance object. What's the attribute
> self?

Yes, self is my instance object. self.message contains the string of
interest that I need to look for.

> Can cdata have duplicates, and are they significant?

No, it will not have duplicates.

> Are you including the time spent building that as part of your 2 hour
> measurement?

No, the 2 hours is just the time to run the

    cdata[:] = [tup for tup in cdata if determine(tup)]

> Is the list sorted in any way?

Yes, the list is sorted by tool and datetime.

> Chances are your performance bottleneck is the doubly-nested loop. You
> have a list comprehension at top-level code, and inside it calls a
> function that also loops over the 600,000 items. So the inner loop gets
> executed 360 billion times. You can cut this down drastically by some
> judicious sorting, as well as by having a map of lists, where the map is
> keyed by the tool.

Thanks. I will try that.
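Just to make sure I follow the map-of-lists idea, is this roughly what you
mean? Completely untested sketch, using bisect since cdata is already
sorted by tool and datetime; cdata, self.message and tdiff are the same
names as in my existing code below, and times_by_tool / near_match are
just names I made up:

    import bisect
    from collections import defaultdict

    # datetimes of the matching messages, grouped by tool; each list ends
    # up sorted because cdata is already sorted by tool and datetime
    times_by_tool = defaultdict(list)
    for tool, date_time, message in cdata:
        if self.message in message:
            times_by_tool[tool].append(date_time)

    def near_match(tup):
        tool, date_time, message = tup
        if self.message in message:
            return True                     # a matched message itself
        times = times_by_tool.get(tool)
        if not times:
            return False
        # times is sorted, so only the two neighbours of the insertion
        # point can be the closest matching time
        i = bisect.bisect_left(times, date_time)
        return any(abs(times[j] - date_time) <= tdiff
                   for j in (i - 1, i) if 0 <= j < len(times))

    cdata[:] = [tup for tup in cdata if near_match(tup)]

If I've understood you, each row then only gets compared against the
matching times for its own tool, and the bisect means I never scan the
whole list of times.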
> > # record time for each message matching the specified message for
> > # each tool
> >
> > messageTimes = {}
>
> You're building a dictionary; are you actually using the value (1), or
> is only the key relevant?

Only the keys.

> A set is a dict without a value.

Yes, I could use a set, but I don't think that would make it measurably
faster.

> But more importantly, you never look up anything in this dictionary. So
> why isn't it a list? For that matter, why don't you just use the
> messageTimes list?

Yes, it could be a list too.

> > for row in cdata:   # tool, time, message
> >     if self.message in row[2]:
> >         messageTimes[row[0], row[1]] = 1
> >
> > # now pull out each message that is within the time diff for each
> > # matched message, as well as the matched messages themselves
> >
> > def determine(tup):
> >     if self.message in tup[2]: return True   # matched message
> >
> >     for (tool, date_time) in messageTimes:
> >         if tool == tup[0]:
> >             if abs(date_time - tup[1]) <= tdiff:
> >                 return True
> >
> >     return False
> >
> > cdata[:] = [tup for tup in cdata if determine(tup)]
>
> As the code exists, there's no need to copy the list. Just do a simple
> bind.

This statement is to remove the items from cdata that I don't want. I
don't know what you mean by bind - I'm not familiar with that term in
Python.

> > This code works, but it takes way too long to run - e.g. when cdata
> > has 600,000 elements (which is typical for my app) it takes 2 hours
> > for this to run.
> >
> > Can anyone give me some suggestions on speeding this up?
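P.S. Re-reading your last two points - if by a 'simple bind' you just mean
reassigning the name instead of assigning into the slice, i.e.

    cdata = [tup for tup in cdata if determine(tup)]

then I guess that would work here too, as long as nothing else is holding
a reference to the original list; the slice assignment was just my way of
filtering it in place. And since only the keys matter, I suppose
messageTimes really could be a set, e.g.

    messageTimes = set((row[0], row[1]) for row in cdata
                       if self.message in row[2])

though I still doubt that part is where the time is going.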