Definitely rely on dictionaries and sets more than lists. Finding in a list is O(n) while dicts and sets (finding membership & etc) is O(log(n)), which will speed up your code for very little effort.
--Rory On Fri, Jan 9, 2009 at 3:24 PM, Daniel Kersten <[email protected]> wrote: > > Hi Rory, > > The main slowdown is that messages depend on previous messages, so I > need to match them together. This becomes slow if I have to search > through too many messages. > > It is somewhat inefficient and I am working on optimising the > algorithm as we speak, especially in these areas: > 1) replacing lists with dictionaries, since lists get horribly slow > as they grow. > 2) either processing messages concurrently or doing multiple checks > on a single message concurrently. > 3) Diarmuid suggested perhaps use a separate process to prefetch from > the log files, this may work since I don't waste time waiting on disk > IO. > > There is one more area which will improve the speed drastically - I > was lucky in that it has been decided to change the tests slightly - > as a side effect, I can now make assumptions I couldn't before which > will (I think) remove one of the lists which I wasn't able to > dictionary-ify. > > Here is what currently happens, in pseudocode: > > for entry in log_file: > current_message = parse_using_regex(entry) > expected_messages_for_type = > list_of_expected_messages[current_message.message_type] > expected_messages = > expected_messages_for_type[current_message.id_dependant_on_message_type] > for message in expected_messages: > if message == current_message: > remove_from_expected_messages(message) > new_messages = get_next_expected_messages(current_message) > if new_messages is not None: > add_to_expected_messages(new_messages) > message_from_middle_operation() > else: > operation_completed() > > One thing to note is that the last line may add more than one message > (and the remove should remove them all) because some messages can > expect one of a selection of alternatives - this can now be removed, I > believe. > > Theres more going on than that.. In fact, theres quite a lot of > complexity added because messages from different "operations" are all > almost the same, so to determine the next expected message(s) requires > some convoluted logic. Gonna try figure out a better way, especially > as I don't think I'll need to deal with alternative messages in one > operation anymore. > > Hope that explained it somewhat. If you have any obvious suggestions > on speeding up the code further - please share! Unfortunately, I can't > share the code.. The code itself is sharable really, but the data > operated on (messages and operations) is not and can be inferred from > the code. > > > > Still, the point of my email was to show that even a naive use of > cython (compiling Python code without modification) can give quite a > good speed increase! (I would have simply wrote "hey i used cython and > my code is faster" but I figured some background would be nice) > > 2009/1/9 Rory Geoghegan <[email protected]>: >> >> Not to pick at you, but an hour and a half sounds a bit long for 171k >> entries. If you did not write the code for corporate interests, would >> you mind sharing it, on something like github or bitbicket? >> >> I mean, I know nothing of a) the problem you are trying to solve b) >> the constraints you are facing in programming a solution so I will >> obstinately hold down my views that there must be a problem with >> either the algorithm you employ or the way you've coded it, even in >> the face of precise and clear evidence. >> >> --Rory >> >> On Fri, Jan 9, 2009 at 2:01 PM, Daniel Kersten <[email protected]> wrote: >>> >>> Hi all, >>> >>> I wrote a Python program to generate a report from log files. The log >>> files are generated by a test-suite used to test a java program. The >>> report gives me details about how many messages (its a message based >>> program being tested) were processed, how long each operation (which >>> may consist of processing one or more message) took, counts of which >>> operations passed or didnt pass, messages processed per second etc. >>> The idea is that the main program can be run over a weekend or week or >>> whatever and the log files from the test suite are checked by my >>> Python program. >>> >>> The log files can be huge. >>> >>> Yesterday, I ran my program on a log file with 171K entries - it took >>> an hour and a half! (This is why I'm interested in the sppedup patch) >>> There are some algorithmic changes which would be beneficial, but that >>> would require significant code restructuring which, right now, I dont >>> have time for. So I'm looking for simpler ways. >>> >>> I decided to give Cython (cython.org) a shot, since it compiles Python >>> code to C. IT supports almost all of Pythons constructs, the only >>> major limitation (IMHO - that is, the only feature I really use which >>> Cython does not support) being nested functions and lambdas. Removing >>> them from my code slowed it down a small bit, due to one of my >>> functions accessing a variable from the outer scope, so I couldn't >>> simply move it into the global scope - and I couldn't pass it as an >>> argument because I was storing the function as a callback. >>> Besides that, I made NO other changes to my Python code. >>> >>> The code that took 1 hour and 32 minutes to execute with the pure >>> python version completed in 48 minutes!! >>> >>> This can be improved more still, by strategically declaring functions >>> and variables as C types. >>> >>> Just thought I'd share, in case someone else needs more performance >>> out of their Python and doesn't know where to turn. >>> >>> -- >>> Daniel Kersten. >>> Leveraging dynamic paradigms since the synergies of 1985. >>> >>> > >>> >> >> > >> > > > > -- > Daniel Kersten. > Leveraging dynamic paradigms since the synergies of 1985. > > > > --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Python Ireland" group. To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.ie/group/pythonireland?hl=en -~----------~----~----~----~------~----~------~--~---
