Hi Rory,

The main slowdown is that messages depend on previous messages, so I
need to match them together. That matching becomes slow when each new
message has to be searched for among too many outstanding ones.

It is somewhat inefficient, and I am working on optimising the
algorithm as we speak, especially in these areas:
 1) replacing lists with dictionaries, since scanning a list gets
horribly slow as it grows, while a dictionary lookup stays constant
time (see the sketch after this list);
 2) either processing messages concurrently, or doing multiple checks
on a single message concurrently;
 3) Diarmuid suggested using a separate process to prefetch entries
from the log files; this may help, since I then wouldn't waste time
waiting on disk IO.
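
A rough sketch of what I mean by (1) - all the names here (the
ExpectedMessage attributes, add_expected, match_expected) are made up
for illustration, not the real code:

from collections import defaultdict

# expected messages bucketed by (message_type, dependent_id), so
# finding candidates is a hash lookup instead of a scan over every
# outstanding message
expected = defaultdict(list)

def add_expected(msg):
    expected[(msg.message_type, msg.dependent_id)].append(msg)

def match_expected(current):
    # O(1) to find the bucket; only the (small) bucket is scanned
    bucket = expected[(current.message_type, current.dependent_id)]
    for candidate in bucket:
        if candidate == current:
            bucket.remove(candidate)
            return candidate
    return None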

There is one more change which should improve the speed drastically.
I was lucky in that it has been decided to change the tests slightly;
as a side effect, I can now make assumptions I couldn't make before,
which will (I think) remove one of the lists that I wasn't able to
dictionary-ify.

Here is what currently happens, in pseudocode:

for entry in log_file:
    current_message = parse_using_regex(entry)
    # look up the messages currently expected for this type and id
    expected_messages_for_type = list_of_expected_messages[current_message.message_type]
    expected_messages = expected_messages_for_type[current_message.id_dependant_on_message_type]
    for message in expected_messages:
        if message == current_message:
            remove_from_expected_messages(message)
            new_messages = get_next_expected_messages(current_message)
            if new_messages is not None:
                # the operation continues: expect the next message(s)
                add_to_expected_messages(new_messages)
                message_from_middle_operation()
            else:
                operation_completed()

One thing to note is that add_to_expected_messages may add more than
one message (and remove_from_expected_messages should then remove
them all), because some messages can expect one of a selection of
alternatives. I believe this can now be removed, thanks to the test
change mentioned above.

There's more going on than that. In fact, there's quite a lot of
added complexity, because messages from different "operations" are
almost identical, so determining the next expected message(s)
requires some convoluted logic. I'm going to try to figure out a
better way, especially as I don't think I'll need to deal with
alternative messages within one operation any more.
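
One idea I'm toying with for untangling that logic (just a sketch -
the operations, message types and table contents are invented, since
I can't show the real ones) is to push the "what comes next" decision
into a table:

# hypothetical transition table: maps the message just matched to the
# message type(s) to expect next; an empty list means the operation
# is complete
NEXT_EXPECTED = {
    ("login", "request"):  ["response"],
    ("login", "response"): [],              # operation finished
    ("order", "request"):  ["ack"],
    ("order", "ack"):      ["confirm"],
    ("order", "confirm"):  [],              # operation finished
}

def get_next_expected_messages(current_message):
    key = (current_message.operation, current_message.message_type)
    return NEXT_EXPECTED.get(key, [])

That way each match becomes one dictionary lookup, and adding a new
operation means adding table entries rather than more branching code.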

Hope that explained it somewhat. If you have any obvious suggestions
for speeding up the code further, please share! Unfortunately, I
can't share the code. The code itself is shareable, really, but the
data it operates on (messages and operations) is not, and it could be
inferred from the code.

Still, the point of my email was to show that even a naive use of
Cython (compiling Python code without modification) can give quite a
good speed increase! (I would have simply written "hey, I used Cython
and my code is faster", but I figured some background would be nice.)
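
For anyone who wants to try the same: this is roughly all the build
machinery it takes (a minimal sketch - "report" is a stand-in for the
real module name):

# setup.py - minimal Cython build sketch
# Rename report.py to report.pyx, then run:
#     python setup.py build_ext --inplace
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension('report', ['report.pyx'])],
)

After that, "import report" works like any other module. The next
step would be sprinkling in cdef declarations for the hot functions
and variables, which is where the bigger wins are supposed to come
from.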

2009/1/9 Rory Geoghegan <[email protected]>:
>
> Not to pick at you, but an hour and a half sounds a bit long for 171k
> entries. If you did not write the code for corporate interests, would
> you mind sharing it, on something like github or bitbucket?
>
> I mean, I know nothing of a) the problem you are trying to solve, or
> b) the constraints you are facing in programming a solution, so I
> will obstinately hold to my view that there must be a problem with
> either the algorithm you employ or the way you've coded it, even in
> the face of precise and clear evidence.
>
> --Rory
>
> On Fri, Jan 9, 2009 at 2:01 PM, Daniel Kersten <[email protected]> wrote:
>>
>> Hi all,
>>
>> I wrote a Python program to generate a report from log files. The log
>> files are generated by a test suite used to test a Java program. The
>> report gives me details about how many messages were processed (it's
>> a message-based program being tested), how long each operation (which
>> may consist of processing one or more messages) took, counts of which
>> operations passed or didn't pass, messages processed per second, etc.
>> The idea is that the main program can be run over a weekend or a week
>> or whatever, and the log files from the test suite are checked by my
>> Python program.
>>
>> The log files can be huge.
>>
>> Yesterday, I ran my program on a log file with 171K entries - it took
>> an hour and a half! (This is why I'm interested in the speedup patch.)
>> There are some algorithmic changes which would be beneficial, but they
>> would require significant code restructuring which, right now, I don't
>> have time for. So I'm looking for simpler ways.
>>
>> I decided to give Cython (cython.org) a shot, since it compiles Python
>> code to C. It supports almost all of Python's constructs; the only
>> major limitation (IMHO - that is, the only feature I really use which
>> Cython does not support) is nested functions and lambdas. Removing
>> them from my code slowed it down a small bit: one of my functions
>> accessed a variable from the outer scope, so I couldn't simply move it
>> into the global scope - and I couldn't pass the variable as an
>> argument because I was storing the function as a callback.
>> Besides that, I made NO other changes to my Python code.
>>
>> The code that took 1 hour and 32 minutes to execute as pure Python
>> completed in 48 minutes!!
>>
>> This can be improved still further, by strategically declaring
>> functions and variables as C types.
>>
>> Just thought I'd share, in case someone else needs more performance
>> out of their Python and doesn't know where to turn.
>>
>> --
>> Daniel Kersten.
>> Leveraging dynamic paradigms since the synergies of 1985.
>>

-- 
Daniel Kersten.
Leveraging dynamic paradigms since the synergies of 1985.
