2009/1/9 Daniel Kersten <[email protected]>

> Rory, definitely! The reason I am using lists at all is because the
> messages are similar and I don't know ahead of time which of the
> fields I am interested in - so I either put them in multiple
> dictionaries (keyed by different criteria) or in a list which is then
> searched. I'm not sure if I can use sets or not - how are they matched
> (e.g. for the `in` operation)?


IIRC the elements have to implement __hash__ (and __eq__) for set
membership tests to work.
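
A minimal sketch (class and field names made up) of what that looks
like - the set buckets elements by hash first, then confirms candidates
with ==:

class Message(object):
    def __init__(self, msg_type, msg_id):
        self.msg_type = msg_type
        self.msg_id = msg_id
    def __hash__(self):
        # bucket by a hash of the identifying fields
        return hash((self.msg_type, self.msg_id))
    def __eq__(self, other):
        # candidates in the same bucket are confirmed with ==
        return (self.msg_type, self.msg_id) == \
               (other.msg_type, other.msg_id)

expected = set([Message("ACK", 1), Message("ACK", 2)])
print Message("ACK", 1) in expected  # True - average O(1), no list scan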


>
>
> Pablo, yes, I imagine that dictionary entries cause loads of hash
> collisions after a certain size... Thanks for the links though - may
> come in useful. From the third article (not yet read, just scanned
> through), some notes:
>  1) I already use a compiled regular expression, since it's always the same.
>  2) I do not need to skip unreadable lines, as I am guaranteed valid input.
>  3) I'm not sure if I can easily multithread my application, because
> of the dependencies on other messages, but it's something I'm looking
> at.
>  4) I don't actually believe my bottleneck is IO, so memory mapping
> the file may not help me much - though if need be, I'll look at that
> later.


Re-reading the article, he only got about 0.3 seconds faster by using
mmap - dunno why, but I thought the gain was bigger. Sorry for the noise :P
Anyway, it's an interesting optimization read: the original version,
written in Erlang, took 8+ minutes, and the final Python one runs in
0.9s (1M rows).
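
For reference, a minimal sketch of the mmap trick from the article
(the pattern and the handler are made up):

import re
import mmap

pattern = re.compile(r"GET /(\S+) ")  # hypothetical log pattern

f = open("access.log", "r+b")
# map the whole file; the regex scans the buffer directly instead of
# paying the per-line overhead of ordinary file iteration
buf = mmap.mmap(f.fileno(), 0)
for match in pattern.finditer(buf):
    handle(match.group(1))  # hypothetical per-hit handler
buf.close()
f.close()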

Pablo



> Gotta read it in more detail though :-)
>
>
> 2009/1/9 Pablo Martí Gamboa <[email protected]>:
> >
> >
> > 2009/1/9 Daniel Kersten <[email protected]>
> >>
> >> Hi Rory,
> >>
> >> The main slowdown is that messages depend on previous messages, so I
> >> need to match them together. This becomes slow if I have to search
> >> through too many messages.
> >>
> >> It is somewhat inefficient and I am working on optimising the
> >> algorithm as we speak, especially in these areas:
> >>  1) replacing lists with dictionaries, since membership tests on
> >> lists are O(n) and get horribly slow as the lists grow.
> >
> > After a certain size, operations on a dict become horribly slow too [0].
> > I'd go for bsddb if the dict gets too big; it offers the same interface
> > as a dict [1].
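> >
> > A minimal sketch of the drop-in usage (filename made up; note that
> > bsddb keys and values must be strings):
> >
> > import bsddb
> >
> > db = bsddb.hashopen("expected_messages.db")  # dict-like, disk-backed
> > db["message-key"] = "serialized message"     # same mapping interface
> > print db.has_key("message-key")              # membership test as usual
> > db.close()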
> >
> > [2] is an interesting read regarding optimizations applied to a program
> > relatively similar to yours, where a good chunk of the speedup came from
> > using mmap instead of the plain file interface. Multiple processes will
> > help too.
> >
> > [0] http://mail.python.org/pipermail/python-dev/2008-December/084530.html
> > [1] http://docs.python.org/library/bsddb.html
> > [2] http://effbot.org/zone/wide-finder.htm
> >
> > Best regards,
> > Pablo
> >
> >>
> >>  2) either processing messages concurrently or doing multiple checks
> >> on a single message concurrently.
> >>  3) Diarmuid suggested using a separate process to prefetch from
> >> the log files - this may work, since the parser then doesn't waste
> >> time waiting on disk IO (see the sketch after this list).
> >>
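> >> A minimal sketch of idea 3, assuming Python 2.6's multiprocessing is
> >> available (file name and handler are made up):
> >>
> >> from multiprocessing import Process, Queue
> >>
> >> def prefetch(path, queue):
> >>     # reader process: pushes raw lines so the parser never blocks on IO
> >>     for entry in open(path):
> >>         queue.put(entry)
> >>     queue.put(None)  # sentinel: no more entries
> >>
> >> queue = Queue(maxsize=10000)  # bounded, so memory use stays flat
> >> Process(target=prefetch, args=("test.log", queue)).start()
> >> for entry in iter(queue.get, None):
> >>     process_entry(entry)  # hypothetical per-entry handler
> >>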
> >> There is one more area which will improve the speed drastically - I
> >> was lucky in that it has been decided to change the tests slightly.
> >> As a side effect, I can now make assumptions I couldn't before, which
> >> will (I think) let me remove one of the lists that I wasn't able to
> >> dictionary-ify.
> >>
> >> Here is what currently happens, in pseudocode:
> >>
> >> for entry in log_file:
> >>     current_message = parse_using_regex(entry)
> >>     # expected messages are bucketed by type, then by a type-specific id
> >>     expected_messages_for_type = \
> >>         list_of_expected_messages[current_message.message_type]
> >>     expected_messages = \
> >>         expected_messages_for_type[current_message.id_dependant_on_message_type]
> >>     for message in expected_messages:
> >>         if message == current_message:
> >>             # matched: retire it, then queue whatever it makes us expect next
> >>             remove_from_expected_messages(message)
> >>             new_messages = get_next_expected_messages(current_message)
> >>             if new_messages is not None:
> >>                 add_to_expected_messages(new_messages)
> >>                 message_from_middle_operation()
> >>             else:
> >>                 operation_completed()
> >>
> >> One thing to note is that the last line may add more than one message
> >> (and the remove should remove them all) because some messages can
> >> expect one of a selection of alternatives - this can now be removed, I
> >> believe.
> >>
> >> There's more going on than that. In fact, there's quite a lot of
> >> complexity added because messages from different "operations" are all
> >> almost the same, so determining the next expected message(s) requires
> >> some convoluted logic. Gonna try to figure out a better way, especially
> >> as I don't think I'll need to deal with alternative messages in one
> >> operation anymore.
> >>
> >> Hope that explained it somewhat. If you have any obvious suggestions
> >> on speeding up the code further - please share! Unfortunately, I can't
> >> share the code. The code itself is shareable, really, but the data it
> >> operates on (messages and operations) is not, and it could be inferred
> >> from the code.
> >>
> >>
> >>
> >> Still, the point of my email was to show that even a naive use of
> >> Cython (compiling Python code without modification) can give quite a
> >> good speed increase! (I would have simply written "hey, I used Cython
> >> and my code is faster", but I figured some background would be nice.)
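> >>
> >> For anyone wanting to try it, a minimal sketch of the build step
> >> (module name made up; Cython's distutils integration circa 2009):
> >>
> >> # setup.py
> >> from distutils.core import setup
> >> from distutils.extension import Extension
> >> from Cython.Distutils import build_ext
> >>
> >> setup(
> >>     cmdclass={'build_ext': build_ext},
> >>     ext_modules=[Extension("report", ["report.pyx"])],
> >> )
> >>
> >> Then "python setup.py build_ext --inplace" and import report as usual.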
> >>
> >> 2009/1/9 Rory Geoghegan <[email protected]>:
> >> >
> >> > Not to pick at you, but an hour and a half sounds a bit long for 171k
> >> > entries. If you did not write the code for corporate interests, would
> >> > you mind sharing it, on something like GitHub or Bitbucket?
> >> >
> >> > I mean, I know nothing of a) the problem you are trying to solve or
> >> > b) the constraints you are facing in programming a solution, so I
> >> > will obstinately hold to my view that there must be a problem with
> >> > either the algorithm you employ or the way you've coded it, even in
> >> > the face of precise and clear evidence.
> >> >
> >> > --Rory
> >> >
> >> > On Fri, Jan 9, 2009 at 2:01 PM, Daniel Kersten <[email protected]>
> >> > wrote:
> >> >>
> >> >> Hi all,
> >> >>
> >> >> I wrote a Python program to generate a report from log files. The log
> >> >> files are generated by a test-suite used to test a Java program. The
> >> >> report gives me details about how many messages (it's a message-based
> >> >> program being tested) were processed, how long each operation (which
> >> >> may consist of processing one or more messages) took, counts of which
> >> >> operations passed or didn't pass, messages processed per second, etc.
> >> >> The idea is that the main program can be run over a weekend or a week
> >> >> or whatever, and the log files from the test suite are checked by my
> >> >> Python program.
> >> >>
> >> >> The log files can be huge.
> >> >>
> >> >> Yesterday, I ran my program on a log file with 171K entries - it took
> >> >> an hour and a half! (This is why I'm interested in the speedup patch.)
> >> >> There are some algorithmic changes which would be beneficial, but they
> >> >> would require significant code restructuring which, right now, I don't
> >> >> have time for. So I'm looking for simpler ways.
> >> >>
> >> >> I decided to give Cython (cython.org) a shot, since it compiles Python
> >> >> code to C. It supports almost all of Python's constructs, the only
> >> >> major limitation (IMHO - that is, the only feature I really use which
> >> >> Cython does not support) being nested functions and lambdas. Removing
> >> >> them from my code slowed it down a small bit, because one of my
> >> >> functions accessed a variable from the outer scope, so I couldn't
> >> >> simply move it into the global scope - and I couldn't pass the
> >> >> variable as an argument because I was storing the function as a
> >> >> callback. Besides that, I made NO other changes to my Python code.
> >> >>
> >> >> The code that took 1 hour and 32 minutes to execute with the pure
> >> >> Python version completed in 48 minutes!!
> >> >>
> >> >> This can be improved still further by strategically declaring
> >> >> functions and variables with C types (see the sketch below).
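> >> >>
> >> >> A hedged example of what that looks like (hypothetical function,
> >> >> goes in a .pyx file):
> >> >>
> >> >> def count_passed(operations):
> >> >>     # cdef gives the counter a plain C int type, so the loop
> >> >>     # increments compile to C arithmetic instead of Python object ops
> >> >>     cdef int count = 0
> >> >>     for op in operations:
> >> >>         if op.passed:
> >> >>             count += 1
> >> >>     return count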
> >> >>
> >> >> Just thought I'd share, in case someone else needs more performance
> >> >> out of their Python and doesn't know where to turn.
> >> >>
> >> >> --
> >> >> Daniel Kersten.
> >> >> Leveraging dynamic paradigms since the synergies of 1985.
> >> >>
> >>
> >>
> >>
> >> --
> >> Daniel Kersten.
> >> Leveraging dynamic paradigms since the synergies of 1985.
> >>
> >>
> >
> >
> >
> > --
> > Pablo Martí
> > http://www.linkedin.com/in/pmarti || http://www.warp.es
> > python -c "print '706d6172746940776172702e6573'.decode('hex')"
> >
> >
> >
>
>
>
> --
> Daniel Kersten.
> Leveraging dynamic paradigms since the synergies of 1985.
>
>


-- 
Pablo Martí
http://www.linkedin.com/in/pmarti || http://www.warp.es
python -c "print '706d6172746940776172702e6573'.decode('hex')"
