On Jul 5, 3:12 am, "Hendrik van Rooyen" <m...@microcorp.co.za> wrote: > > Use a dispatch dict, and have each state return the next state. > Then you can use strings representing state names, and > everybody will be able to understand the code. > > toy example, not tested, nor completed: > > protocol = {"start":initialiser,"hunt":hunter,"classify":classifier,....other > states} > > def state_machine(): > next_step = protocol["start"]() > while True: > next_step = protocol[next_step]() >
I've just spent about an hour looking over this code, with a few comments to inject to the thread here: - To all those suggesting the OP convert to a dispatch table, be assured that this code is well aware of this idiom. It is used HEAVILY at a macro level, picking through the various HTML states (starting a tag, reading attributes, reading body, etc.). There still are a number of cascading if-elif's within some of these states, and some of them *may* be candidates for further optimization. - There is an underlying HTMLInputStream that seems to be doing some unnecessary position bookkeeping (positionLine and positionCol). Commenting this out increases my test speed by about 13%. In my ignorance, I may be removing some important behavior, but this does not seem to be critical as I tested against a few megs of HTML source. Before blaming the tokenizer for everything, there may be more performance to be wrung from the input stream processor. For that matter, I would guess that about 90% of all HTML files that this code would process would easily fit in memory - in that case, the stream processing (and all of the attendant "if I'm not at the end of the current chunk" code) could be skipped/removed entirely. - The HTMLInputStream's charsUntil code is an already-identified bottleneck, and some re enhancements have been applied here to help out. - Run-time construction of tuple literals where the tuple members are constants can be lifted out. emitCurrentToken rebuilds this tuple every time it is called (which is a lot!): if (token["type"] in (tokenTypes["StartTag"], tokenTypes ["EndTag"], tokenTypes["EmptyTag"])): Move this tuple literal into a class constant (or if you can tolerate it, a default method argument to gain LOAD_FAST benefits - sometimes optimization isn't pretty). - These kinds of optimizations are pretty small, and only make sense if they are called frequently. Tallying which states are called in my test gives the following list in decreasing frequency. Such a list would help guide your further tuning efforts: tagNameState 194848 dataState 182179 attributeNameState 116507 attributeValueDoubleQuotedState 114931 tagOpenState 105556 beforeAttributeNameState 58612 beforeAttributeValueState 58216 afterAttributeValueState 58083 closeTagOpenState 50547 entityDataState 1673 attributeValueSingleQuotedState 1098 commentEndDashState 372 markupDeclarationOpenState 370 commentEndState 364 commentStartState 362 commentState 362 selfClosingStartTagState 359 doctypePublicIdentifierDoubleQuotedState 291 doctypeSystemIdentifierDoubleQuotedState 247 attributeValueUnQuotedState 191 doctypeNameState 32 beforeDoctypePublicIdentifierState 16 afterDoctypePublicIdentifierState 14 afterDoctypeNameState 9 doctypeState 8 beforeDoctypeNameState 8 afterDoctypeSystemIdentifierState 6 afterAttributeNameState 5 commentStartDashState 2 bogusCommentState 2 For instance, I wouldn't bother doing much tuning of the bogusCommentState. Anything called fewer than 50,000 times in this test doesn't look like it would be worth the trouble. -- Paul (Thanks to those who suggested pyparsing as an alternative, but I think this code is already beyond pyparsing in a few respects. For one thing, this code works with an input stream, in order to process large HTML files; pyparsing *only* works with an in-memory string. This code can also take advantage of some performance short cuts, knowing that it is parsing HTML; pyparsing's generic classes can't do that.) -- http://mail.python.org/mailman/listinfo/python-list