On Jul 5, 3:12 am, "Hendrik van Rooyen" <m...@microcorp.co.za> wrote:
>
> Use a dispatch dict, and have each state return the next state.
> Then you can use strings representing state names, and
> everybody will be able to understand the code.
>
> toy example, not tested, nor completed:
>
> protocol = {"start":initialiser,"hunt":hunter,"classify":classifier,....other
> states}
>
> def state_machine():
>     next_step = protocol["start"]()
>     while True:
>         next_step = protocol[next_step]()
>

I've just spent about an hour looking over this code, with a few
comments to inject to the thread here:

- To all those suggesting the OP convert to a dispatch table, be
assured that this code is well aware of this idiom.  It is used
HEAVILY at a macro level, picking through the various HTML states
(starting a tag, reading attributes, reading body, etc.).  There still
are a number of cascading if-elif's within some of these states, and
some of them *may* be candidates for further optimization.

- There is an underlying HTMLInputStream that seems to be doing some
unnecessary position bookkeeping (positionLine and positionCol).
Commenting this out increases my test speed by about 13%.  In my
ignorance, I may be removing some important behavior, but this does
not seem to be critical as I tested against a few megs of HTML
source.  Before blaming the tokenizer for everything, there may be
more performance to be wrung from the input stream processor.  For
that matter, I would guess that about 90% of all HTML files that this
code would process would easily fit in memory - in that case, the
stream processing (and all of the attendant "if I'm not at the end of
the current chunk" code) could be skipped/removed entirely.

- The HTMLInputStream's charsUntil code is an already-identified
bottleneck, and some re enhancements have been applied here to help
out.

- Run-time construction of tuple literals where the tuple members are
constants can be lifted out.  emitCurrentToken rebuilds this tuple
every time it is called (which is a lot!):

    if (token["type"] in (tokenTypes["StartTag"], tokenTypes
["EndTag"], tokenTypes["EmptyTag"])):

Move this tuple literal into a class constant (or if you can tolerate
it, a default method argument to gain LOAD_FAST benefits - sometimes
optimization isn't pretty).

- These kinds of optimizations are pretty small, and only make sense
if they are called frequently.  Tallying which states are called in my
test gives the following list in decreasing frequency.  Such a list
would help guide your further tuning efforts:

tagNameState    194848
dataState       182179
attributeNameState      116507
attributeValueDoubleQuotedState 114931
tagOpenState    105556
beforeAttributeNameState        58612
beforeAttributeValueState       58216
afterAttributeValueState        58083
closeTagOpenState       50547
entityDataState 1673
attributeValueSingleQuotedState 1098
commentEndDashState     372
markupDeclarationOpenState      370
commentEndState 364
commentStartState       362
commentState    362
selfClosingStartTagState        359
doctypePublicIdentifierDoubleQuotedState        291
doctypeSystemIdentifierDoubleQuotedState        247
attributeValueUnQuotedState     191
doctypeNameState        32
beforeDoctypePublicIdentifierState      16
afterDoctypePublicIdentifierState       14
afterDoctypeNameState   9
doctypeState    8
beforeDoctypeNameState  8
afterDoctypeSystemIdentifierState       6
afterAttributeNameState 5
commentStartDashState   2
bogusCommentState       2

For instance, I wouldn't bother doing much tuning of the
bogusCommentState.  Anything called fewer than 50,000 times in this
test doesn't look like it would be worth the trouble.


-- Paul

(Thanks to those who suggested pyparsing as an alternative, but I
think this code is already beyond pyparsing in a few respects.  For
one thing, this code works with an input stream, in order to process
large HTML files; pyparsing *only* works with an in-memory string.
This code can also take advantage of some performance short cuts,
knowing that it is parsing HTML; pyparsing's generic classes can't do
that.)
-- 
http://mail.python.org/mailman/listinfo/python-list

Reply via email to