Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-07 Thread John Nagle
Stefan Behnel wrote: John Nagle wrote: I have a small web crawler robust enough to parse real-world HTML, which can be appallingly bad. I currently use an extra-robust version of BeautifulSoup, and even that sometimes blows up. So I'm very interested in a new Python parser which supposedly

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-07 Thread Stefan Behnel
John Nagle wrote: > I have a small web crawler robust enough to parse > real-world HTML, which can be appallingly bad. I currently use > an extra-robust version of BeautifulSoup, and even that sometimes > blows up. So I'm very interested in a new Python parser which supposedly > handles bad H

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-06 Thread John Nagle
Steven D'Aprano wrote: On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote: Python is not C. John Nagle is an old hand at Python. He's perfectly aware of this, and I'm sure he's not trying to program C in Python. I'm not entirely sure *what* he is doing, and hopefully he'll speak

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-06 Thread John Nagle
Steven D'Aprano wrote: On Sun, 05 Jul 2009 01:58:13 -0700, Paul Rubin wrote: Steven D'Aprano writes: Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't see how a case statement would help you here: you're not dispatching on a value, but running through a series of tests un

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-06 Thread Jean-Michel Pichavant
Hendrik van Rooyen wrote: "Jean-Michel Pichavant" wrote: Woot ! I'll keep this one in my mind, while I may not be that concerned by speed unlike the OP, I still find this way of doing very simple and so intuitive (one will successfully argue how I was not figuring this out by myself if it

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-06 Thread J Kenneth King
a...@pythoncraft.com (Aahz) writes: > In article , > Hendrik van Rooyen wrote: >> >>But wait - maybe if he passes an iterator around - the equivalent of >>for char in input_stream... Still no good though, unless the next call >>to the iterator is faster than an ordinary python call. > > Calls to

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-06 Thread Hendrik van Rooyen
"Jean-Michel Pichavant" wrote: > Woot ! I'll keep this one in my mind, while I may not be that concerned > by speed unlike the OP, I still find this way of doing very simple and > so intuitive (one will successfully argue how I was not figuring this > out by myself if it was so intuitive). > A

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-06 Thread Jean-Michel Pichavant
protocol = {"start":initialiser,"hunt":hunter,"classify":classifier,other states} def state_machine(): next_step = protocol["start"]() while True: next_step = protocol[next_step]() Woot ! I'll keep this one in my mind, while I may not be that concerned by speed unlike t

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-06 Thread Lawrence D'Oliveiro
In message <4a4f91f9$0$1587$742ec...@news.sonic.net>, John Nagle wrote: > ("It should be written in C" is not an acceptable answer.) I don't see why not. State machines that have to process input byte by byte are well known to be impossible to implement efficiently in high-level languages. That

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-06 Thread David M . Cooke
Martin v. Löwis v.loewis.de> writes: > > This is a good test for Python implementation bottlenecks. Run > > that tokenizer on HTML, and see where the time goes. > > I looked at it with cProfile, and the top function that comes up > for a larger document (52k) is > ...validator.HTMLConformanceCh

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Martin v. Löwis
> This is a good test for Python implementation bottlenecks. Run > that tokenizer on HTML, and see where the time goes. I looked at it with cProfile, and the top function that comes up for a larger document (52k) is ...validator.HTMLConformanceChecker.__iter__. This method dispatches various val

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Aahz
In article , Hendrik van Rooyen wrote: > >But wait - maybe if he passes an iterator around - the equivalent of >for char in input_stream... Still no good though, unless the next call >to the iterator is faster than an ordinary python call. Calls to iterators created by generators are indeed fast

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Lino Mastrodomenico
2009/7/5 Hendrik van Rooyen : > I cannot see how you could avoid a python function call - even if he > bites the bullet and implements my laborious scheme, he would still > have to fetch the next character to test against, inside the current state. > > So if it is the function calls that is slowing

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Hendrik van Rooyen
"Paul Rubin" wrote: > The series of tests is written that way because there is no case > statement available. It is essentially switching on a bunch of > character constants and then doing some additional tests in each > branch. > > It could be that using ord(c) as

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Hendrik van Rooyen
"Steven D'Aprano" wrote: >On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote: > >> Python is not C. > >John Nagle is an old hand at Python. He's perfectly aware of this, and >I'm sure he's not trying to program C in Python. > >I'm not entirely sure *what* he is doing, and hopefully he'

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
John Nagle wrote: >Here's some actual code, from "tokenizer.py". This is called once > for each character in an HTML document, when in "data" state (outside > a tag). It's straightforward code, but look at all those > dictionary lookups. > > def dataState(self): > data = self.str

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Paul McGuire
On Jul 5, 3:12 am, "Hendrik van Rooyen" wrote: > > Use a dispatch dict, and have each state return the next state. > Then you can use strings representing state names, and > everybody will be able to understand the code. > > toy example, not tested, nor completed: > > protocol = {"start":initialis

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
Paul Rubin wrote: > Stefan Behnel writes: >> You may notice that the creation of this exact tuple appears in almost all >> if the conditionals of this method. So it is part of the bottleneck. > > I don't think so. The tuple is only created when the character has > already matched, and for the vas

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Paul Rubin
Stefan Behnel writes: > You may notice that the creation of this exact tuple appears in almost all > if the conditionals of this method. So it is part of the bottleneck. I don't think so. The tuple is only created when the character has already matched, and for the vast majority of the chars in

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
Paul Rubin wrote: > Stefan Behnel writes: >>> # Keep a charbuffer to handle the escapeFlag >>> if self.contentModelFlag in\ >>> (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]): >> Is the tuple >> (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) >> c

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
Paul Rubin wrote: > Stefan Behnel writes: >>> The series of tests is written that way because there is no case >>> statement available. It is essentially switching on a bunch of >>> character constants and then doing some additional tests in each >>> branch. >> Although doing some of the tests fir

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Paul Rubin
Stefan Behnel writes: > > The series of tests is written that way because there is no case > > statement available. It is essentially switching on a bunch of > > character constants and then doing some additional tests in each > > branch. > Although doing some of the tests first and then checking

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Paul Rubin
Stefan Behnel writes: > > # Keep a charbuffer to handle the escapeFlag > > if self.contentModelFlag in\ > > (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]): > Is the tuple > (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) > constant? If that is t

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
Paul Rubin wrote: > Steven D'Aprano writes: >> Yes, I'm aware of that, but that's not what John's code is doing -- he's >> doing a series of if expr ... elif expr tests. I don't think a case >> statement can do much to optimize that. > > The series of tests is written that way because there is n

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
Stefan Behnel wrote: > John Nagle wrote: >>Here's some actual code, from "tokenizer.py". This is called once >> for each character in an HTML document, when in "data" state (outside >> a tag). It's straightforward code, but look at all those >> dictionary lookups. >> >> def dataState(self

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Paul Rubin
Steven D'Aprano writes: > Yes, I'm aware of that, but that's not what John's code is doing -- he's > doing a series of if expr ... elif expr tests. I don't think a case > statement can do much to optimize that. The series of tests is written that way because there is no case statement available

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Steven D'Aprano
On Sun, 05 Jul 2009 01:58:13 -0700, Paul Rubin wrote: > Steven D'Aprano writes: >> Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't >> see how a case statement would help you here: you're not dispatching on >> a value, but running through a series of tests until one passes.

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Steven D'Aprano
On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote: > Python is not C. John Nagle is an old hand at Python. He's perfectly aware of this, and I'm sure he's not trying to program C in Python. I'm not entirely sure *what* he is doing, and hopefully he'll speak up and say, but whatever

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
John Nagle wrote: > Paul Rubin wrote: >> John Nagle writes: >>> Python doesn't have a "switch" or "case" statement, and when >>> you need a state machine with many states, that makes for painful, >>> slow code. ... >>> There's a comment in the code that it would be useful >>> to run a few billion

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
John Nagle wrote: >Here's some actual code, from "tokenizer.py". This is called once > for each character in an HTML document, when in "data" state (outside > a tag). It's straightforward code, but look at all those > dictionary lookups. > > def dataState(self): > data = self.str

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Paul Rubin
Steven D'Aprano writes: > Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't > see how a case statement would help you here: you're not dispatching on a > value, but running through a series of tests until one passes. A case statement switch(x):... into a bunch of constant

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
John Nagle wrote: > Python doesn't have a "switch" or "case" statement, and when > you need a state machine with many states, that makes for painful, > slow code. Cython has a built-in optimisation that maps if-elif-else chains to C's switch statement if they only test a single int/char variable,

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Steven D'Aprano
On Sat, 04 Jul 2009 20:15:11 -0700, John Nagle wrote: > Paul Rubin wrote: >> John Nagle writes: >>> A dictionary lookup (actually, several of them) for every >>> input character is rather expensive. Tokenizers usually index into a >>> table of character classes, then use the character class i

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Hendrik van Rooyen
"John Nagle" wrote: > Python doesn't have a "switch" or "case" statement, and when > you need a state machine with many states, that makes for painful, > slow code. There's a comment in the code that it would be useful > to run a few billion lines of HTML through an instrumented version > of the

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Nick Craig-Wood
John Nagle wrote: > As an example of code that really needs to run fast, but is > speed-limited by Python's limitations, see "tokenizer.py" in > > http://code.google.com/p/html5lib/ > > This is a parser for HTML 5, a piece of code that will be needed > in many places and will proce

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread Carl Banks
On Jul 4, 4:35 pm, John Nagle wrote: >     The temptation is to write tokenizers in C, but that's an admission > of language design failure. No it isn't. It's only a failure of Python to be the language that does everything *you* want. Carl Banks -- http://mail.python.org/mailman/listinfo/pyth

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread Aahz
In article <4a501a5e$0$1640$742ec...@news.sonic.net>, John Nagle wrote: > >Here's some actual code, from "tokenizer.py". This is called once >for each character in an HTML document, when in "data" state (outside >a tag). It's straightforward code, but look at all those >dictionary lookups.

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread John Nagle
Paul Rubin wrote: John Nagle writes: A dictionary lookup (actually, several of them) for every input character is rather expensive. Tokenizers usually index into a table of character classes, then use the character class index in a switch statement. Maybe you could use a regexp (and then

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread Ben Finney
John Nagle writes: >A dictionary lookup (actually, several of them) for every input > character is rather expensive. Tokenizers usually index into a table > of character classes, then use the character class index in a switch > statement. > >This is an issue that comes up whenever you ha

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread Nobody
On Sat, 04 Jul 2009 16:35:08 -0700, John Nagle wrote: > The temptation is to write tokenizers in C, but that's an admission > of language design failure. The only part that really needs to be written in C is the DFA loop. The code to construct the state table from regexps could be written ent

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread Paul Rubin
John Nagle writes: > A dictionary lookup (actually, several of them) for every > input character is rather expensive. Tokenizers usually index into > a table of character classes, then use the character class index in > a switch statement. Maybe you could use a regexp (and then have -two- pro

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread John Nagle
Paul Rubin wrote: John Nagle writes: Python doesn't have a "switch" or "case" statement, and when you need a state machine with many states, that makes for painful, slow code. ... There's a comment in the code that it would be useful to run a few billion lines of HTML through an instrumented v

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread Paul Rubin
John Nagle writes: > Python doesn't have a "switch" or "case" statement, and when > you need a state machine with many states, that makes for painful, > slow code. ... > There's a comment in the code that it would be useful > to run a few billion lines of HTML through an instrumented version > of

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread Mel
John Nagle wrote: [ ... ] > Parsers have many named compile-time constants. Python doesn't support > named compile-time constants, and this is one of the places where we > have to pay the bill for that limitation. > > Something to think about when you need three more racks of servers > because th

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread Benjamin Kaplan
On Sat, Jul 4, 2009 at 1:33 PM, John Nagle wrote: >   As an example of code that really needs to run fast, but is > speed-limited by Python's limitations, see "tokenizer.py" in > >        http://code.google.com/p/html5lib/ > > This is a parser for HTML 5, a piece of code that will be needed > in ma

Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread John Nagle
As an example of code that really needs to run fast, but is speed-limited by Python's limitations, see "tokenizer.py" in http://code.google.com/p/html5lib/ This is a parser for HTML 5, a piece of code that will be needed in many places and will process large amounts of data. It's writ