On Apr 8, 5:13 am, Nobody <nob...@nowhere.com> wrote:
> On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:
> >> Regular expressions != Parsers
>
> > True, but lots of parsers *use* regular expressions in their
> > tokenizers.  In fact, if you have a pure Python parser, you can often
> > get huge performance gains by rearranging your code slightly so that
> > you can use regular expressions in your tokenizer, because that
> > effectively gives you access to a fast, specialized C library that is
> > built into practically every Python interpreter on the planet.
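
(To make that last quoted point concrete, here is a minimal sketch of the
kind of regexp-based tokenizer I'm talking about.  The token patterns below
are invented purely for illustration, not taken from anybody's real grammar:)

    import re

    # A single compiled pattern does all of the character-level work in C;
    # the Python code only ever sees whole tokens.
    _token_pat = re.compile(r'''
        (?P<NUMBER> \d+(?:\.\d*)? )   # integer or decimal literal
      | (?P<NAME>   [A-Za-z_]\w*  )   # identifier
      | (?P<OP>     [-+*/()=]     )   # single-character operator
      | (?P<SKIP>   \s+           )   # whitespace, discarded below
    ''', re.VERBOSE)

    def tokenize(text):
        # lastgroup names whichever alternative matched, so one pass over
        # the input both splits it and classifies the pieces.
        for m in _token_pat.finditer(text):
            if m.lastgroup != 'SKIP':
                yield m.lastgroup, m.group()

    # >>> list(tokenize('x = 3 + 4.25'))
    # [('NAME', 'x'), ('OP', '='), ('NUMBER', '3'), ('OP', '+'), ('NUMBER', '4.25')]

Note that finditer() silently skips anything the pattern doesn't match; a
real tokenizer would add a catch-all alternative so it can complain about
bad input instead.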
> Unfortunately, a typical regexp library (including Python's) doesn't allow
> you to match against a set of regexps, returning the index of which one
> matched.  Which is what you really want for a tokeniser.

Actually, there is some not-very-well-documented code in the re module that
will let you do exactly that.  But even without that code, using a first cut
of re.split() or re.finditer() to break the string apart into tokens (without
yet classifying them) is usually a huge performance win over anything else in
the standard library (or anything you could write in pure Python) for this
task.

> >> Every time someone tries to parse nested structures using regular
> >> expressions, Jamie Zawinski kills a puppy.
>
> > And yet, if you are parsing stuff in Python, and your parser doesn't
> > use some specialized C code for tokenization (which will probably be
> > regular expressions unless you are using mxtexttools or some other
> > specialized C tokenizer code), your nested structure parser will be
> > dog slow.
>
> The point is that you *cannot* match arbitrarily-nested expressions using
> regexps.  You could, in theory, write a regexp which will match any valid
> syntax up to N levels of nesting, for any finite N.  But in practice, the
> regexp is going to look horrible (and is probably going to be quite
> inefficient if the regexp library uses backtracking rather than a DFA).

Trust me, I already knew that.  But what you just wrote is a much more
useful thing to tell the OP than "Every time someone tries to parse nested
structures using regular expressions, Jamie Zawinski kills a puppy", which
is what I was responding to.  And right after regurgitating that inside
joke, Chris Rebert went on to say "Try using an *actual* parser, such as
Pyparsing".  Which is all well and good, except that the OP will then
download pyparsing, take a look, realize that it uses regexps under the
hood, and possibly be very confused.

> Even tokenising with Python's regexp interface is inefficient if the
> number of token types is large, as you have to test against each regexp
> sequentially.

It's not that bad if you do it right.  You can first rip things apart with
a single regexp, then use a dict-based scheme to categorize the pieces
(there is a rough sketch of what I mean in the P.S. below).

> Ultimately, if you want an efficient parser, you need something with a C
> component, e.g. Plex.

There is no doubt that you can get better performance with C than with
Python.  But for a lot of tasks the Python performance is acceptable, and,
as always: algorithm, algorithm, algorithm...

A case in point: my pure Python RSON parser is faster, on my computer and
on a real-world dataset of JSON data, than the json decoder that ships with
Python 2.6, *even with* the json decoder's C speedups enabled.

Having said that, the current subversion pure Python simplejson parser is
slightly faster than my RSON parser, and the C reimplementation of the
parser in current subversion simplejson completely blows the doors off my
RSON parser.

So a naive translation to C, even by an experienced programmer, may not do
as much for you as an algorithm rework.

Regards,
Pat
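
P.S.  Here is a rough sketch of the "rip things apart first, classify with a
dict afterwards" approach mentioned above.  The pattern, keyword set, and
operator table are all invented for the example; a real tokenizer would also
track positions and complain about characters it doesn't recognize:

    import re

    # One C-level pass splits the input into candidate tokens; the
    # Python-level work afterwards is mostly set/dict lookups rather
    # than per-character logic.
    _chunk_pat = re.compile(r'\d+|\w+|[^\w\s]')

    KEYWORDS = {'if', 'else', 'while'}
    OPERATORS = {'+': 'PLUS', '-': 'MINUS', '=': 'ASSIGN',
                 '(': 'LPAREN', ')': 'RPAREN'}

    def classify(tok):
        if tok in KEYWORDS:
            return 'KEYWORD', tok
        if tok in OPERATORS:
            return OPERATORS[tok], tok
        if tok[0].isdigit():
            return 'NUMBER', tok
        if tok[0].isalpha() or tok[0] == '_':
            return 'NAME', tok
        return 'PUNCT', tok   # anything not listed explicitly above

    def lex(text):
        for m in _chunk_pat.finditer(text):
            yield classify(m.group())

    # >>> list(lex('while x = 3'))
    # [('KEYWORD', 'while'), ('NAME', 'x'), ('ASSIGN', '='), ('NUMBER', '3')]

The only per-character work happens inside the compiled regexp; everything
at the Python level is a cheap membership test or dict lookup.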