Re: [Python-Dev] Regular expressions, Unicode etc.

Mike Klaas Wed, 08 Aug 2007 13:30:15 -0700

In 8-Aug-07, at 12:47 PM, Nick Maclaren wrote:

>
>>> The other approach, which is to stick to true regular expressions,
>>> and wholly or partially convert to DFAs, has already been rendered
>>> impossible by even the limited Perl/PCRE extensions that Python
>>> has adopted.
>>
>> Impossible?  Surely, a sufficiently-competent re engine could detect
>> when a DFA is possible to construct?
>
> I doubt it.  While it isn't equivalent to the halting problem, it IS
> an intractable one!  There are two problems:
>
> Firstly, things like backreferences are an absolute no-no.  They
> are not regular, and REs with them in cannot be converted to DFAs.
> That could be 'solved' by a parser that kicked out such constructions,
> but it would get screams from many users.
>
> Secondly, anything involving explicit or implicit negation can lead
> to (if I recall) a super-exponential explosion in the size of the
> DFA.  That could be 'solved' by imposing a limit, but few people
> would be able to predict when it would bite.


Right.  The analysis I envisioned would be more along the lines of  
"if troublesome RE extensions are used, do not attempt to construct a  
DFA".  It could even be exposed via an alternate api (re.compile_dfa 
()) that admitted a subset of the usual grammar.

> Thirdly, I would require notice of the question of whether capturing
> parentheses could be supported, and what semantic changes would be
> to which were set and how.

Capturing groups are rather integral to the python regex api and, as  
you say, a major difficulty for DFA-based implementations.  Sounds  
like a task best left to a thirdparty package.

-Mike
_______________________________________________
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Regular expressions, Unicode etc.

Reply via email to