random generation of regular expressions

Kragen Sitaker Tue, 27 Jan 2004 01:39:50 -0800

This is part of a project called "r.gere", for "random genetic
evolution of regular expressions", which I started in September, 2001.
I clearly haven't worked on it much since.


The overall goal of the project is to allow people to write regular
expressions by example, with immediate feedback like Kodos gives you,
but going in both directions.  The r.gere program stochastically
generates regular expressions and tries them out on training data you
supply.  Because regular languages are closed under union, you can
test a lot of regular expressions in a single pass through the
training data.

The 2001 goal was to make it easier to pull implicit fields out of
textual data --- at the time, I was doing a lot of HTML
screen-scraping at Alpiri, and it would have been a lot faster to just
highlight the relevant parts of the HTML and have a program randomly
generate regular expressions to match them.

Today, we have a problem more painful than our inability to automate
Web browsing --- spam that circumvents filters.  The current suite of
word-based naive Bayesian classifiers are vulnerable to attacks on
their word orientation --- splitting up words, using non-"word"
characters in them, etc.

So it occurred to me that a naive Bayesian classifier that used
regular expression matches would probably do a better job.  But how to
know which regular expressions to try?  Well, you could try a lot of
them, and see which ones relate to the classification question you're
trying to answer.  You could even mutate and crossbreed them.

So I took another whack at the problem.

My initial naive attempt at generating regular expressions at random
and feeding them to Python's NFA-based regex engine was foiled by the
repeating-repeats problem.  NFA-based regex engines don't do a good
job coping with Kleene closure applied to a regular expression which
can, itself, match the empty string; for example, x**, (x*)*, (x*|y)*,
or (x*y*)*.  The Python regexp engine tries to detect this situation
when compiling the regular expression, but as of Python 2.3.3, it
isn't completely successful, and instead, it raises a RuntimeError
exception, "Maximum recursion depth exceeded", when you try to
.match() the compiled version of, say, '(?:x*y*)*'.  That expression
(and '(?:x*|y)*') recognizes the same language as '(?:x|y)*', which
doesn't have the same problem, but it isn't obvious to me how to use
rewriting rules to do that transformation.

I spent some time hooking the regular expression random generation up
to the tree-rewriting system I posted sometime in the past, in order
to write regexp stringification rules as tree rewriting rules instead
of overloaded methods (I wrote the rewriting rules in my notebook
during a sleepless night in Bolinas while Beatrice slept, after my
laptop battery gave out), but so far my rewriting rules don't suffice
to prevent regexp application errors.

This program is far longer and more complex than it needs to be to do
what it does, but I regret that I don't have the time to make it
shorter tonight.  In particular, the tree-rewriting engine is very
slow, overly complex, and clumsy.

Here are the rewrite rules for stringifying the regexps:

# this set of rewriting rules turns parse-trees of regular expressions
# into trees of string-generation and concatenation operations that
# should produce a parsable regexp.

# 'start' and 'end' mean "(?:" and ")" respectively, because this
# tree-rewriting language doesn't have string literals.

group(char($x)) -> _lit($x)
group(kleene(kleene($x))) -> group(kleene($x))
group(kleene($x)) -> _concat(group($x) _str(*))
string(concat($x $y)) -> _concat(group($x) group($y))
string(altern($x $y)) -> _concat(_concat(group($x) _str(|)) group($y))
group($x) -> _concat(_concat(start string($x)) end)
string(char($x)) -> group(char($x))
string(kleene($x)) -> group(kleene($x))



Here's the modified version of tree.py I'm using:

#!/usr/bin/python
# A tree is a possibly empty sequence of arguments, the first
# of which is an atom.
# The arguments may be trees, atoms, integers, or variables.
# Variables are "abstract"; atoms and integers are not.
# Trees are "abstract" iff they contain any abstract arguments.
# Things that are not abstract are "concrete".

# Abstract things can be "instantiated" with some mapping of
# variables to concrete things, producing a concrete thing.

# If an abstract thing can be instantiated to produce a
# certain concrete thing, it is said to unify with it.

# TODO:
# - eliminate distinctions between trees, atoms, and integers
# - remove integer type, use integer trees instead
# - add integer functions: + - * / % < = > <= >= & | # - figure out how to implement 
Boolean logic completely (variables incl.)
# - add some facility for nice I/O
# - do some symbolic math with nice I/O

from __future__ import nested_scopes

import string, traceback, sys, StringIO

class thing:
    def __init__(self, value): self.value = value
    def abstract(self): return 0
    def __str__(self): return str(self.value)
    def instantiate(self, mapping): return self
    def compatible_type(self, other):
        return isinstance(other, self.__class__)
    def unify(self, concrete, mapping):
        if not self.compatible_type(concrete): return None
        if self.value != concrete.value: return None
        return mapping
    def __cmp__(self, other):
        if not self.compatible_type(other): return cmp(id(self), id(other))
        return cmp(self.value, other.value)
    def toplevel_result(self, ostream):
        ostream.write('-> ' + str(self) + '\n')
class variable(thing):
    def abstract(self): return 1
    def __str__(self): return "$" + self.value
    def instantiate(self, mapping): return mapping[self.value]
    def unify(self, concrete, mapping):
        if mapping.has_key(self.value) and mapping[self.value] != concrete:
            return None
        mapping[self.value] = concrete
        return mapping
class atom(thing): pass
class integer(thing): pass
class tree(thing):
    def __init__(self, items):
        assert isinstance(items[0], atom)
        self.items = items
        self.traversing = 0
        self.is_abstract = 0
        for item in self.items:
            if item.abstract(): self.is_abstract = 1
    def abstract(self): return self.is_abstract
    def __str__(self):
        if self.traversing: return "((error))"
        self.traversing = 1
        try:
            head, args = self.items[0], self.items[1:]
            return str(head) + "(" + string.join(map(str, args), ' ') + ')'
        finally:
            self.traversing = 0
    def instantiate(self, mapping):
        if not self.abstract(): return self
        return tree(map(lambda item: item.instantiate(mapping), self.items))
    def unify(self, concrete, mapping):
        if not self.compatible_type(concrete): return None
        if len(self.items) != len(concrete.items): return None
        for ii in range(len(self.items)):
            mapping = self.items[ii].unify(concrete.items[ii], mapping)
            if mapping is None: return None
        return mapping
    def __cmp__(self, other):
        if not self.compatible_type(other): return cmp(id(self), id(other))
        for ii in range(len(self.items)):
            if ii >= len(other.items): return 1
            rv = cmp(self.items[ii], other.items[ii])
            if rv != 0: return rv
        if len(self.items) == len(other.items): return 0
        else: return -1

class toplevel_action(thing):
    def __init__(self, value, mapping = None):
        self.value, self.mapping = value, mapping
    def instantiate(self, mapping):
        return self.__class__(self.value, mapping)
    def toplevel_result(self, ostream):
        return apply(self.value, (ostream,), self.mapping)

def unify(abstract, concrete, mapping = None):
    """Return a mapping that instantiates the abstract as the concrete, or None

    If a mapping exists, find it and return it; otherwise, return None.

    """
    if mapping is None: mapping = {}
    mapping = abstract.unify(concrete, mapping)
    if mapping is not None:
        assert abstract.instantiate(mapping) == concrete
    return mapping

### parsing

def cons_string(astring):
    rv = ()
    for ii in range(len(astring)-1, -1, -1):
        rv = (astring[ii], rv)
    return rv

def append_charlists(aa, bb):
    if aa == (): return bb
    else:
        hh, tt = aa
        return (hh, append_charlists(tt, bb))

def restring(charlist):
    rv = []
    while charlist:
        rv.append(charlist[0])
        charlist = charlist[1]
    return string.join(rv, '')

class parse_error(Exception): pass
class unmatched_leftparen(parse_error): pass

def parse_charlist(charlist):
    if charlist == (): raise parse_error("premature eof")
    hh, tt = charlist
    if hh == ')': raise parse_error("unmatched )")
    elif hh == '(': raise parse_error("( not following atom")
    elif hh in string.whitespace: return parse_charlist(tt)
    elif hh == '$': return parse_variable(tt)
    elif hh in string.digits: return parse_number(tt, hh)
    else: return parse_atomstuff(tt, hh)

def parse_atomstuff(charlist, name=''):
    if charlist == (): return atom(name), charlist
    hh, tt = charlist
    if hh in string.whitespace + '()':
        return parse_postatom(strip_whitespace(charlist), name)
    else: return parse_atomstuff(tt, name + hh)

def parse_variable(charlist):
    name, charlist = parse_atomstuff(charlist)
    if name.value == '': raise parse_error("missing variable name")
    return variable(name.value), charlist

def parse_number(charlist, digits):
    if charlist == (): return integer(int(digits)), ()
    hh, tt = charlist
    if hh in string.digits: return parse_number(tt, digits + hh)
    else: return integer(int(digits)), strip_whitespace(charlist)

def parse_postatom(charlist, name):
    if name == '': raise parse_error("missing atom name")
    if charlist == (): return atom(name), ()
    hh, tt = charlist
    if hh == '(': return parse_args(tt, (atom(name),))
    else: return atom(name), charlist

def parse_args(charlist, args):
    if charlist == (): raise unmatched_leftparen("unmatched (", args)
    hh, tt = charlist
    if hh == ')': return tree(args), strip_whitespace(tt)
    elif hh in string.whitespace: return parse_args(tt, args)
    else:
        nextitem, tt = parse_charlist(charlist)
        return parse_args(tt, args + (nextitem,))

def strip_whitespace(charlist):
    if charlist == (): return charlist
    hh, tt = charlist
    if hh in string.whitespace: return strip_whitespace(tt)
    else: return charlist
    
def parse(astring):
    return parse_charlist(cons_string(astring))
def parseall(astring):
    result, remainder = parse(astring)
    assert remainder == ()
    return result

def rewrite(atree, before, after):
    mapping = unify(before, atree)
    if mapping is None: return None
    return after.instantiate(mapping)

def eagerly_rewrite(athing, rules):
    if isinstance(athing, tree):
        items = map(lambda thing: eagerly_rewrite(thing, rules), athing.items)
        athing = tree(items)
    for before, after in rules:
        result = rewrite(athing, before, after)
        if result is not None: return eagerly_rewrite(result, rules)
    return athing

class dummy_writer:
    def write(self, astring): pass
    def flush(self): pass

class parser:
    def __init__(self, infile, lineno=0, outfile=dummy_writer(), prompt=''):
        self.infile, self.lineno, self.outfile, self.prompt = (
            infile, lineno, outfile, prompt)
    def readline(self):
        self.outfile.write(self.prompt)
        self.outfile.flush()
        self.lineno = self.lineno + 1
        return self.infile.readline()
    def parseline(self):
        """Parse at least a line from infile, returning a list of trees.
                  
        When an argument list continues across lines, this function
        prefers parsing multiple lines to reporting an error. """
        inputline = self.readline()
        startlineno = self.lineno
        if inputline == '': raise EOFError
        charlist = strip_whitespace(cons_string(inputline))
        results = []
        while charlist and charlist[0] != '#':
            try: tree, charlist = parse_charlist(charlist)
            except unmatched_leftparen, v:
                nextline = cons_string(self.readline())
                if nextline == ():  # eof
                    raise unmatched_leftparen(*(v.args +
                                              ('line no', startlineno)))
                charlist = append_charlists(charlist, nextline)
            except parse_error, v:
                v.args = v.args + ('line no', startlineno)
                raise v
            else:
                results.append(tree)
        return results

def test_parser():
    infile = ('''
    x a b
    x() a() b()
    x( ) a( ) b( )
    x() a(
    ''' + # lines 6-10 follow
    ''') b()
    # a comment
    snarg fuzz  # another comment
    a(
    ) # yet another
    ''' + # lines 11-14 follow
    '''3
    )
    x() a(
    ) b(
    ''')
    p = parser(StringIO.StringIO(infile))
    assert p.parseline() == []
    assert p.lineno == 1
    assert p.parseline() == map(parseall, 'x a b'.split())
    assert p.lineno == 2
    assert p.parseline() == map(parseall, 'x() a() b()'.split())
    assert p.lineno == 3
    assert p.parseline() == map(parseall, 'x() a() b()'.split())
    assert p.lineno == 4
    assert p.parseline() == map(parseall, 'x() a() b()'.split())
    assert p.lineno == 6
    assert p.parseline() == []
    assert p.lineno == 7
    assert p.parseline() == map(parseall, 'snarg fuzz'.split())
    assert p.lineno == 8
    assert p.parseline() == [parseall('a()')]
    assert p.lineno == 10
    assert p.parseline() == [parseall('3')]
    assert p.lineno == 11
    try: p.parseline()
    except parse_error, pe:
        assert 12 in pe.args, `pe.args`
    else:
        assert 0
    assert_raises(lambda: p.parseline(), parse_error)
    assert_raises(lambda: p.parseline(), EOFError)

class repl:
    """Play with rewriting expressions.  a -> b rewrites a to b.

    d(peanuts) deletes rules for expressions starting with 'peanuts'.

    See the various example *.tree files packaged herewith for
    examples.

    """
    def __init__(self):
        self.rules = []
        #define_action('r($pattern $replacement)', define_rewrite_rule)
        self.define_action('d($atom)', self.delete_rules)
        self.define_action('import($filename)', self.import_file)
    def import_files(self, files):
        for f in files: self.i_file(sys.stdout, f)
    def define_rewrite_rule(self, ostream, pattern, replacement):
        self.rules.append((pattern, replacement))
        ostream.write("will rewrite %s to %s\n" % (pattern, replacement))
    def delete_rules(self, ostream, atom):
        ostream.write('forgetting about %s\n' % atom)
        self.rules[:] = [(pat, rep) for (pat, rep) in self.rules
                         if (not hasattr(pat, 'items')
                             or pat.items[0] != atom)]
    def define_action(self, patternstr, routine):
        self.rules.append((parseall(patternstr), toplevel_action(routine)))
    def import_file(self, ostream, filename): i_file(ostream, filename.value)
    def i_file(self, ostream, filename):
        infile = open(filename)
        self.readfile(parser(infile), ostream)
        infile.close()
        ostream.write("imported %s\n" % filename)
    def readfile(self, cmdlineparser, ostream):
        while 1:
            try:
                try: trees = cmdlineparser.parseline()
                except EOFError: ostream.write("\n"); return
                self.handle_trees(trees, ostream)
            except:
                traceback.print_exc()
    def handle_trees(self, trees, ostream):
        if len(trees) == 1:
            eagerly_rewrite(trees[0],
                            self.rules[:]).toplevel_result(ostream)
        elif len(trees) == 0:
            pass
        elif len(trees) == 3 and trees[1] == atom('->'):
            self.define_rewrite_rule(ostream, trees[0], trees[2])
        else:
            ostream.write("Couldn't understand that: %s\n" %
                                  (map(str, trees)))
    def run_interactively(self, files):
        self.import_files(files)
        cmdlineparser = parser(sys.stdin, outfile=sys.stdout, prompt='      ')
        self.readfile(cmdlineparser, sys.stdout)

class wrong_exception(AssertionError): pass

def assert_raises(subr, exc):
    try:
        subr()
    except exc:
        return 1
    except:
        exc_type, exc_value, tb = sys.exc_info()
        raise wrong_exception, wrong_exception(exc_value), tb
    else:
        raise wrong_exception("no exception")

def test():
    def raise_TypeError():
        raise TypeError
    assert_raises(raise_TypeError, TypeError)
    def assert_wrong():
        assert_raises(raise_TypeError, ValueError)
    assert_raises(assert_wrong, AssertionError)
    assert_raises(assert_wrong, wrong_exception)

    def assert_parse_error(astr):
        assert_raises(lambda: parse(astr), parse_error)

    assert_parse_error('')
    assert_parse_error(')')
    assert_parse_error('(')
    assert_parse_error('  ')
    assert_parse_error(' )')
    assert_parse_error(' (')

    def dparse(astr): return parse(astr)[0]
    def punp(astr): return str(dparse(astr))
    def teq(astr):
        assert punp(astr) == astr, punp(astr)

    teq('asdf')
    teq('asdf()')
    teq('1324')
    teq('asdf(a b c)')
    teq('a(1 3 b() c)')
    teq('[EMAIL PROTECTED]')
    teq('$x')
    teq('a(3 $x b($3) x)')
    assert punp('   as   ') == 'as'
    assert punp('  x  (  y  2  )  ') == 'x(y 2)'
    assert_parse_error('x(y(z)')
    assert_parse_error('x((y) z)')
    assert_parse_error('$')
    assert_parse_error('$()')

    ### instantiation
    atree = dparse('a(b($c d)$e)')
    assert str(atree) == 'a(b($c d) $e)', str(atree)
    atoma = atom('a')
    assert_raises(lambda: atree.instantiate({}), KeyError)
    insttree = atree.instantiate({'c': atoma, 'e': atoma})
    assert str(insttree) == 'a(b(a d) a)', str(insttree)
    insttree = atree.instantiate({'c': atom('b'), 'e': atoma})
    assert str(insttree) == 'a(b(b d) a)', str(insttree)

    ### comparison
    assert atom('a') == atom('a')
    assert atom('a') != atom('b')
    assert integer(1) == integer(1)
    assert integer(1) < integer(2)
    assert integer(2) > integer(1)
    assert variable('a') == variable('a')
    assert variable('a') != variable('b')
    assert atom('a') != variable('a')
    assert variable('a') != atom('a')
    assert atom('1') != variable('1')
    assert dparse('x(a  b c(d e f()) g)') == dparse('x(a b c(d e f()) g)')
    assert dparse('x(a  b c(d e f()) g)') != 'x(a b c(d e f()) g)'
    assert dparse('x(a  b c(d e f()) g)') != dparse('x()')
    assert dparse('x(a)') != dparse('x()')
    assert dparse('x()') != dparse('x(a)')
    assert dparse('x(a)') == dparse('x(a)')
    assert dparse('x(a)') != dparse('x(b)')

    ### unification
    def sunify(stra, strb):
        return unify(dparse(stra), dparse(strb))
    assert sunify('a', 'a') == {}
    assert sunify('a', 'b') == None
    assert sunify('$a', '3') == {'a': integer(3)}
    assert sunify('a(b($c d) $e)', 'a(b(a d) a)') == {
        'c': atom('a'),
        'e': atom('a')}
    assert sunify('a(b($c d) $e)', 'a(b(b d) a)') == {
        'c': atom('b'),
        'e': atom('a')}
    assert sunify('hi($x)', 'hi(kragen(sitaker))') == {
        'x': dparse('kragen(sitaker)')}
    assert sunify('a($a $a)', 'a(b b)') == {'a': atom('b')}
    assert sunify('a($a $a)', 'a(b c)') == None

    ### rewriting
    def srew(stra, strb, strc):
        return str(rewrite(dparse(stra), dparse(strb), dparse(strc)))
    assert srew('a', 'b', 'c') == 'None'
    assert srew('a', '$b', 'c') == 'c'
    assert srew('a', '$b', '$b') == 'a'
    assert srew('a(b)', '$x', '$x') == 'a(b)'
    assert srew('a(b)', '$x', 'y($x)') == 'y(a(b))'
    assert srew('a(b)', 'a($x)', 'c($x)') == 'c(b)'

    ### repetitive rewriting
    def rules(alist):
        def parsepair(apair):
            a, b = apair
            return dparse(a), dparse(b)
        return map(parsepair, alist)
    def evalu(astr, rules):
        return str(eagerly_rewrite(dparse(astr), rules))
    # "laws of form" boolean logic
    lofrules = rules([('not(not($x))', '$x'),
                      ('or(1 $x)', '1'),
                      ('or($x 1)', '1'),
                      ('or($x $x)', '$x')])
    assert evalu('1', lofrules) == '1'
    assert evalu('not(1)', lofrules) == 'not(1)'
    assert evalu('not(not(1))', lofrules) == '1'
    assert evalu('not(not(not(1)))', lofrules) == 'not(1)'
    assert evalu('not(not(not(not(1)))))', lofrules) == '1'
    assert evalu('or(1 not(1))', lofrules) == '1'
    assert evalu('or(not(1) 1)', lofrules) == '1'
    assert evalu('or(1 1)', lofrules) == '1'
    assert evalu('not(or(1 1))', lofrules) == 'not(1)'
    assert evalu('not(or(not(1) not(1)))', lofrules) == '1'

    a = cons_string('fibble')
    b = cons_string('wur')
    c = restring(append_charlists(a, b))
    assert c == restring(a) + restring(b), c
    test_parser()

if __name__ == '__main__':
    test()
    if len(sys.argv) > 1:
        repl().run_interactively(sys.argv[1:])
    else:
        print repl.__doc__
        repl().run_interactively([])


And here's 'genre.py', which generates random regular expressions.
The commented-out 'ok' lines are ones that don't yet pass.

#!/usr/bin/python
import random, string, tree, sys, traceback, re

### definitions of regex tree node types

class char:
    def __init__(self, thechar): self.thechar = thechar
    def render(self):
        if self.thechar not in string.letters + string.digits: return '\\' + 
self.thechar
        else: return self.thechar
    def __repr__(self): return 'char(%s)' % repr(self.thechar)
    atom = tree.atom('char')
    def as_tree(self): return tree.tree([self.atom, tree.atom(self.thechar)])
class kleene:
    def __init__(self, subre): self.subre = subre
    def render(self): return '(?:%s)*' % self.subre.render()
    def __repr__(self): return 'kleene(%s)' % repr(self.subre)
    atom = tree.atom('kleene')
    def as_tree(self): return tree.tree([self.atom, self.subre.as_tree()])
class seq:
    def __init__(self, re1, re2): (self.re1, self.re2) = (re1, re2)
    def render(self): return '%s%s' % (self.re1.render(), self.re2.render())
    def __repr__(self): return 'seq(%s, %s)' % (repr(self.re1), repr(self.re2))
    atom = tree.atom('concat')
    def as_tree(self):
        return tree.tree([self.atom, self.re1.as_tree(), self.re2.as_tree()])
class choice(seq):
    def render(self): return '(?:%s|%s)' % (self.re1.render(), self.re2.render())
    def __repr__(self): return 'choice(%s, %s)' % (repr(self.re1), repr(self.re2))
    atom = tree.atom('altern')

### random generation of regex trees

def randchar(): return char(chr(random.randrange(32, 127)))
def randkleene(): return kleene(randre())
def randseq(): return seq(randre(), randre())
def randchoice(): return choice(randre(), randre())

def randre():
    # we give randchar and randkleene higher probability so this terminates
    return random.choice([randchar, randchar, randchar, randchar, randchar,
                          randkleene, randkleene, randseq, randchoice])()

### converting regexes to strings

def stringify(atree):
    mapping = tree.unify(tree.parse('_lit($x)')[0], atree)
    if mapping:
        # this is too much of a pain to write in the tree-rewriting language
        val = mapping['x'].value
        if val in string.letters + string.digits: return val
        else: return '\\' + val
    mapping = tree.unify(tree.parse('_str($x)')[0], atree)
    if mapping: return mapping['x'].value
    mapping = tree.unify(tree.parse('_concat($a $b)')[0], atree)
    if mapping: return stringify(mapping['a']) + stringify(mapping['b'])
    print "couldn't match %s" % atree
    return atree

myrepl = tree.repl()
nullfile = file('/dev/null', 'w')
myrepl.i_file(nullfile, 're-rewrite.tree')
# these are here because the parser can't cope
myrepl.define_rewrite_rule(nullfile, tree.atom('start'),
                           tree.tree([tree.atom('_str'), tree.atom('(?:')]))
myrepl.define_rewrite_rule(nullfile, tree.atom('end'),
                           tree.tree([tree.atom('_str'), tree.atom(')')]))
def deparse(regex):
    rewritten = tree.eagerly_rewrite(tree.tree([tree.atom('string'), 
regex.as_tree()]), myrepl.rules[:])
    return stringify(rewritten)

### tests

def ok(a, b):
    if a == b: return
    print "mismatch: %s != %s" % (a, b)

def test_stringify_tree():
    ok(stringify(tree.tree([tree.atom('_str'), tree.atom('boo')])), 'boo')

    treein = '_concat(_concat(_concat(_concat(_str(x) _str(*)) _str(|)) _str(y)) 
_str(*))'
    atree = tree.parse(treein)[0]
    ok(str(atree), treein)  # make sure we're using it right
    ok(stringify(atree), 'x*|y*')
    ok(stringify(tree.parse('_concat(_lit(*) _lit(a))')[0]), '\\*a')

    
def test():
    test_stringify_tree()
    ok(deparse(char('a')), 'a')
    ok(deparse(char('.')), '\\.')
    ok(deparse(kleene(choice(char('a'), char('b')))), '(?:a|b)*')
    ok(deparse(seq(choice(char('a'), char('b')), choice(char('c'), char('d')))),
       '(?:a|b)(?:c|d)')
    ok(deparse(kleene(seq(choice(char('a'), char('b')), choice(char('c'), 
char('d'))))),
       '(?:(?:a|b)(?:c|d))*')
    ok(deparse(kleene(char('x'))), 'x*')
    ok(deparse(seq(char('x'), char('y'))), 'xy')
    #ok(deparse(seq(char('x'), seq(char('y'), char('z')))), 'xyz')
    ok(deparse(choice(char('x'), char('y'))), 'x|y')
    #ok(deparse(choice(char('x'), choice(char('y'), char('z')))), 'x|y|z')
    #ok(deparse(choice(choice(char('x'), char('y')), char('z'))), 'x|y|z')
    #ok(deparse(choice(seq(char('x'), char('y')), char('z'))), 'xy|z')
    #ok(deparse(choice(char('x'), seq(char('y'), char('z')))), 'x|yz')
    ok(deparse(kleene(kleene(char('x')))), 'x*')
    # ok(deparse(choice(seq(char('@'), seq(char('o'), char('D'))),
    #                   choice(char('~'), char('9')))),
    #    '[EMAIL PROTECTED]|\\~|9')
    ok(deparse(kleene(seq(kleene(char('d')), char('S')))),
       '(?:d*S)*')
    #ok(deparse(kleene(choice(kleene(seq(char('G'), char('+'))),
    #                          seq(kleene(char('0')), char('='))))),
    #    '(?:(?:G\\+)*|O*\\=)*')

test()

if __name__ == '__main__':
    for ii in range(100):
        x = randre()
        xs = deparse(x)
        print x, xs
        try: print re.compile(xs)
        except: traceback.print_exc()

random generation of regular expressions

Reply via email to