This is part of a project called "r.gere", for "random genetic evolution of regular expressions", which I started in September, 2001. I clearly haven't worked on it much since.
The overall goal of the project is to allow people to write regular expressions by example, with immediate feedback like Kodos gives you, but going in both directions. The r.gere program stochastically generates regular expressions and tries them out on training data you supply. Because regular languages are closed under union, you can test a lot of regular expressions in a single pass through the training data. The 2001 goal was to make it easier to pull implicit fields out of textual data --- at the time, I was doing a lot of HTML screen-scraping at Alpiri, and it would have been a lot faster to just highlight the relevant parts of the HTML and have a program randomly generate regular expressions to match them. Today, we have a problem more painful than our inability to automate Web browsing --- spam that circumvents filters. The current suite of word-based naive Bayesian classifiers are vulnerable to attacks on their word orientation --- splitting up words, using non-"word" characters in them, etc. So it occurred to me that a naive Bayesian classifier that used regular expression matches would probably do a better job. But how to know which regular expressions to try? Well, you could try a lot of them, and see which ones relate to the classification question you're trying to answer. You could even mutate and crossbreed them. So I took another whack at the problem. My initial naive attempt at generating regular expressions at random and feeding them to Python's NFA-based regex engine was foiled by the repeating-repeats problem. NFA-based regex engines don't do a good job coping with Kleene closure applied to a regular expression which can, itself, match the empty string; for example, x**, (x*)*, (x*|y)*, or (x*y*)*. The Python regexp engine tries to detect this situation when compiling the regular expression, but as of Python 2.3.3, it isn't completely successful, and instead, it raises a RuntimeError exception, "Maximum recursion depth exceeded", when you try to .match() the compiled version of, say, '(?:x*y*)*'. That expression (and '(?:x*|y)*') recognizes the same language as '(?:x|y)*', which doesn't have the same problem, but it isn't obvious to me how to use rewriting rules to do that transformation. I spent some time hooking the regular expression random generation up to the tree-rewriting system I posted sometime in the past, in order to write regexp stringification rules as tree rewriting rules instead of overloaded methods (I wrote the rewriting rules in my notebook during a sleepless night in Bolinas while Beatrice slept, after my laptop battery gave out), but so far my rewriting rules don't suffice to prevent regexp application errors. This program is far longer and more complex than it needs to be to do what it does, but I regret that I don't have the time to make it shorter tonight. In particular, the tree-rewriting engine is very slow, overly complex, and clumsy. Here are the rewrite rules for stringifying the regexps: # this set of rewriting rules turns parse-trees of regular expressions # into trees of string-generation and concatenation operations that # should produce a parsable regexp. # 'start' and 'end' mean "(?:" and ")" respectively, because this # tree-rewriting language doesn't have string literals. group(char($x)) -> _lit($x) group(kleene(kleene($x))) -> group(kleene($x)) group(kleene($x)) -> _concat(group($x) _str(*)) string(concat($x $y)) -> _concat(group($x) group($y)) string(altern($x $y)) -> _concat(_concat(group($x) _str(|)) group($y)) group($x) -> _concat(_concat(start string($x)) end) string(char($x)) -> group(char($x)) string(kleene($x)) -> group(kleene($x)) Here's the modified version of tree.py I'm using: #!/usr/bin/python # A tree is a possibly empty sequence of arguments, the first # of which is an atom. # The arguments may be trees, atoms, integers, or variables. # Variables are "abstract"; atoms and integers are not. # Trees are "abstract" iff they contain any abstract arguments. # Things that are not abstract are "concrete". # Abstract things can be "instantiated" with some mapping of # variables to concrete things, producing a concrete thing. # If an abstract thing can be instantiated to produce a # certain concrete thing, it is said to unify with it. # TODO: # - eliminate distinctions between trees, atoms, and integers # - remove integer type, use integer trees instead # - add integer functions: + - * / % < = > <= >= & | # - figure out how to implement Boolean logic completely (variables incl.) # - add some facility for nice I/O # - do some symbolic math with nice I/O from __future__ import nested_scopes import string, traceback, sys, StringIO class thing: def __init__(self, value): self.value = value def abstract(self): return 0 def __str__(self): return str(self.value) def instantiate(self, mapping): return self def compatible_type(self, other): return isinstance(other, self.__class__) def unify(self, concrete, mapping): if not self.compatible_type(concrete): return None if self.value != concrete.value: return None return mapping def __cmp__(self, other): if not self.compatible_type(other): return cmp(id(self), id(other)) return cmp(self.value, other.value) def toplevel_result(self, ostream): ostream.write('-> ' + str(self) + '\n') class variable(thing): def abstract(self): return 1 def __str__(self): return "$" + self.value def instantiate(self, mapping): return mapping[self.value] def unify(self, concrete, mapping): if mapping.has_key(self.value) and mapping[self.value] != concrete: return None mapping[self.value] = concrete return mapping class atom(thing): pass class integer(thing): pass class tree(thing): def __init__(self, items): assert isinstance(items[0], atom) self.items = items self.traversing = 0 self.is_abstract = 0 for item in self.items: if item.abstract(): self.is_abstract = 1 def abstract(self): return self.is_abstract def __str__(self): if self.traversing: return "((error))" self.traversing = 1 try: head, args = self.items[0], self.items[1:] return str(head) + "(" + string.join(map(str, args), ' ') + ')' finally: self.traversing = 0 def instantiate(self, mapping): if not self.abstract(): return self return tree(map(lambda item: item.instantiate(mapping), self.items)) def unify(self, concrete, mapping): if not self.compatible_type(concrete): return None if len(self.items) != len(concrete.items): return None for ii in range(len(self.items)): mapping = self.items[ii].unify(concrete.items[ii], mapping) if mapping is None: return None return mapping def __cmp__(self, other): if not self.compatible_type(other): return cmp(id(self), id(other)) for ii in range(len(self.items)): if ii >= len(other.items): return 1 rv = cmp(self.items[ii], other.items[ii]) if rv != 0: return rv if len(self.items) == len(other.items): return 0 else: return -1 class toplevel_action(thing): def __init__(self, value, mapping = None): self.value, self.mapping = value, mapping def instantiate(self, mapping): return self.__class__(self.value, mapping) def toplevel_result(self, ostream): return apply(self.value, (ostream,), self.mapping) def unify(abstract, concrete, mapping = None): """Return a mapping that instantiates the abstract as the concrete, or None If a mapping exists, find it and return it; otherwise, return None. """ if mapping is None: mapping = {} mapping = abstract.unify(concrete, mapping) if mapping is not None: assert abstract.instantiate(mapping) == concrete return mapping ### parsing def cons_string(astring): rv = () for ii in range(len(astring)-1, -1, -1): rv = (astring[ii], rv) return rv def append_charlists(aa, bb): if aa == (): return bb else: hh, tt = aa return (hh, append_charlists(tt, bb)) def restring(charlist): rv = [] while charlist: rv.append(charlist[0]) charlist = charlist[1] return string.join(rv, '') class parse_error(Exception): pass class unmatched_leftparen(parse_error): pass def parse_charlist(charlist): if charlist == (): raise parse_error("premature eof") hh, tt = charlist if hh == ')': raise parse_error("unmatched )") elif hh == '(': raise parse_error("( not following atom") elif hh in string.whitespace: return parse_charlist(tt) elif hh == '$': return parse_variable(tt) elif hh in string.digits: return parse_number(tt, hh) else: return parse_atomstuff(tt, hh) def parse_atomstuff(charlist, name=''): if charlist == (): return atom(name), charlist hh, tt = charlist if hh in string.whitespace + '()': return parse_postatom(strip_whitespace(charlist), name) else: return parse_atomstuff(tt, name + hh) def parse_variable(charlist): name, charlist = parse_atomstuff(charlist) if name.value == '': raise parse_error("missing variable name") return variable(name.value), charlist def parse_number(charlist, digits): if charlist == (): return integer(int(digits)), () hh, tt = charlist if hh in string.digits: return parse_number(tt, digits + hh) else: return integer(int(digits)), strip_whitespace(charlist) def parse_postatom(charlist, name): if name == '': raise parse_error("missing atom name") if charlist == (): return atom(name), () hh, tt = charlist if hh == '(': return parse_args(tt, (atom(name),)) else: return atom(name), charlist def parse_args(charlist, args): if charlist == (): raise unmatched_leftparen("unmatched (", args) hh, tt = charlist if hh == ')': return tree(args), strip_whitespace(tt) elif hh in string.whitespace: return parse_args(tt, args) else: nextitem, tt = parse_charlist(charlist) return parse_args(tt, args + (nextitem,)) def strip_whitespace(charlist): if charlist == (): return charlist hh, tt = charlist if hh in string.whitespace: return strip_whitespace(tt) else: return charlist def parse(astring): return parse_charlist(cons_string(astring)) def parseall(astring): result, remainder = parse(astring) assert remainder == () return result def rewrite(atree, before, after): mapping = unify(before, atree) if mapping is None: return None return after.instantiate(mapping) def eagerly_rewrite(athing, rules): if isinstance(athing, tree): items = map(lambda thing: eagerly_rewrite(thing, rules), athing.items) athing = tree(items) for before, after in rules: result = rewrite(athing, before, after) if result is not None: return eagerly_rewrite(result, rules) return athing class dummy_writer: def write(self, astring): pass def flush(self): pass class parser: def __init__(self, infile, lineno=0, outfile=dummy_writer(), prompt=''): self.infile, self.lineno, self.outfile, self.prompt = ( infile, lineno, outfile, prompt) def readline(self): self.outfile.write(self.prompt) self.outfile.flush() self.lineno = self.lineno + 1 return self.infile.readline() def parseline(self): """Parse at least a line from infile, returning a list of trees. When an argument list continues across lines, this function prefers parsing multiple lines to reporting an error. """ inputline = self.readline() startlineno = self.lineno if inputline == '': raise EOFError charlist = strip_whitespace(cons_string(inputline)) results = [] while charlist and charlist[0] != '#': try: tree, charlist = parse_charlist(charlist) except unmatched_leftparen, v: nextline = cons_string(self.readline()) if nextline == (): # eof raise unmatched_leftparen(*(v.args + ('line no', startlineno))) charlist = append_charlists(charlist, nextline) except parse_error, v: v.args = v.args + ('line no', startlineno) raise v else: results.append(tree) return results def test_parser(): infile = (''' x a b x() a() b() x( ) a( ) b( ) x() a( ''' + # lines 6-10 follow ''') b() # a comment snarg fuzz # another comment a( ) # yet another ''' + # lines 11-14 follow '''3 ) x() a( ) b( ''') p = parser(StringIO.StringIO(infile)) assert p.parseline() == [] assert p.lineno == 1 assert p.parseline() == map(parseall, 'x a b'.split()) assert p.lineno == 2 assert p.parseline() == map(parseall, 'x() a() b()'.split()) assert p.lineno == 3 assert p.parseline() == map(parseall, 'x() a() b()'.split()) assert p.lineno == 4 assert p.parseline() == map(parseall, 'x() a() b()'.split()) assert p.lineno == 6 assert p.parseline() == [] assert p.lineno == 7 assert p.parseline() == map(parseall, 'snarg fuzz'.split()) assert p.lineno == 8 assert p.parseline() == [parseall('a()')] assert p.lineno == 10 assert p.parseline() == [parseall('3')] assert p.lineno == 11 try: p.parseline() except parse_error, pe: assert 12 in pe.args, `pe.args` else: assert 0 assert_raises(lambda: p.parseline(), parse_error) assert_raises(lambda: p.parseline(), EOFError) class repl: """Play with rewriting expressions. a -> b rewrites a to b. d(peanuts) deletes rules for expressions starting with 'peanuts'. See the various example *.tree files packaged herewith for examples. """ def __init__(self): self.rules = [] #define_action('r($pattern $replacement)', define_rewrite_rule) self.define_action('d($atom)', self.delete_rules) self.define_action('import($filename)', self.import_file) def import_files(self, files): for f in files: self.i_file(sys.stdout, f) def define_rewrite_rule(self, ostream, pattern, replacement): self.rules.append((pattern, replacement)) ostream.write("will rewrite %s to %s\n" % (pattern, replacement)) def delete_rules(self, ostream, atom): ostream.write('forgetting about %s\n' % atom) self.rules[:] = [(pat, rep) for (pat, rep) in self.rules if (not hasattr(pat, 'items') or pat.items[0] != atom)] def define_action(self, patternstr, routine): self.rules.append((parseall(patternstr), toplevel_action(routine))) def import_file(self, ostream, filename): i_file(ostream, filename.value) def i_file(self, ostream, filename): infile = open(filename) self.readfile(parser(infile), ostream) infile.close() ostream.write("imported %s\n" % filename) def readfile(self, cmdlineparser, ostream): while 1: try: try: trees = cmdlineparser.parseline() except EOFError: ostream.write("\n"); return self.handle_trees(trees, ostream) except: traceback.print_exc() def handle_trees(self, trees, ostream): if len(trees) == 1: eagerly_rewrite(trees[0], self.rules[:]).toplevel_result(ostream) elif len(trees) == 0: pass elif len(trees) == 3 and trees[1] == atom('->'): self.define_rewrite_rule(ostream, trees[0], trees[2]) else: ostream.write("Couldn't understand that: %s\n" % (map(str, trees))) def run_interactively(self, files): self.import_files(files) cmdlineparser = parser(sys.stdin, outfile=sys.stdout, prompt=' ') self.readfile(cmdlineparser, sys.stdout) class wrong_exception(AssertionError): pass def assert_raises(subr, exc): try: subr() except exc: return 1 except: exc_type, exc_value, tb = sys.exc_info() raise wrong_exception, wrong_exception(exc_value), tb else: raise wrong_exception("no exception") def test(): def raise_TypeError(): raise TypeError assert_raises(raise_TypeError, TypeError) def assert_wrong(): assert_raises(raise_TypeError, ValueError) assert_raises(assert_wrong, AssertionError) assert_raises(assert_wrong, wrong_exception) def assert_parse_error(astr): assert_raises(lambda: parse(astr), parse_error) assert_parse_error('') assert_parse_error(')') assert_parse_error('(') assert_parse_error(' ') assert_parse_error(' )') assert_parse_error(' (') def dparse(astr): return parse(astr)[0] def punp(astr): return str(dparse(astr)) def teq(astr): assert punp(astr) == astr, punp(astr) teq('asdf') teq('asdf()') teq('1324') teq('asdf(a b c)') teq('a(1 3 b() c)') teq('[EMAIL PROTECTED]') teq('$x') teq('a(3 $x b($3) x)') assert punp(' as ') == 'as' assert punp(' x ( y 2 ) ') == 'x(y 2)' assert_parse_error('x(y(z)') assert_parse_error('x((y) z)') assert_parse_error('$') assert_parse_error('$()') ### instantiation atree = dparse('a(b($c d)$e)') assert str(atree) == 'a(b($c d) $e)', str(atree) atoma = atom('a') assert_raises(lambda: atree.instantiate({}), KeyError) insttree = atree.instantiate({'c': atoma, 'e': atoma}) assert str(insttree) == 'a(b(a d) a)', str(insttree) insttree = atree.instantiate({'c': atom('b'), 'e': atoma}) assert str(insttree) == 'a(b(b d) a)', str(insttree) ### comparison assert atom('a') == atom('a') assert atom('a') != atom('b') assert integer(1) == integer(1) assert integer(1) < integer(2) assert integer(2) > integer(1) assert variable('a') == variable('a') assert variable('a') != variable('b') assert atom('a') != variable('a') assert variable('a') != atom('a') assert atom('1') != variable('1') assert dparse('x(a b c(d e f()) g)') == dparse('x(a b c(d e f()) g)') assert dparse('x(a b c(d e f()) g)') != 'x(a b c(d e f()) g)' assert dparse('x(a b c(d e f()) g)') != dparse('x()') assert dparse('x(a)') != dparse('x()') assert dparse('x()') != dparse('x(a)') assert dparse('x(a)') == dparse('x(a)') assert dparse('x(a)') != dparse('x(b)') ### unification def sunify(stra, strb): return unify(dparse(stra), dparse(strb)) assert sunify('a', 'a') == {} assert sunify('a', 'b') == None assert sunify('$a', '3') == {'a': integer(3)} assert sunify('a(b($c d) $e)', 'a(b(a d) a)') == { 'c': atom('a'), 'e': atom('a')} assert sunify('a(b($c d) $e)', 'a(b(b d) a)') == { 'c': atom('b'), 'e': atom('a')} assert sunify('hi($x)', 'hi(kragen(sitaker))') == { 'x': dparse('kragen(sitaker)')} assert sunify('a($a $a)', 'a(b b)') == {'a': atom('b')} assert sunify('a($a $a)', 'a(b c)') == None ### rewriting def srew(stra, strb, strc): return str(rewrite(dparse(stra), dparse(strb), dparse(strc))) assert srew('a', 'b', 'c') == 'None' assert srew('a', '$b', 'c') == 'c' assert srew('a', '$b', '$b') == 'a' assert srew('a(b)', '$x', '$x') == 'a(b)' assert srew('a(b)', '$x', 'y($x)') == 'y(a(b))' assert srew('a(b)', 'a($x)', 'c($x)') == 'c(b)' ### repetitive rewriting def rules(alist): def parsepair(apair): a, b = apair return dparse(a), dparse(b) return map(parsepair, alist) def evalu(astr, rules): return str(eagerly_rewrite(dparse(astr), rules)) # "laws of form" boolean logic lofrules = rules([('not(not($x))', '$x'), ('or(1 $x)', '1'), ('or($x 1)', '1'), ('or($x $x)', '$x')]) assert evalu('1', lofrules) == '1' assert evalu('not(1)', lofrules) == 'not(1)' assert evalu('not(not(1))', lofrules) == '1' assert evalu('not(not(not(1)))', lofrules) == 'not(1)' assert evalu('not(not(not(not(1)))))', lofrules) == '1' assert evalu('or(1 not(1))', lofrules) == '1' assert evalu('or(not(1) 1)', lofrules) == '1' assert evalu('or(1 1)', lofrules) == '1' assert evalu('not(or(1 1))', lofrules) == 'not(1)' assert evalu('not(or(not(1) not(1)))', lofrules) == '1' a = cons_string('fibble') b = cons_string('wur') c = restring(append_charlists(a, b)) assert c == restring(a) + restring(b), c test_parser() if __name__ == '__main__': test() if len(sys.argv) > 1: repl().run_interactively(sys.argv[1:]) else: print repl.__doc__ repl().run_interactively([]) And here's 'genre.py', which generates random regular expressions. The commented-out 'ok' lines are ones that don't yet pass. #!/usr/bin/python import random, string, tree, sys, traceback, re ### definitions of regex tree node types class char: def __init__(self, thechar): self.thechar = thechar def render(self): if self.thechar not in string.letters + string.digits: return '\\' + self.thechar else: return self.thechar def __repr__(self): return 'char(%s)' % repr(self.thechar) atom = tree.atom('char') def as_tree(self): return tree.tree([self.atom, tree.atom(self.thechar)]) class kleene: def __init__(self, subre): self.subre = subre def render(self): return '(?:%s)*' % self.subre.render() def __repr__(self): return 'kleene(%s)' % repr(self.subre) atom = tree.atom('kleene') def as_tree(self): return tree.tree([self.atom, self.subre.as_tree()]) class seq: def __init__(self, re1, re2): (self.re1, self.re2) = (re1, re2) def render(self): return '%s%s' % (self.re1.render(), self.re2.render()) def __repr__(self): return 'seq(%s, %s)' % (repr(self.re1), repr(self.re2)) atom = tree.atom('concat') def as_tree(self): return tree.tree([self.atom, self.re1.as_tree(), self.re2.as_tree()]) class choice(seq): def render(self): return '(?:%s|%s)' % (self.re1.render(), self.re2.render()) def __repr__(self): return 'choice(%s, %s)' % (repr(self.re1), repr(self.re2)) atom = tree.atom('altern') ### random generation of regex trees def randchar(): return char(chr(random.randrange(32, 127))) def randkleene(): return kleene(randre()) def randseq(): return seq(randre(), randre()) def randchoice(): return choice(randre(), randre()) def randre(): # we give randchar and randkleene higher probability so this terminates return random.choice([randchar, randchar, randchar, randchar, randchar, randkleene, randkleene, randseq, randchoice])() ### converting regexes to strings def stringify(atree): mapping = tree.unify(tree.parse('_lit($x)')[0], atree) if mapping: # this is too much of a pain to write in the tree-rewriting language val = mapping['x'].value if val in string.letters + string.digits: return val else: return '\\' + val mapping = tree.unify(tree.parse('_str($x)')[0], atree) if mapping: return mapping['x'].value mapping = tree.unify(tree.parse('_concat($a $b)')[0], atree) if mapping: return stringify(mapping['a']) + stringify(mapping['b']) print "couldn't match %s" % atree return atree myrepl = tree.repl() nullfile = file('/dev/null', 'w') myrepl.i_file(nullfile, 're-rewrite.tree') # these are here because the parser can't cope myrepl.define_rewrite_rule(nullfile, tree.atom('start'), tree.tree([tree.atom('_str'), tree.atom('(?:')])) myrepl.define_rewrite_rule(nullfile, tree.atom('end'), tree.tree([tree.atom('_str'), tree.atom(')')])) def deparse(regex): rewritten = tree.eagerly_rewrite(tree.tree([tree.atom('string'), regex.as_tree()]), myrepl.rules[:]) return stringify(rewritten) ### tests def ok(a, b): if a == b: return print "mismatch: %s != %s" % (a, b) def test_stringify_tree(): ok(stringify(tree.tree([tree.atom('_str'), tree.atom('boo')])), 'boo') treein = '_concat(_concat(_concat(_concat(_str(x) _str(*)) _str(|)) _str(y)) _str(*))' atree = tree.parse(treein)[0] ok(str(atree), treein) # make sure we're using it right ok(stringify(atree), 'x*|y*') ok(stringify(tree.parse('_concat(_lit(*) _lit(a))')[0]), '\\*a') def test(): test_stringify_tree() ok(deparse(char('a')), 'a') ok(deparse(char('.')), '\\.') ok(deparse(kleene(choice(char('a'), char('b')))), '(?:a|b)*') ok(deparse(seq(choice(char('a'), char('b')), choice(char('c'), char('d')))), '(?:a|b)(?:c|d)') ok(deparse(kleene(seq(choice(char('a'), char('b')), choice(char('c'), char('d'))))), '(?:(?:a|b)(?:c|d))*') ok(deparse(kleene(char('x'))), 'x*') ok(deparse(seq(char('x'), char('y'))), 'xy') #ok(deparse(seq(char('x'), seq(char('y'), char('z')))), 'xyz') ok(deparse(choice(char('x'), char('y'))), 'x|y') #ok(deparse(choice(char('x'), choice(char('y'), char('z')))), 'x|y|z') #ok(deparse(choice(choice(char('x'), char('y')), char('z'))), 'x|y|z') #ok(deparse(choice(seq(char('x'), char('y')), char('z'))), 'xy|z') #ok(deparse(choice(char('x'), seq(char('y'), char('z')))), 'x|yz') ok(deparse(kleene(kleene(char('x')))), 'x*') # ok(deparse(choice(seq(char('@'), seq(char('o'), char('D'))), # choice(char('~'), char('9')))), # '[EMAIL PROTECTED]|\\~|9') ok(deparse(kleene(seq(kleene(char('d')), char('S')))), '(?:d*S)*') #ok(deparse(kleene(choice(kleene(seq(char('G'), char('+'))), # seq(kleene(char('0')), char('='))))), # '(?:(?:G\\+)*|O*\\=)*') test() if __name__ == '__main__': for ii in range(100): x = randre() xs = deparse(x) print x, xs try: print re.compile(xs) except: traceback.print_exc()