Sebastian Fischer schrieb:

On Jul 29, 2010, at 12:47 AM, Benedikt Huber wrote:

Taking a quick look at the PyPy blog post on JIT code generation for regular expressions, I thought it would be fun to implement a generator using the excellent LLVM bindings for haskell.

Interesting. Would you mind elaborating on your code? Maybe even write a blog about how it works?
Hi,
so the current implementation (a little bit cleaned up at
  http://github.com/visq/llvm-regexp
), works as follows:

It generates an LLVM function which matches a bytestring against a given regular expression. The state of the 'automaton' consists of one bit for each leaf of the regexp AST, corresponding to the marks in the article's figure. It then generates straight-line code updating the state given an input character (generateRegexpCode and matchCharSet in Text.RegExp.LLVM).

For example, for matching '.a' , the generated code looks like this in a simplified pseudo code notation:

> ... next ~ first character matched
> ... ch   ~ input character
>   next1 = bitmask[0]              -- was '.' marked ?
>   bitmask[0] = next               -- mark '.' if initial
>   next2 = bitmask[1]              -- was 'a' marked ?
> bitmask[1] = ch == 'a' && next1 -- mark 'a' if '.' was marked
                                    -- and input is 'a'

At the end of the string, code is generated to check whether the automaton is in a final state (genFinalStateCheck). In the example above, this corresponds to

>   final = bitmask[1]

The rest is either LLVM boilerplate (regexMatcher) or adaptions from weighed-regexp. Additionally, I've adapted the Parser.y from weighted-regexp, but some things (e.g. character classes like \w) are not implemented.

Generating one big basic block for the whole regular expressions does not work well when there are more than a few thousand nodes in the AST. Using functions for large regular expressions and loop for repititions would be one solution. The other problem is that the matcher only operates on Word8s at the moment.

Trying it out (if you have llvm+haskell bindings) is also easy (do not cabal-easy):

> $ git clone g...@github.com:visq/llvm-regexp.git && cd llvm-regexp
> $ make
> $ ./Grep '.*spotlight.*' < /etc/passwd
< Line 45 matches

cheers, benedikt
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply via email to