[Haskell-cafe] Regular Expressions, was Re: Interest in helping w/ Haskell standard

Conor McBride Sat, 15 Oct 2005 05:49:48 -0700

Hi folks,

Inspired by Ralf's post, I thought I'd just GADTize a dependently typedprogram I wrote in 2001.


Wolfgang Jeltsch wrote:

Now lets consider using an algebraic datatype for regexps:

        data RegExp
                = Empty | Single Char | RegExp :+: RegExp | RegExp :|: RegExpt 
| Iter RegExp
Manipulating regular expressions now becomes easy and safe – you are just notable to create "syntactically incorrect regular expressions" since duringruntime you don't deal with syntax at all.


A fancier variation on the same theme...

> data RegExp :: * -> * -> * where
>   Zero   :: RegExp tok Empty
>   One    :: RegExp tok ()
>   Check  :: (tok -> Bool) -> RegExp tok tok
>   Plus   :: RegExp tok a -> RegExp tok b -> RegExp tok (Either a b)
>   Mult   :: RegExp tok a -> RegExp tok b -> RegExp tok (a, b)
>   Star   :: RegExp tok a -> RegExp tok [a]

> data Empty

The intuition is that a RegExp tok output is a regular expressionexplaining how to parse a list of tok as an output. Here, Zero is theregexp which does not accept anything, One accepts just the emptystring, Plus is choice and Mult is sequential composition; Check letsyou decide whether you like a single token.

Regular expressions may be seen as an extended language of polynomialswith tokens for variables; this parser works by repeated application ofthe remainder theorem.


> parse :: RegExp tok x -> [tok] -> Maybe x
> parse r []       = empty r
> parse r (t : ts) = case divide t r of
>   Div q f -> return f `ap` parse q ts

Example

*RegExp> parse (Star (Mult (Star (Check (== 'a'))) (Star (Check (=='b'))))) "abaabaaabbbb"

Just [("a","b"),("aa","b"),("aaa","bbbb")]

The 'remainder' explains if a regular expression accepts the emptystring, and if so, how. The Star case is a convenientunderapproximation, ruling out repeated empty values.


> empty :: RegExp tok a -> Maybe a
> empty Zero         = mzero
> empty One          = return ()
> empty (Check _)    = mzero
> empty (Plus r1 r2) = (return Left  `ap` empty r1) `mplus`
>                      (return Right `ap` empty r2)
> empty (Mult r1 r2) = return (,) `ap` empty r1 `ap` empty r2
> empty (Star _)     = return []

The 'quotient' explains how to parse the tail of the list, and how torecover the meaning of the whole list from the meaning of the tail.


> data Division tok x = forall y. Div (RegExp tok y) (y -> x)

Here's how it's done. I didn't expect to need scoped type variables, butI did...


> divide :: tok -> RegExp tok x -> Division tok x
> divide t Zero                  = Div Zero naughtE
> divide t One                   = Div Zero naughtE
> divide t (Check p) | p t       = Div One (const t)
>                    | otherwise = Div Zero naughtE
> divide t (Plus (r1 :: RegExp tok a) (r2 :: RegExp tok b)) =
>   case (divide t r1, divide t r2) of
>     (Div (q1 :: RegExp tok a') (f1 :: a' -> a),
>       Div (q2 :: RegExp tok b') (f2 :: b' -> b)) ->
>       Div (Plus q1 q2) (f1 +++ f2)
> divide t (Mult r1 r2) = case (empty r1, divide t r1, divide t r2) of
>   (Nothing, Div q1 f1, _) -> Div (Mult q1 r2) (f1 *** id)
>   (Just x1, Div q1 f1, Div q2 f2) ->
>     Div (Plus (Mult q1 r2) q2) (either (f1 *** id) (((,) x1) . f2))
> divide t (Star r) = case (divide t r) of
>   Div q f -> Div (Mult q (Star r)) (\ (y, xs) -> (f y : xs))

Bureaucracy.

> (***) :: (a -> b) -> (c -> d) -> (a, c) -> (b, d)
> (f *** g) (a, c) = (f a, g c)

> (+++) :: (a -> b) -> (c -> d) -> Either a c -> Either b d
> (f +++ g) (Left a)  = Left  (f a)
> (f +++ g) (Right c) = Right (g c)

> naughtE :: Empty -> x
> naughtE = undefined

It's not the most efficient parser in the world (doing some algebraicsimplification on the fly wouldn't hurt), but it shows the sort of stuffyou can do.


Have fun

Conor
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

[Haskell-cafe] Regular Expressions, was Re: Interest in helping w/ Haskell standard

Reply via email to