On Wed, May 29, 2013 at 10:20 AM, Matijn Woudt <tijn...@gmail.com> wrote:

> It is possible to write a whole parser as a single regex, being it terribly
> long and complex.

While regular expressions are often used in the lexer--the part that scans
the input stream and breaks it up into meaningful tokens like

    { keyword: "function" }
    { operator: "+" }
    { identifier: "$foo" }

that form the building blocks of the language--they aren't combined into a
single expression. Instead, a lexer generator builds a state machine that
switches which expressions are active based on the preceding tokens and
context. Each expression recognizes a different type of token, and often
these aren't even true regular expressions.
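To make that concrete, here's a minimal sketch (in Python, with a made-up token set loosely echoing the examples above) of how a lexer tries one small regex per token type instead of one giant expression:

```python
import re

# Hypothetical token specification: one small regex per token type.
# A real lexer generator compiles these into a state machine; this
# sketch just tries each pattern in order at the current position.
TOKEN_SPEC = [
    ("keyword",    r"\bfunction\b"),
    ("identifier", r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("operator",   r"[+\-*/=]"),
    ("whitespace", r"\s+"),
]

def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                if name != "whitespace":  # skip insignificant whitespace
                    tokens.append({name: m.group()})
                pos += m.end()
                break
        else:
            raise SyntaxError("unexpected character: %r" % source[pos])
    return tokens

print(tokenize("function $foo + $bar"))
# [{'keyword': 'function'}, {'identifier': '$foo'}, {'operator': '+'}, {'identifier': '$bar'}]
```

A generator like Flex does essentially this, but compiles all the patterns into a single DFA and adds start conditions so the active pattern set can change with context.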

The second stage--combining tokens according to the rules of the grammar--is
more complex and beyond the abilities of regular expressions. There are
plenty of books on the subject, and tools [1] such as Lex, Yacc, Flex, and
Bison to build the pieces. Someone even asked this question on Stack
Overflow [2] a few years ago. And I'm sure if you look you can find a
master's thesis proving that regular expressions cannot handle a
context-free grammar. And finally I leave you with Jeff Atwood's article
about (not) parsing HTML with regex. [3]
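The classic illustration of why the grammar stage needs more than regex is nested parentheses: a true regular expression can't count nesting depth, but a few lines of recursion can. A rough sketch, using a hypothetical grammar (expr -> term ('+' term)*, term -> IDENT | '(' expr ')') over pre-lexed tokens:

```python
# Recursive-descent sketch over an already-tokenized input.
# Each parse function returns the position just past what it consumed.

def parse_expr(tokens, pos=0):
    pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        pos = parse_term(tokens, pos + 1)
    return pos

def parse_term(tokens, pos):
    if pos < len(tokens) and tokens[pos] == "(":
        pos = parse_expr(tokens, pos + 1)  # recursion tracks nesting depth
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return pos + 1
    if pos < len(tokens) and tokens[pos].startswith("$"):
        return pos + 1
    raise SyntaxError("expected identifier or '('")

def is_valid(tokens):
    try:
        return parse_expr(tokens) == len(tokens)
    except SyntaxError:
        return False

print(is_valid(["(", "(", "$a", "+", "$b", ")", "+", "$c", ")"]))  # True
print(is_valid(["(", "$a", "+", "$b"]))                            # False
```

Tools like Yacc and Bison generate table-driven parsers from a grammar instead of hand-written recursion, but the capability they add over the lexer is the same: a stack that remembers how deep you are.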


[1] http://dinosaur.compilertools.net/
