That reminds me of the undocumented re.Scanner -- which is meant to do
exactly this.  Wouldn't it be about time to document or remove it?
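
For comparison, here is a rough sketch of what the same lexicon looks like
written against re.Scanner.  Since it is undocumented, the (pattern, action)
lexicon and the scan() interface below come from reading Lib/re.py rather
than from any spec, so treat the details as illustrative:

    import re

    scanner = re.Scanner([
        (r'\d+(?:\.\d+)?', lambda s, tok: ('NUMBER', tok)),
        (r':=',            lambda s, tok: ('ASSIGN', tok)),
        (r';',             lambda s, tok: ('END', tok)),
        (r'[A-Za-z]+',     lambda s, tok: ('ID', tok)),
        (r'[+*/-]',        lambda s, tok: ('OP', tok)),
        (r'\s+',           None),   # None means: match and discard
    ])

    tokens, remainder = scanner.scan('total := total + price * quantity;')
    # tokens holds pairs like ('ID', 'total'), ('ASSIGN', ':='), ...
    # remainder is '' here; any trailing text the lexicon cannot match
    # is returned there instead of raising an error

It does none of the line/column bookkeeping that the new example adds by
hand, but the pattern-combining machinery underneath is the same idea.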

Georg

Am 16.09.2010 14:02, schrieb raymond.hettinger:
> Author: raymond.hettinger
> Date: Thu Sep 16 14:02:17 2010
> New Revision: 84847
> 
> Log:
> Add tokenizer example to regex docs.
> 
> Modified:
>    python/branches/py3k/Doc/library/re.rst
> 
> Modified: python/branches/py3k/Doc/library/re.rst
> ==============================================================================
> --- python/branches/py3k/Doc/library/re.rst   (original)
> +++ python/branches/py3k/Doc/library/re.rst   Thu Sep 16 14:02:17 2010
> @@ -1282,3 +1282,66 @@
>     <_sre.SRE_Match object at ...>
>     >>> re.match("\\\\", r"\\")
>     <_sre.SRE_Match object at ...>
> +
> +
> +Writing a Tokenizer
> +^^^^^^^^^^^^^^^^^^^
> +
> +A `tokenizer or scanner <http://en.wikipedia.org/wiki/Lexical_analysis>`_
> +analyzes a string to categorize groups of characters.  This is a useful first
> +step in writing a compiler or interpreter.
> +
> +The text categories are specified with regular expressions.  The technique is
> +to combine those into a single master regular expression and to loop over
> +successive matches::
> +
> +    import collections
> +    import re
> +
> +    Token = collections.namedtuple('Token', 'typ value line column')
> +
> +    def tokenize(s):
> +        tok_spec = [
> +            ('NUMBER', r'\d+(\.\d+)?'),  # Integer or decimal number
> +            ('ASSIGN', r':='),          # Assignment operator
> +            ('END', ';'),               # Statement terminator
> +            ('ID', r'[A-Za-z]+'),       # Identifiers
> +            ('OP', r'[+*\/\-]'),        # Arithmetic operators
> +            ('NEWLINE', r'\n'),         # Line endings
> +            ('SKIP', r'[ \t]'),         # Skip over spaces and tabs
> +        ]
> +        tok_re = '|'.join('(?P<%s>%s)' % pair for pair in tok_spec)
> +        gettok = re.compile(tok_re).match
> +        line = 1
> +        pos = line_start = 0
> +        mo = gettok(s)
> +        while mo is not None:
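> +            # lastgroup is the name of the alternative that matched,
> +            # i.e. the token type from tok_spec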
> +            typ = mo.lastgroup
> +            if typ == 'NEWLINE':
> +                line_start = mo.end()
> +                line += 1
> +            elif typ != 'SKIP':
> +                yield Token(typ, mo.group(typ), line, mo.start()-line_start)
> +            pos = mo.end()
> +            mo = gettok(s, pos)
> +        if pos != len(s):
> +            raise RuntimeError(
> +                'Unexpected character %r on line %d' % (s[pos], line))
> +
> +    >>> statements = '''\
> +        total := total + price * quantity;
> +        tax := price * 0.05;
> +    '''
> +    >>> for token in tokenize(statements):
> +    ...     print(token)
> +    ...
> +    Token(typ='ID', value='total', line=1, column=8)
> +    Token(typ='ASSIGN', value=':=', line=1, column=14)
> +    Token(typ='ID', value='total', line=1, column=17)
> +    Token(typ='OP', value='+', line=1, column=23)
> +    Token(typ='ID', value='price', line=1, column=25)
> +    Token(typ='OP', value='*', line=1, column=31)
> +    Token(typ='ID', value='quantity', line=1, column=33)
> +    Token(typ='END', value=';', line=1, column=41)
> +    Token(typ='ID', value='tax', line=2, column=8)
> +    Token(typ='ASSIGN', value=':=', line=2, column=12)
> +    Token(typ='ID', value='price', line=2, column=15)
> +    Token(typ='OP', value='*', line=2, column=21)
> +    Token(typ='NUMBER', value='0.05', line=2, column=23)
> +    Token(typ='END', value=';', line=2, column=27)


-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.
