Currently, when you add a new token, you need to change several files:

* Include/token.h
* _PyParser_TokenNames in Parser/tokenizer.c
* PyToken_OneChar(), PyToken_TwoChars() or PyToken_ThreeChars() in Parser/tokenizer.c
* Lib/token.py (generated from Include/token.h)
* EXACT_TOKEN_TYPES in Lib/tokenize.py
* Operator, Bracket or Special in Lib/tokenize.py
* Doc/library/token.rst

It is possible to generate all of this information from a single source. The patch proposed in [1] uses Lib/token.py as the initial source. But perhaps Lib/token.py itself should be generated from some file in a generic format? Some of the information can be derived from Grammar/Grammar, but not all of it. A mapping between token strings ('(' or '>=') and names (LPAR, GREATEREQUAL) is also needed. Could this be added to Grammar/Grammar or to a new file?
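
For illustration, a minimal sketch of what such a generator could look
like, assuming a hypothetical single-source file with one "NAME 'string'"
pair per line (e.g. LPAR '('); the file name, format and function names
below are invented for this sketch, not taken from the patch in [1]:

    def parse_token_definitions(path):
        # Each non-blank, non-comment line is "NAME" or "NAME 'string'",
        # e.g.  LPAR '('  or  GREATEREQUAL '>='  or just  NEWLINE.
        tokens = []
        with open(path) as f:
            for line in f:
                line = line.partition('#')[0].strip()
                if not line:
                    continue
                name, _, string = line.partition(' ')
                tokens.append((name, string.strip().strip("'") or None))
        return tokens

    def gen_token_h(tokens):
        # The #define block for Include/token.h; Lib/token.py and
        # _PyParser_TokenNames could be emitted the same way.
        return '\n'.join('#define %-15s %d' % (name, number)
                         for number, (name, _) in enumerate(tokens))

    def gen_exact_token_types(tokens):
        # The EXACT_TOKEN_TYPES mapping for Lib/tokenize.py.
        return 'EXACT_TOKEN_TYPES = {\n%s,\n}' % ',\n'.join(
            "    %r: %s" % (string, name)
            for name, string in tokens if string is not None)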

There is a related problem: the tokenize module uses three additional tokens (COMMENT, NL and ENCODING) not used by the C tokenizer, and it modifies the content of the token module after importing it, which is not good. [2] One solution is to make a copy of tok_name in tokenize before modifying it, but this doesn't work, because third-party code looks tokenize's constants up in token.tok_name. Another solution is to add the tokenize-specific constants to the token module itself. Is it good to expose, in the token module, tokens that are not used by the C tokenizer?
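
To make the problem concrete, this is roughly (and simplified) what
tokenize does today:

    from token import tok_name, N_TOKENS

    # tok_name is the very dict from the token module, not a copy, so
    # these assignments mutate token.tok_name for everybody:
    COMMENT = N_TOKENS
    tok_name[COMMENT] = 'COMMENT'
    NL = N_TOKENS + 1
    tok_name[NL] = 'NL'
    ENCODING = N_TOKENS + 2
    tok_name[ENCODING] = 'ENCODING'

    # Copying first (tok_name = dict(tok_name)) would leave the token
    # module untouched, but then third-party code that does
    # token.tok_name[tokenize.COMMENT] stops working.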

Non-terminal symbols are already generated automatically: Lib/symbol.py from Include/graminit.h, and Include/graminit.h and Python/graminit.c from Grammar/Grammar by Parser/pgen. Is it worth generating Lib/symbol.py with pgen too? Could pgen be implemented in Python?
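
For comparison, the Include/graminit.h -> Lib/symbol.py step is
essentially a small text transformation. A minimal sketch (not the
actual regeneration code shipped in Lib/symbol.py) might look like:

    import re

    DEFINE_RE = re.compile(r'#define\s+(\w+)\s+(\d+)')

    def gen_symbol_constants(graminit_h='Include/graminit.h'):
        # Turn "#define single_input 256" into "single_input = 256";
        # Lib/symbol.py then builds its reverse sym_name mapping from
        # these module-level constants.
        with open(graminit_h) as f:
            return '\n'.join('%s = %s' % (m.group(1), m.group(2))
                             for m in map(DEFINE_RE.match, f) if m)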

See also the similar issue for opcodes. [3]

[1] https://bugs.python.org/issue30455
[2] https://bugs.python.org/issue25324
[3] https://bugs.python.org/issue17861
