On Sat, Apr 18, 2020 at 4:53 PM Carl Meyer <c...@oddbird.net> wrote:

> The PEP is exciting and is very clearly presented, thank you all for
> the hard work!
>
> Considering the comments in the PEP about the new parser not
> preserving a parse tree or CST, I have some questions about the future
> options for Python language-services tooling which requires a CST in
> order to round-trip and modify Python code. Examples in this space
> include auto-formatters, refactoring tools, linters with autofix, etc.
> Today many such tools (e.g. Black, 2to3) are based on lib2to3. Other
> tools already have their own parser (e.g. LibCST -- which I help
> maintain -- and Jedi both use parso, a fork of pgen2).
>

Right, LibCST is very exciting. Note that AFAIK none of the tools you
mention depend on the old parser module. (Though I'm not denying that there
might be tools depending on it -- that's why we're keeping it until 3.10.)


> 1) 2to3 and lib2to3 are not mentioned in the PEP, but are a documented
> part of the standard library used by some very popular tools, and
> currently depend on pgen2. A quick search of the PEP 617 pull request
> does not suggest that it modifies lib2to3. Will lib2to3 also be
> removed in Python 3.10 along with the old parser? It might be good for
> the PEP to address the future of 2to3 and lib2to3 explicitly.
>

Note that, while there is indeed a docs page about 2to3
<https://docs.python.org/3/library/2to3.html>, the only docs for *lib2to3*
in the standard library reference are a link to the source code and a
single "*Note:* The lib2to3
<https://docs.python.org/3/library/2to3.html?highlight=lib2to3#module-lib2to3>
API should be considered unstable and may change drastically in the future."

Fortunately, in order to support the 2to3 application, lib2to3 doesn't
need to change, because the syntax of Python 2 is no longer changing. :-)
Choosing to remove 2to3 is an independent decision. And lib2to3 does not
depend in any way on the old parser module. (It doesn't even use the
standard tokenize module, but incorporates its own version that is slightly
tweaked to support Python 2.)
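
For illustration, here's a minimal (untested) sketch using lib2to3's own
driver -- note that nothing in it goes anywhere near the parser module, and
the round trip is lossless, comments included:

    from lib2to3 import pygram, pytree
    from lib2to3.pgen2 import driver

    # lib2to3 bundles its own pgen2 and its own tweaked tokenizer,
    # so none of this touches the old parser module.
    d = driver.Driver(pygram.python_grammar, convert=pytree.convert)
    src = "x = 1  # this comment survives\n"
    tree = d.parse_string(src)
    assert str(tree) == src  # lossless round trip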


> 2) As these tools make the necessary adaptations to support Python
> 3.10, which may no longer be parsable with an LL(1) parser, will we be
> able to leverage any part of pegen to construct a lossless Python CST,
> or will we likely need to fork pegen outside of CPython or build a
> wholly new parser? It would be neat if an alternate grammar could be
> written in pegen that has access to all tokens (including NL and
> COMMENT) for this purpose; that would save a lot of code duplication
> and potential for inconsistency. I haven't had a chance to fully read
> through the PEP 617 pull request, but it looks like its tokenizer
> wrapper currently discards NL and COMMENT. I understand this is a
> distinct use case with distinct needs and I'm not suggesting that we
> should make significant sacrifices in the performance or
> maintainability of pegen to serve it, but if it's possible to enable
> some sharing by making API choices now before it's merged, that seems
> worth considering.
>

You've mentioned a few different tools that already use different
technologies: LibCST depends on parso, which has a fork of pgen2, while
lib2to3 has the original pgen2. I wonder if this would be an opportunity to
move such parsing support out of the standard library completely. There are
already two versions of pegen, but neither is in the standard library:
there is the original pegen <https://github.com/gvanrossum/pegen/> repo,
which is where things started, and there is a fork of that code in the CPython
Tools
<https://github.com/we-like-parsers/cpython/tree/pegen/Tools/peg_generator>
directory (not yet in the upstream repo, but see PR 19503
<https://github.com/python/cpython/pull/19503>).

The pegen tool has two generators, one generating C code and one generating
Python code. I think that the C generator is really only relevant for
CPython itself: it relies on the builtin tokenizer (the one written in C,
not the stdlib tokenize.py) and the generated C code depends on many
internal APIs. In fact the C generator in the original pegen repo doesn't
work with Python 3.9 because those internal APIs are no longer exported.
(It also doesn't work with Python 3.7 or older because it makes critical
use of the walrus operator. :-) Also, once we started getting serious about
replacing the old parser, we worked exclusively on the C generator in the
CPython Tools directory, so the version in the original pegen repo is
lagging quite a bit behind (as is the Python grammar in that repo). But as
I said you're not gonna need it.
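
(For context, in CPython 3.9 the C-generated parser isn't exposed as a
separate API at all; modulo the -X oldparser escape hatch, it is simply
what compile() and ast.parse() call into. So, assuming a 3.9 interpreter:

    import ast

    # The walrus operator parses fine; under the hood this goes
    # through the PEG-generated C parser, not a separate module.
    tree = ast.parse("if (n := 10) > 5: print(n)")
    print(ast.dump(tree))

No new module, no new entry point.)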

On the other hand, the Python generator is designed to be flexible, and
while it defaults to using the stdlib tokenize.py tokenizer, you can easily
hook up your own. Putting this version in the stdlib would be a mistake,
because the code is pretty immature; it is really waiting for a good home,
and if parso or LibCST were to decide to incorporate a fork of it and
develop it into a high-quality parser generator for Python-like languages,
that would be great. I wouldn't worry much about the duplication of code --
the Python generator in the CPython Tools directory is only used for one
purpose, and that is to produce the meta-parser (the parser for grammars)
from the meta-grammar. And I would happily stop developing the original
pegen once a fork is being developed.
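
(Re the NL and COMMENT tokens Carl mentioned: the stdlib tokenize module
already emits them, so a tokenizer hooked into the Python generator could
keep them around for a lossless CST. Purely stdlib, nothing pegen-specific:

    import io
    import tokenize

    src = "x = 1  # a comment\n\ny = 2\n"
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        # COMMENT and NL appear alongside the regular tokens --
        # exactly the trivia a lossless CST needs to retain.
        print(tokenize.tok_name[tok.type], repr(tok.string))

The current wrapper discards them, but that's a choice, not a limitation of
the tokenizer.)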

Another option would be to just improve the Python generator in the
original pegen repo to satisfy the needs of parso and LibCST. Reading the
blurb for parso, it looks like it really just parses *Python*, which is less
ambitious than pegen. But it also seems to support error recovery, which
currently isn't part of pegen. (However, we've thought
<https://github.com/we-like-parsers/cpython/issues/84> about it.) Anyway,
regardless of how exactly this is structured, someone will probably have to
take over development and support. Pegen started out as a hobby project to
educate myself about PEG parsers. Then I wrote a bunch of blog posts about
my approach, and finally I started working on using it to generate a
replacement for the old pgen-based parser. But I never found the time to
make it an appealing parser generator tool for other languages, even though
that was on my mind as a future possibility. It will take some time to
disentangle all this, and I'd be happy to help someone who wants to work on
this.

Finally, I should recognize the important influence of my mentor in PEG
parsing, Juancarlo AƱez <https://github.com/apalala/>. Without his early
encouragement and advice I would never have been able to travel this road.

-- 
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>