[Python-Dev] Re: NEWLINE sentinel behavior in CPython's PEG grammar

David J W Thu, 03 Nov 2022 11:09:01 -0700

Following up, Pablo spotted my problem with the mixup of NL & NEWLINE
tokens.  I was using tokenize.py in CPython's stdlib with a simple Python
script to build ridiculously strict unit tests.
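
For reference, the NL/NEWLINE distinction can be seen directly with the
stdlib tokenize module. This is only a minimal sketch: the sample source and
the helper names (token_names, without_nl) are made up for illustration, and
the filter is a rough analogue of swallowing NL, not a copy of any project's
code.

```python
import io
import tokenize

# Hypothetical sample: a blank line, a comment-only line, and a
# parenthesized expression that spans several physical lines.
SOURCE = "x = 1\n\n# comment\ny = (\n    2\n)\n"


def token_names(source):
    """Return the token type names that tokenize emits for `source`."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]


def without_nl(names):
    """Drop NL tokens, which carry no syntactic meaning; keep NEWLINE."""
    return [name for name in names if name != "NL"]


names = token_names(SOURCE)
print(names)              # NL shows up for blank/comment lines and inside parens
print(without_nl(names))  # NEWLINE still terminates each logical line
```

NEWLINE ends a logical line, while NL marks a line break that does not end
one, which is why a test fixture built from this output has to treat the two
differently.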


My solution to that problem was originally to figure out how to access
CPython's internal C tokenizer, but someone else did that in 3.11.  The
parser is passing basic tests, but I need to redo all of the tests for my
tokenizer as they are flawed, and also do some major housekeeping to clean
up all the warnings and TODOs sprinkled throughout my code base.

To hopefully avoid future problems, is Lib/symtable.py trustworthy as a way
of building unit tests when I start implementing my own symbols graph/table?
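
A minimal sketch of how expected values for such tests might be pulled out
of the symtable module. The sample source and the describe_function helper
are hypothetical; only symtable.symtable(), get_children(), get_name(),
get_symbols() and the Symbol.is_* methods are stdlib calls.

```python
import symtable

# Hypothetical sample module with one global and one function.
SOURCE = """
counter = 0

def bump(step):
    global counter
    total = counter + step
    counter = total
    return total
"""


def describe_function(source, func_name):
    """Collect a few per-symbol facts for the named top-level function."""
    top = symtable.symtable(source, "<example>", "exec")
    func = next(
        child for child in top.get_children() if child.get_name() == func_name
    )
    return {
        sym.get_name(): {
            "parameter": sym.is_parameter(),
            "global": sym.is_global(),
            "local": sym.is_local(),
        }
        for sym in func.get_symbols()
    }


facts = describe_function(SOURCE, "bump")
for name, flags in sorted(facts.items()):
    print(name, flags)
```

The resulting dictionary could be dumped to a fixture file and compared
against what another implementation computes for the same source. Some of
these flags may be reported differently across CPython versions, so it is
worth pinning the interpreter version used to generate fixtures.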


Thanks,
    David



On Wed, Oct 26, 2022 at 11:57 PM Matthieu Dartiailh <[email protected]>
wrote:

> If you look at pegen, which uses the stdlib tokenizer as input, you will
> see that the object used to implement memoization on top of a token stream
> simply swallows NL (
> https://github.com/we-like-parsers/pegen/blob/main/src/pegen/tokenizer.py#L49).
> This is safe since NL has no syntactic meaning; only NEWLINE does.
>
> Best
>
> Matthieu
>
> On Thu, Oct 27, 2022, 01:59 Matthias Görgens <[email protected]>
> wrote:
>
>> Hi David,
>>
>> Could you share what you have so far, perhaps on GitHub or so? That way
>> it's easier to diagnose your problems. I'm reasonably familiar with Rust.
>>
>> Perhaps also add a minimal crashing example?
>>
>> Cheers,
>> Matthias.
>>
>> On Thu, 27 Oct 2022, 04:52 David J W, <[email protected]> wrote:
>>
>>> Pablo,
>>>     Nl and Newline are tokens but I am interested in NEWLINE's behavior
>>> in the Python grammar, note the casing.
>>>
>>> For example in simple_stmts @
>>> https://github.com/python/cpython/blob/main/Grammar/python.gram#L107
>>>
>>> Is that NEWLINE some sort of built-in rule in the grammar?  In my
>>> project I am running into problems where the parser crashes any time there
>>> is some double like NL & N or Newline & NL, but I want to nail down
>>> NEWLINE's behavior in CPython's PEG grammar.
>>>
>>> On Wed, Oct 26, 2022 at 12:51 PM Pablo Galindo Salgado <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am not sure I understand exactly what you are asking, but NEWLINE is a
>>>> token, not a parser rule. What decides when NEWLINE is emitted is the lexer,
>>>> which has nothing to do with PEG. Normally PEG parsers also act as
>>>> tokenizers, but the one in CPython does not.
>>>>
>>>> Also notice that CPython’s parser uses a version of the tokenizer
>>>> written in C that doesn’t share code with the exposed version. You will
>>>> find that the tokenize module in the standard library actually behaves
>>>> differently regarding what tokens are emitted for new lines and
>>>> indentation.
>>>>
>>>> The only way to be sure is to check the code, unfortunately.
>>>>
>>>> Hope this helps.
>>>>
>>>> Regards from rainy London,
>>>> Pablo Galindo Salgado
>>>>
>>>> > On 26 Oct 2022, at 19:12, David J W <[email protected]> wrote:
>>>> >
>>>> > 
>>>> > I am writing a Rust version of Python for fun and I am at the parser
>>>> stage of development.
>>>> >
>>>> > I copied and modified a PEG grammar ruleset from another open source
>>>> project and I've already noticed some problems (e.g. Newline vs NL) with how
>>>> they transcribed things.
>>>> >
>>>> > I suspect that CPython's grammar NEWLINE is a builtin rule for
>>>> the parser that is something like `(Newline+ | NL+ ) {NOP}` but wanted to
>>>> sanity check if that is right before I figure out how to hack in a NEWLINE
>>>> rule and update my grammar ruleset.
>>>> > _______________________________________________
>>>> > Python-Dev mailing list -- [email protected]
>>>> > To unsubscribe send an email to [email protected]
>>>> > https://mail.python.org/mailman3/lists/python-dev.python.org/
>>>> > Message archived at
>>>> https://mail.python.org/archives/list/[email protected]/message/NMCMEDMEBKATYKRNZLX2NDGFOB5UHQ5A/
>>>> > Code of Conduct: http://python.org/psf/codeofconduct/
>>>>
>>>
>>
>

