Hi Benoît,

For metamath-knife, I wrote that stage myself. I've called it "statement
parsing" and everything is self-contained in `grammar.rs`. I basically
started with a naive Moore machine, and then modified it to handle the
missing cases and cover the family of syntaxes used in set.mm. I can
explain in more detail how it works if you need.

For mmj2 I think Mel O'Cat wrote it. It is an Earley parser, and you can
find it in `verify/EarleyParser.java`.

Both do two passes: a first pass where the grammar is constructed based
on the syntax axioms, and a second where each statement is parsed to
build parse trees (or ASTs, abstract syntax trees). So that's roughly
the "parser generator" model you mention, even though no code is
generated as in Yacc. It might be possible to do everything in a
single pass, since Metamath only refers to previous definitions.
However, in metamath-knife the "parse statement" pass can be run in
parallel threads which all need access to the grammar. While several
threads can share read-only access, that would not work with
read-write (mutable) access.
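
To illustrate that sharing model, here is a minimal Rust sketch (not the
actual metamath-knife code; `Grammar`, `ParseTree` and `parse_statement`
are hypothetical names):

    use std::sync::Arc;
    use std::thread;

    // Hypothetical stand-ins for the real grammar and parse-tree types.
    struct Grammar { /* productions built from the syntax axioms */ }
    struct ParseTree;

    fn parse_statement(_grammar: &Grammar, _stmt: &str) -> ParseTree {
        // Second pass: parse one statement against the read-only grammar.
        ParseTree
    }

    fn parse_all(grammar: Grammar, statements: Vec<String>) -> Vec<ParseTree> {
        // The grammar is built once (first pass), then shared read-only:
        // Arc gives every worker thread an immutable handle to it.
        let grammar = Arc::new(grammar);
        let mut handles = Vec::new();
        for chunk in statements.chunks(64).map(|c| c.to_vec()) {
            let grammar = Arc::clone(&grammar);
            handles.push(thread::spawn(move || {
                chunk.iter()
                     .map(|s| parse_statement(&grammar, s))
                     .collect::<Vec<_>>()
            }));
        }
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    }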

The grammar building phase in metamath-knife is blazingly fast: it runs
in 4-6 ms on my 7-year-old laptop. The reason is that it only cares about
syntactic axioms, and there are only a few hundred of them. Statement
parsing then takes around 600 ms on my laptop. Memory for every parse
tree needs to be allocated, and then possibly grown, so I'm sure there
is room for improvement here.
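
(Purely as an illustration of the kind of improvement I mean, a sketch
with a hypothetical flattened node type, not the actual metamath-knife
representation:)

    // Hypothetical flattened parse-tree node.
    struct Node { label: u32, first_child: u32, next_sibling: u32 }

    fn new_tree_storage(expected_nodes: usize) -> Vec<Node> {
        // Reserving an estimated capacity up front avoids the repeated
        // grow-and-copy that happens when the storage is extended node
        // by node.
        Vec::with_capacity(expected_nodes)
    }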

Glauco said he was using a third-party Earley parser, `nearley.js`. If it
can cover set.mm, that's probably also your best option, and in
hindsight, I should probably have done the same in Rust!

BR,
Thierry

On 31/05/2022 06:19, Benoit wrote:
Thanks Mario, very clear and useful!  I'll do that at parsing, as you
advise.  Do you happen to remember in which file it is done in mmj2?
I skimmed through some files at
https://github.com/digama0/mmj2/tree/master/src/mmj but got lost in
the huge codebase.

I have a second question concerning the "second-stage parsing" that
mmj2 (and metamath-knife?) do, but not metamath.c.  Namely, this is
the parsing of the math expressions themselves.  Do mmj2 and
metamath-knife code their own "parser generators" like Yacc?  As I
see it, the program would look for syntactic axioms (i.e., label $a w
..... $.  where w has *already* appeared in an $f statement), and from
each of them it constructs a production rule; for instance, $a wff (
ph /\ ps ) $. yields the rule wff -> '(' wff '/\' wff ')'.  This is
described in your "Models for Metamath".  Then, you hope that your CFG
is not too complex and you give these rules (all at once at the end
of parsing, or progressively as you read the file?) to a Yacc-like
program, either external or one that you built (as is the case for
mmj2 and metamath-knife?).  This would be pretty impressive.  Can you
point me to the Rust file or function that would do that part?
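
(For concreteness, here is a minimal Rust sketch of that rule extraction;
all names are hypothetical and this is not the actual mmj2 or
metamath-knife code. It assumes the variable-to-typecode map from the $f
statements is already known:)

    use std::collections::HashMap;

    // A grammar symbol: either a nonterminal (a typecode such as "wff")
    // or a terminal constant (such as "(" or "/\").
    #[derive(Debug)]
    enum Sym { NonTerminal(String), Terminal(String) }

    // A production rule: lhs -> rhs, e.g. wff -> '(' wff '/\' wff ')'.
    #[derive(Debug)]
    struct Rule { lhs: String, rhs: Vec<Sym> }

    // Build a rule from a syntax axiom "$a <typecode> <symbols...> $.",
    // given the variable -> typecode map collected from $f statements.
    fn rule_from_syntax_axiom(typecode: &str, symbols: &[&str],
                              var_typecodes: &HashMap<&str, &str>) -> Rule {
        let rhs = symbols.iter().map(|s| match var_typecodes.get(s) {
            // A variable is replaced by its typecode (a nonterminal).
            Some(tc) => Sym::NonTerminal(tc.to_string()),
            // A constant is kept literally (a terminal).
            None => Sym::Terminal(s.to_string()),
        }).collect();
        Rule { lhs: typecode.to_string(), rhs }
    }

    fn main() {
        // $f wff ph $.  and  $f wff ps $.
        let vars = HashMap::from([("ph", "wff"), ("ps", "wff")]);
        // $a wff ( ph /\ ps ) $.  =>  wff -> '(' wff '/\' wff ')'
        let rule = rule_from_syntax_axiom("wff",
            &["(", "ph", "/\\", "ps", ")"], &vars);
        println!("{:?}", rule);
    }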

Thanks,
Benoît


On Monday, May 30, 2022 at 7:59:58 AM UTC+2 [email protected] wrote:

    On Sun, May 29, 2022 at 11:24 AM Benoit <[email protected]> wrote:

        I am writing a lexer and parser for Metamath databases, and I
        do not know if it's better to check "string compliance" (e.g.,
        questions like: is the string "az%e" allowed to be a label?)
        at lexing or at parsing:


    I think it is preferable to do the work during parsing.

        At parsing: the lexer (after skipping whitespace and comments)
        returns tokens which are the Metamath keywords as well as
        Ident(string) for anything else.  Then, the parser checks
        whether Ident(string) satisfies the rules; e.g., when the
        lexer passes it the sequence '$[', Ident(file), '$]', it
        calls a function is_valid_filename(file) which checks whether
        'file' contains characters forbidden in a filename, like '%'.
        This is not entirely satisfactory since there is some
        "rereading" of the characters of the string 'file' (once at
        lexing and once at parsing).


    The rereading is only necessary for labels (in declaration
    position), which are a relatively small proportion of tokens.
    Furthermore, it is only a "check" and does not need to create a
    new string: once you already know what class of token you expect
    you can check the rules for that token or list of tokens all at once.
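
    (A sketch of such an in-place check, with a hypothetical helper name,
    not the actual metamath-knife code; the token's bytes are inspected
    where they are and no new string is built:)

        /// Valid Metamath label characters: letters, digits, '-', '_', '.'.
        fn is_valid_label(token: &[u8]) -> bool {
            !token.is_empty()
                && token.iter().all(|&b| b.is_ascii_alphanumeric()
                    || b == b'-' || b == b'_' || b == b'.')
        }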

        At lexing: one can tell if a string of characters is:
        * a Metamath keyword;
        * something which can be (and only be):
           * a math symbol;
           * a math symbol or label or filename;
           * a math symbol or label or filename or complete compressed
        proof block;
           * a math symbol or compressed proof block (complete or
        incomplete).
        So the tokens given by the lexer are of one of these five
        kinds.  This ensures "unique reading" but is probably more
        cumbersome (even if not directly coded from an automaton, the
        fact that the automaton has more states will be reflected in
        the code).
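
        (For illustration, such a classification could be expressed as a
        small Rust enum; the names below are hypothetical:)

            // Byte range of a token in the source buffer.
            struct Span { start: usize, len: usize }

            // The five token kinds such a classifying lexer would return.
            enum Token {
                Keyword(Span),                 // e.g. $c, $p, $(, $[, ...
                MathSymbol(Span),              // can only be a math symbol
                SymbolOrLabelOrFilename(Span),
                SymbolOrLabelOrFilenameOrCompressedProof(Span),
                SymbolOrCompressedProof(Span), // complete or incomplete block
            }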


    It will most likely slow the lexer down as well to have more
    classes, and you have to eliminate most of these classes most of
    the time.

        Do you have any advice?  How is it done in metamath.c and in
        Rust?  (I'm not very proficient in these languages.)  In
        mmverify.py, very few checks are done, so the problem does
        not arise.


    The relevant metamath.c code is:
    https://github.com/metamath/metamath-exe/blob/ccf08b62dec87de81576c96f1e8f09a96f796589/src/mmpars.c#L456-L460
    It looks like it first parses all the statements without
    validation, meaning that the token preceding a $p gets shoved into
    an array whatever it is, and then there is a second pass which
    goes over all these tokens (which are actually pointers into the
    original text), re-parses up to the space to get the length, and
    re-checks all the characters in between to ensure they are valid
    label characters. I'm not sure I would recommend this method,
    because of the late validation, not to mention the triple scan due
    to not storing token lengths (which is a classic C-ism).
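
    (For contrast, a quick Rust sketch of the alternative hinted at here:
    record each token as an (offset, length) span when it is first
    scanned, so later passes never have to re-find its end. The names are
    hypothetical and not taken from any of the tools discussed:)

        // A token stored as a span into the original source text.
        #[derive(Clone, Copy)]
        struct TokenSpan { start: usize, len: usize }

        // Scan one whitespace-delimited token, recording its length so
        // that later validation can iterate over exactly these bytes, once.
        fn next_token(src: &[u8], start: usize) -> TokenSpan {
            let mut end = start;
            while end < src.len() && !src[end].is_ascii_whitespace() {
                end += 1;
            }
            TokenSpan { start, len: end - start }
        }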

    In metamath-knife, the relevant code is:
    https://github.com/david-a-wheeler/metamath-knife/blob/071292ef3a10350d0dae9e409b4921f91a08ce95/src/parser.rs#L314-L331
    The read_labels() function is called first at statement position
    before we even get to the keyword token. So this is accepting the
    grammar `label* $[acdefpv]`. Conceivably you could make this a
    one-pass implementation, but as it stands it gets the token and
    then checks all of its characters: if a $ is present we abort
    (we have hit a keyword), otherwise they must all be valid label
    characters (or we error). The number of labels is later restricted
    from any number to just 0 or 1, depending on the keyword, when the
    keyword is parsed (in get_statement()).
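
    (A rough sketch of that acceptance loop, with hypothetical names and
    simplified error handling, not the actual read_labels() code:)

        // Accept the grammar `label* $[acdefpv]`: collect label tokens
        // until a token starting with '$' (the keyword) is reached.
        fn read_labels<'a>(tokens: &mut impl Iterator<Item = &'a str>)
            -> (Vec<&'a str>, Option<&'a str>)
        {
            let mut labels = Vec::new();
            for tok in tokens {
                if tok.starts_with('$') {
                    // Keyword reached; whether 0 or 1 labels were allowed
                    // is only decided later, once the keyword is known.
                    return (labels, Some(tok));
                }
                // A real implementation would verify here that every
                // character is a valid label character, and error otherwise.
                labels.push(tok);
            }
            (labels, None) // ran out of input before reaching a keyword
        }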

    Mario
