Thanks Mario, very clear and useful! I'll do the checks at parsing, as you
advise. Do you happen to remember in which file this is done in mmj2? I
skimmed through some files at
https://github.com/digama0/mmj2/tree/master/src/mmj but got lost in the
huge codebase.
I have a second question, concerning the "second-stage parsing" that mmj2
(and metamath-knife?) perform but metamath.c does not, namely the parsing
of the math expressions themselves. Do mmj2 and metamath-knife implement
their own parser generators, like Yacc? As I see it, the program would look
for syntactic axioms (i.e., label $a w ..... $. where w has *already*
appeared in an $f statement), and from each of them construct a production
rule; for instance, $a wff ( ph /\ ps ) $. yields the rule wff
-> '(' wff '/\' wff ')'. This is described in your "Models for Metamath".
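Concretely, I imagine something like the following (a rough Rust sketch
with made-up names, just to fix ideas; var_typecode would map each variable
to the typecode of its $f statement):

    use std::collections::HashMap;

    /// A production rule extracted from a syntactic axiom.
    struct Rule {
        lhs: String,      // the typecode, e.g. "wff"
        rhs: Vec<Symbol>, // e.g. '(' wff '/\' wff ')'
    }

    enum Symbol {
        Terminal(String),    // a constant such as "(" or "/\"
        NonTerminal(String), // the typecode of a variable
    }

    /// From `$a wff ( ph /\ ps ) $.`, where ph and ps got typecode wff
    /// from their $f statements, build wff -> '(' wff '/\' wff ')'.
    fn rule_of_axiom(typecode: &str, expr: &[String],
                     var_typecode: &HashMap<String, String>) -> Rule {
        Rule {
            lhs: typecode.to_owned(),
            rhs: expr.iter().map(|tok| match var_typecode.get(tok) {
                Some(tc) => Symbol::NonTerminal(tc.clone()),
                None => Symbol::Terminal(tok.clone()),
            }).collect(),
        }
    }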
Then, you hope that the resulting CFG is not too complex, and you feed
these rules (all at once at the end of parsing, or progressively as you
read the file?) to a Yacc-like program, either an external one or one that
you built yourself (as is the case for mmj2 and metamath-knife?). This
would be pretty impressive.
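If one builds the recognizer oneself, even a naive one makes the idea
concrete (a backtracking sketch over the Rule type above; it loops on
left-recursive rules and can be exponential, which is why I would expect a
real tool to use an LL/LR or Earley-style algorithm instead):

    /// All positions where a phrase of typecode `tc` starting at
    /// `toks[i..]` can end (naive backtracking, no memoization).
    fn parse(tc: &str, toks: &[String], i: usize, rules: &[Rule])
        -> Vec<usize>
    {
        let mut ends = Vec::new();
        for rule in rules.iter().filter(|r| r.lhs == tc) {
            // Thread the set of reachable positions through the rule body.
            let mut positions = vec![i];
            for sym in &rule.rhs {
                let mut next = Vec::new();
                for &p in &positions {
                    match sym {
                        Symbol::Terminal(t) => {
                            if p < toks.len() && &toks[p] == t {
                                next.push(p + 1);
                            }
                        }
                        Symbol::NonTerminal(nt) => {
                            next.extend(parse(nt, toks, p, rules));
                        }
                    }
                }
                positions = next;
            }
            ends.extend(positions); // complete matches of this rule
        }
        ends
    }

An expression toks of typecode wff would then parse exactly when
parse("wff", &toks, 0, &rules) contains toks.len().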
Can you point me to the Rust file or function that does that part?
Thanks,
BenoƮt
On Monday, May 30, 2022 at 7:59:58 AM UTC+2 [email protected] wrote:
> On Sun, May 29, 2022 at 11:24 AM Benoit <[email protected]> wrote:
>
>> I am writing a lexer and parser for Metamath databases, and I do not know
>> if it's better to check "string compliance" (e.g., questions like: is the
>> string "az%e" allowed to be a label?) at lexing or at parsing:
>>
>
> I think it is preferable to do the work during parsing.
>
>
>> At parsing: the lexer (skips whitespace and comments and) returns tokens
>> which are the Metamath keywords, as well as Ident(string) for anything
>> else. Then, the parser checks whether Ident(string) satisfies the rules:
>> e.g., when the lexer passes it the sequence '$[', Ident(file), '$]', it
>> calls a function is_valid_filename(file) which checks whether 'file'
>> contains characters forbidden in a filename, like '%'. This is not very
>> satisfactory, since there is some "rereading" of the characters of the
>> string 'file' (once at lexing and once at parsing).
>>
>
> The rereading is only necessary for labels (in declaration position),
> which are a relatively small proportion of tokens. Furthermore, it is only
> a "check" and does not need to create a new string: once you already know
> what class of token you expect you can check the rules for that token or
> list of tokens all at once.
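>
> For illustration, a minimal sketch of such a check for labels (the spec
> restricts label characters to letters, digits, '-', '_' and '.'):
>
>     /// Check a token already known to be in label position.
>     fn is_valid_label(tok: &[u8]) -> bool {
>         !tok.is_empty() && tok.iter().all(|&c| {
>             c.is_ascii_alphanumeric() || c == b'-' || c == b'_' || c == b'.'
>         })
>     }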
>
>> At lexing: one can tell if a string of characters is:
>> * a Metamath keyword;
>> * something which can be (and only be):
>>   * a math symbol;
>>   * a math symbol or label or filename;
>>   * a math symbol or label or filename or complete compressed proof block;
>>   * a math symbol or compressed proof block (complete or incomplete).
>> So the tokens given by the lexer are of one of these five kinds. This
>> ensures "unique reading" but is probably more cumbersome (even if not
>> directly coded from an automaton, the fact that the automaton has more
>> states will reflect in the code).
>>
>
> It will most likely slow the lexer down as well to have more classes, and
> you have to eliminate most of these classes most of the time.
>
>
>> Do you have any advice? How is it done in metamath.c and in Rust?
>> (I'm not very proficient in these languages.) In mmverify.py, very few
>> checks are done, so the problem does not arise.
>>
>
> The relevant metamath.c code is:
> https://github.com/metamath/metamath-exe/blob/ccf08b62dec87de81576c96f1e8f09a96f796589/src/mmpars.c#L456-L460
>
> It looks like it first parses all the statements without validation,
> meaning that the token preceding a $p gets shoved into an array whatever it
> is, and then there is a second pass which goes over all these tokens (which
> are actually pointers into the original text), re-parses up to the space to
> get the length, and re-checks all the characters in between to ensure they
> are valid label characters. I'm not sure I would recommend this method,
> because of the late validation, not to mention the triple scan due to not
> storing token lengths (which is a classic C-ism).
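>
> In Rust one would naturally store each token as a byte range into the
> source buffer, so the length is kept and every character is scanned only
> once. A minimal sketch (not metamath-knife's actual code):
>
>     /// A token as a half-open byte range into the source, so its
>     /// length never has to be recovered by re-scanning for a space.
>     #[derive(Clone, Copy)]
>     struct Span { start: usize, end: usize }
>
>     fn next_token(buf: &[u8], pos: &mut usize) -> Option<Span> {
>         while *pos < buf.len() && buf[*pos].is_ascii_whitespace() {
>             *pos += 1; // skip whitespace between tokens
>         }
>         let start = *pos;
>         while *pos < buf.len() && !buf[*pos].is_ascii_whitespace() {
>             *pos += 1; // consume the token itself
>         }
>         if *pos > start { Some(Span { start, end: *pos }) } else { None }
>     }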
>
> In metamath-knife, the relevant code is:
> https://github.com/david-a-wheeler/metamath-knife/blob/071292ef3a10350d0dae9e409b4921f91a08ce95/src/parser.rs#L314-L331
>
> The read_labels() function is called first, at statement position, before
> we even get to the keyword token. So this is accepting the grammar `label*
> $[acdefpv]`. Conceivably you could make this a one-pass implementation, but
> it just gets the token and then checks all the letters in the token, in
> case a $ is present (so we abort) or a non-label character appears (so we
> error). The number of labels is later restricted from any number to just
> 0 or 1, depending on the keyword, when the keyword is parsed (in
> get_statement()).
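>
> In outline (a simplified Rust sketch, not the actual metamath-knife
> code):
>
>     /// Accept `label* $<keyword>`: collect label tokens until the
>     /// first token containing '$', which starts the statement keyword.
>     fn read_labels<'a>(mut next: impl FnMut() -> Option<&'a [u8]>)
>         -> Result<(Vec<&'a [u8]>, &'a [u8]), String>
>     {
>         let mut labels = Vec::new();
>         while let Some(tok) = next() {
>             if tok.contains(&b'$') {
>                 return Ok((labels, tok)); // keyword reached: stop here
>             }
>             if !tok.iter().all(|&c| c.is_ascii_alphanumeric()
>                 || c == b'-' || c == b'_' || c == b'.')
>             {
>                 return Err("invalid character in label".to_string());
>             }
>             labels.push(tok);
>         }
>         Err("unexpected end of file".to_string())
>     }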
>
> Mario
>