Thanks Mario, very clear and useful! I'll do the checks at parsing, as you
advise. Do you happen to remember in which file this is done in mmj2? I
skimmed through some files at
https://github.com/digama0/mmj2/tree/master/src/mmj but got lost in the
huge codebase.
I have a second question, concerning the "second-stage parsing" that mmj2
(and metamath-knife?) perform but metamath.c does not, namely the parsing
of the math expressions themselves. Do mmj2 and metamath-knife implement
their own parser generators, like Yacc? As I see it, the program would look
for syntactic axioms (i.e., label $a w ..... $. where w has *already*
appeared in an $f statement), and from each of them construct a production
rule; for instance, $a wff ( ph /\ ps ) $. yields the rule wff
-> '(' wff '/\' wff ')'. This is described in your "Models for Metamath".
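Concretely, I imagine something like the following (a rough Rust sketch
with made-up names, just to fix ideas; var_typecode would map each variable
to the typecode of its $f statement):

    use std::collections::HashMap;

    /// A production rule extracted from a syntactic axiom.
    struct Rule {
        lhs: String,      // the typecode, e.g. "wff"
        rhs: Vec<Symbol>, // e.g. '(' wff '/\' wff ')'
    }

    enum Symbol {
        Terminal(String),    // a constant such as "(" or "/\"
        NonTerminal(String), // the typecode of a variable
    }

    /// From `$a wff ( ph /\ ps ) $.`, where ph and ps got typecode wff
    /// from their $f statements, build wff -> '(' wff '/\' wff ')'.
    fn rule_of_axiom(typecode: &str, expr: &[String],
                     var_typecode: &HashMap<String, String>) -> Rule {
        Rule {
            lhs: typecode.to_owned(),
            rhs: expr.iter().map(|tok| match var_typecode.get(tok) {
                Some(tc) => Symbol::NonTerminal(tc.clone()),
                None => Symbol::Terminal(tok.clone()),
            }).collect(),
        }
    }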
Then, you hope that the resulting CFG is not too complex, and you feed
these rules (all at once at the end of parsing, or progressively as you
read the file?) to a Yacc-like program, either an external one or one that
you built yourself (as is the case for mmj2 and metamath-knife?). This
would be pretty impressive.
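If one builds the recognizer oneself, even a naive one makes the idea
concrete (a backtracking sketch over the Rule type above; it loops on
left-recursive rules and can be exponential, which is why I would expect a
real tool to use an LL/LR or Earley-style algorithm instead):

    /// All positions where a phrase of typecode `tc` starting at
    /// `toks[i..]` can end (naive backtracking, no memoization).
    fn parse(tc: &str, toks: &[String], i: usize, rules: &[Rule])
        -> Vec<usize>
    {
        let mut ends = Vec::new();
        for rule in rules.iter().filter(|r| r.lhs == tc) {
            // Thread the set of reachable positions through the rule body.
            let mut positions = vec![i];
            for sym in &rule.rhs {
                let mut next = Vec::new();
                for &p in &positions {
                    match sym {
                        Symbol::Terminal(t) => {
                            if p < toks.len() && &toks[p] == t {
                                next.push(p + 1);
                            }
                        }
                        Symbol::NonTerminal(nt) => {
                            next.extend(parse(nt, toks, p, rules));
                        }
                    }
                }
                positions = next;
            }
            ends.extend(positions); // complete matches of this rule
        }
        ends
    }

An expression toks of typecode wff would then parse exactly when
parse("wff", &toks, 0, &rules) contains toks.len().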
Can you point me to the Rust file or function that does that part?
Thanks,
BenoƮt
On Monday, May 30, 2022 at 7:59:58 AM UTC+2 [email protected] wrote:
> On Sun, May 29, 2022 at 11:24 AM Benoit <[email protected]> wrote:
>
>> I am writing a lexer and parser for Metamath databases, and I do not know
>> if it's better to check "string compliance" (e.g., questions like: is the
>> string "az%e" allowed to be a label?) at lexing or at parsing:
>>
>
> I think it is preferable to do the work during parsing.
>
>
>> At parsing: the lexer (skips whitespace and comments and) returns tokens
>> which are the Metamath keywords, as well as Ident(string) for anything
>> else. Then, the parser checks whether Ident(string) satisfies the rules:
>> e.g., when the lexer passes it the sequence '$[', Ident(file), '$]', it
>> calls a function is_valid_filename(file) which checks whether 'file'
>> contains characters forbidden in a filename, like '%'. This is not very
>> satisfactory, since there is some "rereading" of the characters of the
>> string 'file' (once at lexing and once at parsing).
>>
>
> The rereading is only necessary for labels (in declaration position),
> which are a relatively small proportion of tokens. Furthermore, it is only
> a "check" and does not need to create a new string: once you already know
> what class of token you expect you can check the rules for that token or
> list of tokens all at once.
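>
> For illustration, a minimal sketch of such a check for labels (the spec
> restricts label characters to letters, digits, '-', '_' and '.'):
>
>     /// Check a token already known to be in label position.
>     fn is_valid_label(tok: &[u8]) -> bool {
>         !tok.is_empty() && tok.iter().all(|&c| {
>             c.is_ascii_alphanumeric() || c == b'-' || c == b'_' || c == b'.'
>         })
>     }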
>
>> At lexing: one can tell if a string of characters is:
>> * a Metamath keyword;
>> * something which can be (and only be):
>>   * a math symbol;
>>   * a math symbol or label or filename;
>>   * a math symbol or label or filename or complete compressed proof block;
>>   * a math symbol or compressed proof block (complete or incomplete).
>> So the tokens given by the lexer are of one of these five kinds. This
>> ensures "unique reading" but is probably more cumbersome (even if not
>> directly coded from an automaton, the fact that the automaton has more
>> states will reflect in the code).
>>
>
> It will most likely slow the lexer down as well to have more classes, and
> you have to eliminate most of these classes most of the time.
>
>
>> Do you have any advice? How is it done in metamath.c and in Rust?
>> (I'm not very proficient in these languages.) In mmverify.py, very few
>> checks are done, so the problem does not arise.
>>
>
> The relevant metamath.c code is:
> https://github.com/metamath/metamath-exe/blob/ccf08b62dec87de81576c96f1e8f09a96f796589/src/mmpars.c#L456-L460
>
> It looks like it first parses all the statements without validation,
> meaning that the token preceding a $p gets shoved into an array whatever it
> is, and then there is a second pass which goes over all these tokens (which
> are actually pointers into the original text), re-parses up to the space to
> get the length, and re-checks all the characters in between to ensure they
> are valid label characters. I'm not sure I would recommend this method,
> because of the late validation, not to mention the triple scan due to not
> storing token lengths (which is a classic C-ism).
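>
> In Rust one would naturally store each token as a byte range into the
> source buffer, so the length is kept and every character is scanned only
> once. A minimal sketch (not metamath-knife's actual code):
>
>     /// A token as a half-open byte range into the source, so its
>     /// length never has to be recovered by re-scanning for a space.
>     #[derive(Clone, Copy)]
>     struct Span { start: usize, end: usize }
>
>     fn next_token(buf: &[u8], pos: &mut usize) -> Option<Span> {
>         while *pos < buf.len() && buf[*pos].is_ascii_whitespace() {
>             *pos += 1; // skip whitespace between tokens
>         }
>         let start = *pos;
>         while *pos < buf.len() && !buf[*pos].is_ascii_whitespace() {
>             *pos += 1; // consume the token itself
>         }
>         if *pos > start { Some(Span { start, end: *pos }) } else { None }
>     }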
>
> In metamath-knife, the relevant code is:
> https://github.com/david-a-wheeler/metamath-knife/blob/071292ef3a10350d0dae9e409b4921f91a08ce95/src/parser.rs#L314-L331
>
> The read_labels() function is called first, at statement position, before
> we even get to the keyword token. So this is accepting the grammar `label*
> $[acdefpv]`. Conceivably you could make this a one-pass implementation, but
> it just gets the token and then checks all the letters in the token, in
> case a $ is present (so we abort) or a non-label character appears (so we
> error). The number of labels is later restricted from any number to just
> 0 or 1, depending on the keyword, when the keyword is parsed (in
> get_statement()).
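>
> In outline (a simplified Rust sketch, not the actual metamath-knife
> code):
>
>     /// Accept `label* $<keyword>`: collect label tokens until the
>     /// first token containing '$', which starts the statement keyword.
>     fn read_labels<'a>(mut next: impl FnMut() -> Option<&'a [u8]>)
>         -> Result<(Vec<&'a [u8]>, &'a [u8]), String>
>     {
>         let mut labels = Vec::new();
>         while let Some(tok) = next() {
>             if tok.contains(&b'$') {
>                 return Ok((labels, tok)); // keyword reached: stop here
>             }
>             if !tok.iter().all(|&c| c.is_ascii_alphanumeric()
>                 || c == b'-' || c == b'_' || c == b'.')
>             {
>                 return Err("invalid character in label".to_string());
>             }
>             labels.push(tok);
>         }
>         Err("unexpected end of file".to_string())
>     }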
>
> Mario
>