On Sun, May 29, 2022 at 11:24 AM Benoit <[email protected]> wrote:

> I am writing a lexer and parser for Metamath databases, and I do not know
> if it's better to check "string compliance" (e.g., questions like: is the
> string "az%e" allowed to be a label?) at lexing or at parsing:
>

I think it is preferable to do the work during parsing.


> At parsing: the lexer (skips whitespace and comments and) returns tokens
> which are the Metamath keywords as well as Ident(string) for anything
> else.  Then, the parser checks if Ident(string) satisfies the rules, e.g.,
> when being passed by the lexer the sequence '$[', Ident(file), '$]', it
> will call a function is_valid_filename(file) which will check if 'file'
> contains forbidden characters for a filename, like '%'.  It is not very
> satisfactory since there is some "rereading" of the characters of the
> string 'file' (once at lexing and once at parsing).
>

The rereading is only necessary for labels (in declaration position), which
are a relatively small proportion of tokens. Furthermore, it is only a
"check" and does not need to create a new string: once you know what class
of token to expect, you can check the rules for that token or list of
tokens all at once.
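
For concreteness, here is a minimal sketch of such a check (the names
is_label_char and is_valid_label are mine, not from metamath.c or
metamath-knife): it runs in place over the already-lexed bytes and never
builds a new string.

```rust
// Hypothetical helper: a Metamath label may only contain ASCII letters,
// digits, and the characters '-', '_', '.'.
fn is_label_char(c: u8) -> bool {
    c.is_ascii_alphanumeric() || c == b'-' || c == b'_' || c == b'.'
}

// Check an already-lexed token slice in place; no new string is allocated.
fn is_valid_label(token: &[u8]) -> bool {
    !token.is_empty() && token.iter().all(|&c| is_label_char(c))
}
```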

> At lexing: one can tell if a string of characters is:
> * a Metamath keyword;
> * something which can be (and only be):
>    * a math symbol;
>    * a math symbol or label or filename;
>    * a math symbol or label or filename or complete compressed proof block;
>    * a math symbol or compressed proof block (complete or incomplete);
> So the tokens given by the lexer are of one of these five kinds.  This
> ensures "unique reading" but is probably more cumbersome (even if not
> directly coded from an automaton, the fact that the automaton has more
> states will be reflected in the code).
>

Having more token classes will also most likely slow the lexer down, and
the parser still has to rule out most of these classes most of the time
anyway.
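
To make the contrast concrete, here is a rough sketch of the simpler
two-class tokenizer (hypothetical types, not taken from any existing
verifier): the lexer only separates keyword tokens from everything else and
leaves the finer classification to the parser.

```rust
// Hypothetical two-class token: the lexer only distinguishes keywords
// (tokens starting with '$') from arbitrary non-whitespace runs.
enum Token<'a> {
    Keyword(&'a [u8]),
    Ident(&'a [u8]), // parser decides later: label, math symbol, filename, ...
}

// Return the next token starting at *pos, or None at end of input.
// (Comment skipping is omitted for brevity.)
fn next_token<'a>(input: &'a [u8], pos: &mut usize) -> Option<Token<'a>> {
    while *pos < input.len() && input[*pos].is_ascii_whitespace() {
        *pos += 1;
    }
    let start = *pos;
    while *pos < input.len() && !input[*pos].is_ascii_whitespace() {
        *pos += 1;
    }
    if start == *pos {
        return None;
    }
    let tok = &input[start..*pos];
    Some(if tok[0] == b'$' {
        Token::Keyword(tok)
    } else {
        Token::Ident(tok)
    })
}
```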


> Do you have any advice?  How is it done in metamath.c and in Rust?  (I'm
> not very proficient in these languages.)  In mmverify.py, very few checks
> are done, so the problem does not arise.
>

The relevant metamath.c code is:
https://github.com/metamath/metamath-exe/blob/ccf08b62dec87de81576c96f1e8f09a96f796589/src/mmpars.c#L456-L460
.
It looks like it first parses all the statements without validation,
meaning that the token preceding a $p gets shoved into an array whatever it
is.  Then there is a second pass which goes over all these tokens (which
are actually pointers into the original text), re-parses up to the space to
get the length, and re-checks all the characters in between to ensure they
are valid label characters.  I'm not sure I would recommend this method,
because of the late validation, not to mention the triple scan due to not
storing token lengths (which is a classic C-ism).

In metamath-knife, the relevant code is:
https://github.com/david-a-wheeler/metamath-knife/blob/071292ef3a10350d0dae9e409b4921f91a08ce95/src/parser.rs#L314-L331
.
The read_labels() function is called first, at statement position, before
we even get to the keyword token, so it is accepting the grammar `label*
$[acdefpv]`.  Conceivably you could make this a one-pass implementation,
but as written it just gets the token and then checks all its characters:
if a $ is present we stop (it is the keyword), and otherwise every
character must be a valid label character (or we report an error).  The
number of labels is only restricted later, from any number down to just 0
or 1 depending on the keyword, when the keyword is parsed (in
get_statement()).
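
As a very rough approximation of that shape (the signature and error
handling here are mine; the real code is at the link above), the loop looks
something like:

```rust
// Hypothetical sketch of the `label* $[acdefpv]` shape: collect label tokens
// until a keyword token (one containing '$') ends the label list.
fn read_labels<'a, I>(tokens: &mut I) -> Result<(Vec<&'a [u8]>, Option<&'a [u8]>), String>
where
    I: Iterator<Item = &'a [u8]>,
{
    let mut labels = Vec::new();
    for tok in tokens {
        if tok.contains(&b'$') {
            // A '$' means this is the keyword token: stop collecting labels.
            return Ok((labels, Some(tok)));
        }
        // Otherwise every character must be a valid label character.
        if !tok.iter().all(|&c| {
            c.is_ascii_alphanumeric() || c == b'-' || c == b'_' || c == b'.'
        }) {
            return Err(format!("invalid label token: {:?}", tok));
        }
        labels.push(tok);
    }
    Ok((labels, None))
}
```

Keeping the tokens as byte slices into the original buffer means this check
needs no extra allocation.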

Mario
