On Sun, May 29, 2022 at 11:24 AM Benoit <[email protected]> wrote:
> I am writing a lexer and parser for Metamath databases, and I do not
> know if it's better to check "string compliance" (e.g., questions
> like: is the string "az%e" allowed to be a label?) at lexing or at
> parsing.

I think it is preferable to do the work during parsing.

> At parsing: the lexer (skips whitespaces and comments and) returns
> tokens which are the Metamath keywords, as well as Ident(string) for
> anything else. Then the parser checks whether Ident(string) satisfies
> the rules; e.g., when passed by the lexer the sequence '$[',
> Ident(file), '$]', it will call a function is_valid_filename(file)
> which checks whether 'file' contains characters forbidden in a
> filename, like '%'. This is not very satisfactory, since there is
> some "rereading" of the characters of the string 'file' (once at
> lexing and once at parsing).

The rereading is only necessary for labels (in declaration position),
which are a relatively small proportion of tokens. Furthermore, it is
only a "check" and does not need to create a new string: once you
already know what class of token you expect, you can check the rules
for that token or list of tokens all at once.

> At lexing: one can tell whether a string of characters is:
> * a Metamath keyword;
> * something which can be (and only be):
>   * a math symbol;
>   * a math symbol or label or filename;
>   * a math symbol or label or filename or complete compressed proof
>     block;
>   * a math symbol or compressed proof block (complete or incomplete).
> So the tokens given by the lexer are of one of these five kinds. This
> ensures "unique reading" but is probably more cumbersome (even if the
> lexer is not directly coded from an automaton, the fact that the
> automaton has more states will be reflected in the code).

It will most likely slow the lexer down as well to have more classes,
and you have to eliminate most of these classes most of the time.

> Do you have any advice? How is it done in metamath.c and in the Rust
> verifier? (I'm not very proficient in these languages.) In
> mmverify.py, very few checks are done, so the problem does not arise.

The relevant metamath.c code is:
https://github.com/metamath/metamath-exe/blob/ccf08b62dec87de81576c96f1e8f09a96f796589/src/mmpars.c#L456-L460
It looks like it first parses all the statements without validation,
meaning that the token preceding a $p gets shoved into an array
whatever it is, and then there is a second pass which goes over all
these tokens (which are actually pointers into the original text),
re-parses up to the space to get the length, and re-checks all the
characters in between to ensure they are valid label characters. I'm
not sure I would recommend this method, because of the late
validation, not to mention the triple scan due to not storing token
lengths (which is a classic C-ism).

In metamath-knife, the relevant code is:
https://github.com/david-a-wheeler/metamath-knife/blob/071292ef3a10350d0dae9e409b4921f91a08ce95/src/parser.rs#L314-L331
The read_labels() function is called first, at statement position,
before we even get to the keyword token. So this is accepting the
grammar `label* $[acdefpv]`. Conceivably you could make this a
one-pass implementation, but it just gets the token and then checks
all the letters in the token, in case a $ is present (so we abort) or
it contains non-label characters (so we error). The number of labels
is then restricted from any number to just 0 or 1, depending on the
keyword, when the keyword is parsed later (in get_statement()).
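To make the "check at parse time" suggestion concrete, here is a
minimal Rust sketch. None of the names below come from metamath-knife
or metamath.c; the Token type and the two helpers are made up for
illustration, and the filename rule in particular is a placeholder for
whatever rule you settle on. The lexer only distinguishes keywords
from identifiers, and the parser runs a cheap byte scan over the slice
it already has, exactly where it expects a given token class, without
building a new string:

    #[derive(Debug)]
    enum Token<'a> {
        Keyword(&'a str), // $c, $v, $p, $[, $], ...
        Ident(&'a str),   // anything else; its class is decided by the parser
    }

    // Labels are letters, digits, '-', '_' and '.' per the Metamath spec.
    fn is_valid_label(s: &str) -> bool {
        !s.is_empty()
            && s.bytes()
                .all(|b| b.is_ascii_alphanumeric() || b"-_.".contains(&b))
    }

    // Placeholder filename check: it only scans the existing slice and
    // never allocates. Substitute the real character rules here.
    fn is_valid_filename(s: &str) -> bool {
        !s.is_empty() && !s.contains('$')
    }

    // Parse `$[ filename $]`, validating the filename only now that we
    // know that is what the Ident has to be.
    fn parse_include<'a, I>(tokens: &mut I) -> Result<&'a str, String>
    where
        I: Iterator<Item = Token<'a>>,
    {
        let file = match tokens.next() {
            Some(Token::Ident(f)) if is_valid_filename(f) => f,
            t => return Err(format!("expected a valid filename, got {:?}", t)),
        };
        match tokens.next() {
            Some(Token::Keyword("$]")) => Ok(file),
            t => Err(format!("expected $], got {:?}", t)),
        }
    }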
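And a rough sketch of the shape of such a one-pass label reader
(illustrative again, not the actual read_labels() from parser.rs),
reusing is_valid_label() from the sketch above. It accepts the grammar
`label* $keyword` and hands the keyword token back to the caller:

    fn read_labels<'a, I>(
        tokens: &mut I,
    ) -> Result<(Vec<&'a str>, Option<&'a str>), String>
    where
        I: Iterator<Item = &'a str>,
    {
        let mut labels = Vec::new();
        loop {
            match tokens.next() {
                None => return Ok((labels, None)),
                // A '$' means we reached the statement keyword: stop
                // collecting labels and let the caller deal with it.
                Some(tok) if tok.contains('$') => {
                    return Ok((labels, Some(tok)))
                }
                Some(tok) if is_valid_label(tok) => labels.push(tok),
                Some(tok) => {
                    return Err(format!("invalid label token {:?}", tok))
                }
            }
        }
    }

The caller then rejects any label count other than 0 or 1 once it
knows which keyword ended the run, which is the same division of labor
as in get_statement().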
Mario
