On Mon, May 30, 2022 at 6:19 PM Benoit <[email protected]> wrote:
> Thanks Mario, very clear and useful! I'll do that at parsing, as you
> advise. Do you happen to remember in which file it is done in mmj2? I
> skimmed through some files at
> https://github.com/digama0/mmj2/tree/master/src/mmj but got lost in the
> huge codebase.

It's here:
https://github.com/digama0/mmj2/blob/master/src/mmj/mmio/Statementizer.java#L392-L393

At this point in the code, it has decided to read the next token: it gets back a string, checks that it is not a keyword, and then validates the token as a label (which checks both the label character restrictions and the banned statement names like 'aux', 'prn', etc., which is kind of interesting since this is only a "verify markup" lint in metamath-exe).

> I have a second question concerning the "second-stage parsing" that mmj2
> (and metamath-knife?) do, but not metamath.c. Namely, this is the parsing
> of the math expressions themselves. Do mmj2 and metamath-knife code their
> own "parser generators" like Yacc? As I see it, the program would look
> for syntactic axioms (i.e., label $a w ..... $. where w has *already*
> appeared in an $f statement), and from each of them it constructs a
> production rule; for instance, $a wff ( ph /\ ps ) $. yields the rule
> wff -> '(' wff '/\' wff ')'. This is described in your "Models for
> Metamath". Then, you hope that your CFG is not too complex and you give
> these rules (all at once at the end of parsing, or progressively as you
> read the file?) to a Yacc-like program, either external or one that you
> built (as is the case for mmj2 and metamath-knife?). This would be pretty
> impressive. Can you point me to the Rust file or function that would do
> that part?

The mechanisms for doing this have evolved over time. Most systems that support math parsing implement what I would call a "dynamic general backtracking CFG parser", since this is definitely the easiest to code and understand, and its performance is also good enough.
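For concreteness, the rule-extraction step described in the question can be sketched in a few lines of Python. This is an illustrative sketch, not code from mmj2 or metamath-knife; the names (`production`, `floats`) are made up. The idea: variables already declared by $f statements are replaced by nonterminals named after their sort, and everything else becomes a literal constant token.

```python
# Hedged sketch: extracting a production rule from a syntax axiom.
# `floats` stands in for the $f statements seen so far (variable -> sort).

def production(typecode, symbols, floats):
    """Turn the math string of a $a statement into a production rule:
    variables become nonterminals named by their sort, constants stay
    as literal tokens."""
    rhs = [("var", floats[s]) if s in floats else ("const", s)
           for s in symbols]
    return (typecode, rhs)

# $f wff ph $.  and  $f wff ps $.  give:
floats = {"ph": "wff", "ps": "wff"}

# $a wff ( ph /\ ps ) $.  yields the rule  wff -> '(' wff '/\' wff ')'
rule = production("wff", ["(", "ph", "/\\", "ps", ")"], floats)
```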
mmj2 has the EarleyParser for this; it is the original parser, which was in use before I started working on the project. I think the cleanest implementations of this kind of parser are the one in mm-scala
(https://github.com/digama0/mm-scala/blob/master/src/main/scala/org/metamath/scala/Grammar.scala)
and the Haskell version that is part of mm0-hs
(https://github.com/digama0/mm0/blob/master/mm0-hs/src/MM0/FromMM/Parser.hs#L573-L610).

Basically, rather than putting all the productions into a BNF grammar and handing it to Yacc to generate a state machine, the compilation to a state machine is done in the code itself. It's not so hard to do: each state has a map from constant token to next state, a map from variable sort to next state, and possibly a reduction (a proof tree fragment) to apply if this is a terminal node (or a list of reductions, although you should never have more than one reduction on a node if the grammar is unambiguous). Adding a new rule entails starting at the root, traversing all the symbols to get to the next state (creating a new empty state whenever needed), and then putting the provided reduction in the last node. This results in a tree of nodes (a trie), and you can navigate this trie to parse formulas. The backtracking happens at "runtime": you have to consider all the ways you can apply variable transitions to parse subformulas, which can be done with a recursive function.

The other way to implement math parsing is with a variation on an LR(0) linear-time parser. This is more complex (assuming you at least want it to work on set.mm, which is not actually LR(0)), and it only works on a subset of unambiguous grammars, but it has the advantage of being O(n), instead of the O(n^3) worst case of the backtracking parser. There are two implementations of this kind of parser: the LRParser in mmj2 and tirix's grammar parser in metamath-knife.
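The trie-plus-backtracking description above can be made concrete with a small Python sketch. The names are illustrative (this is not mmj2's or mm-scala's actual code), and it reduces greedily at the first terminal node, which is only safe when no rule is a proper prefix of another; it also assumes the grammar is not left-recursive.

```python
# Sketch of a dynamic backtracking CFG parser as a trie of states.

class Node:
    def __init__(self):
        self.consts = {}    # constant token -> next Node
        self.vars = {}      # variable sort  -> next Node
        self.reduce = None  # rule label if a production ends here

def add_rule(root, label, rhs):
    """Add a production: walk/extend the trie along the symbols of rhs
    (pairs ('const', token) or ('var', sort)), then record the reduction
    at the final node. Unambiguity means at most one reduction per node."""
    node = root
    for kind, sym in rhs:
        table = node.consts if kind == "const" else node.vars
        node = table.setdefault(sym, Node())
    node.reduce = label

def parse(roots, sort, toks, i):
    """Parse a formula of `sort` starting at toks[i]; returns
    (parse tree, next index) or None. The backtracking lives in the
    loop over variable transitions, each tried via a recursive call."""
    def walk(node, i, args):
        if node.reduce is not None:          # terminal node: apply reduction
            return (node.reduce, args), i    # (greedy; see caveat above)
        if i < len(toks) and toks[i] in node.consts:
            res = walk(node.consts[toks[i]], i + 1, args)
            if res is not None:
                return res
        for s, nxt in node.vars.items():     # try parsing a subformula
            sub = parse(roots, s, toks, i)
            if sub is not None:
                tree, j = sub
                res = walk(nxt, j, args + [tree])
                if res is not None:
                    return res
        return None
    return walk(roots[sort], i, [])

# Toy grammar: wff -> 'T.'  and  wff -> '(' wff '/\' wff ')'
roots = {"wff": Node()}
add_rule(roots["wff"], "wtru", [("const", "T.")])
add_rule(roots["wff"], "wa", [("const", "("), ("var", "wff"),
                              ("const", "/\\"), ("var", "wff"),
                              ("const", ")")])
```

Parsing `( T. /\ T. )` with this toy grammar yields the proof tree fragment `("wa", [("wtru", []), ("wtru", [])])`, i.e. the syntax axioms to apply and their arguments.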
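By contrast, the LR-style approach drives everything from a deterministic shift/reduce decision. Here is a toy Python illustration with the decisions hand-coded for a two-rule grammar; real implementations such as mmj2's LRParser compute these decisions as tables from the grammar rather than hard-coding them.

```python
# Toy shift-reduce loop for the grammar
#   wff -> 'T.'
#   wff -> '(' wff '/\' wff ')'
# The "table" is hard-coded: reduce whenever the stack top matches a
# right-hand side, otherwise shift the next token. Each token is shifted
# once and participates in a bounded number of reductions, giving the
# linear-time behavior described above.

def try_reduce(stack):
    """Apply one reduction at the top of the stack, if possible."""
    if stack and stack[-1] == "T.":
        stack[-1] = ("wtru", [])
        return True
    if (len(stack) >= 5 and stack[-5] == "("
            and isinstance(stack[-4], tuple) and stack[-3] == "/\\"
            and isinstance(stack[-2], tuple) and stack[-1] == ")"):
        tree = ("wa", [stack[-4], stack[-2]])
        del stack[-5:]
        stack.append(tree)
        return True
    return False

def parse_lr(toks):
    stack = []
    for tok in toks:
        stack.append(tok)         # shift
        while try_reduce(stack):  # reduce as long as possible
            pass
    if len(stack) != 1 or not isinstance(stack[0], tuple):
        raise ValueError("parse error")
    return stack[0]
```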
The disadvantage is that it doesn't work very well as a dynamic parser: mmj2 has to parse the whole file and construct the parse tables before it can use them on everything else. It's a lot closer to the "build a Yacc file" approach you mentioned. (As far as I know, no one has actually implemented that approach, although I have certainly considered it before.) It is broadly similar to the general parser approach with its nodes and transitions, but there is a deterministic rule for selecting the next state given each input token, and the state graph is expanded in certain places to ensure the correctness of that rule.

Mario

--
You received this message because you are subscribed to the Google Groups "Metamath" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/metamath/CAFXXJSu-k0G5vJ230sT8eUL02qqR4r-_mLmU6HhMYmfdH%2B93Wg%40mail.gmail.com.
