On Mon, May 30, 2022 at 6:19 PM Benoit <[email protected]> wrote:

> Thanks Mario, very clear and useful!  I'll do that at parsing, as you
> advise.  Do you happen to remember in which file it is done in mmj2?  I
> skimmed through some files at
> https://github.com/digama0/mmj2/tree/master/src/mmj but got lost in the
> huge codebase.
>

It's here:
https://github.com/digama0/mmj2/blob/master/src/mmj/mmio/Statementizer.java#L392-L393
At this point in the code, it has decided to read the next token: it gets
back a string, checks that it is not a keyword, and then validates the
token as a label. That validation checks both the label character
restrictions and the banned statement names like 'aux', 'prn', etc., which
is kind of interesting since this is only a "verify markup" lint in
metamath-exe.
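The check itself is small. Here is a hedged Python sketch of what such
validation does (the character class follows the Metamath spec's label
rules; the banned-name set is abbreviated and illustrative, since
metamath-exe's "verify markup" flags labels that collide with reserved
Windows device filenames, labels being usable as output filenames):

```python
import re

# Metamath labels may use letters, digits, and the characters '-', '_', '.'
LABEL_RE = re.compile(r"[A-Za-z0-9._-]+")

# Abbreviated, illustrative subset of the reserved device names that
# metamath-exe's "verify markup" warns about (the real list is longer).
BANNED_NAMES = {"aux", "prn", "con", "nul", "com1", "lpt1"}

def is_valid_label(token: str) -> bool:
    """Check a token against the label rules described above."""
    return (LABEL_RE.fullmatch(token) is not None
            and token.lower() not in BANNED_NAMES)
```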

> I have a second question concerning the "second-stage parsing" that mmj2
> (and metamath-knife?) do, but not metamath.c.  Namely, this is the parsing
> of the math expressions themselves.  Do mmj2 and metamath-knife code their
> own "parser generators" like Yacc?  As I see it, the program would look
> for syntactic axioms (i.e., label $a w ..... $.  where w has *already*
> appeared in an $f statement), and from each of them it constructs a
> production rule; for instance, $a wff ( ph /\ ps ) $. yields the rule
> wff -> '(' wff '/\' wff ')'.  This is described in your "Models for
> Metamath".  Then, you hope that your CFG is not too complex and you give
> these rules (all at once at the end of parsing, or progressively as you
> read the file?) to a Yacc-like program, either external or one that you
> built (as is the case for mmj2 and metamath-knife?).  This would be
> pretty impressive.  Can you point me to the rust file or function that
> would do that part?
>

The mechanisms for doing this have evolved over time. Most systems that
support math parsing implement what I would call a "dynamic general
backtracking CFG parser", since it is definitely the easiest to code and
understand, and its performance is good enough. mmj2 has the EarleyParser
for this; it is the original parser that was used before I started working
on the project. I think the cleanest implementation of this kind of parser
is the one in mm-scala (
https://github.com/digama0/mm-scala/blob/master/src/main/scala/org/metamath/scala/Grammar.scala)
or the Haskell version, which is part of mm0-hs (
https://github.com/digama0/mm0/blob/master/mm0-hs/src/MM0/FromMM/Parser.hs#L573-L610).
Basically, rather than putting all the productions into a BNF and handing
it to Yacc to generate a state machine, the compilation to a state machine
is done in the code itself.

It's not so hard to do: each state has a map from constant token to next
state, a map from variable sort to next state, and a possible reduction
(a proof tree fragment) to apply if this is a terminal node (or a list of
reductions, although you should never have more than one reduction on a
node if the grammar is unambiguous). Adding a new rule entails starting at
the root, traversing all the symbols to get to the next state (creating a
new empty state if needed), and then putting the provided reduction in the
last node. This results in a tree of nodes (a trie), and you can navigate
this trie to parse formulas. The backtracking part happens at "runtime":
you have to consider all the ways that variable transitions can be applied
to parse subformulas, which can be done with a recursive function.
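To make the shape concrete, here is a toy Python sketch of that trie plus
backtracking scheme. All names are invented for illustration; this is not
mmj2's or mm-scala's actual API, and it does not handle left-recursive
rules (which would make the recursion loop), ambiguity reporting, etc.:

```python
# Toy sketch of the trie-based backtracking parser described above.

class State:
    def __init__(self):
        self.consts = {}    # constant token -> next State
        self.vars = {}      # variable sort  -> next State
        self.reduce = None  # (label, sort) if a production ends here

class Grammar:
    def __init__(self):
        self.roots = {}     # sort -> root State of that sort's trie

    def add_rule(self, label, sort, symbols):
        """symbols is a list of ("c", token) or ("v", sort) pairs."""
        st = self.roots.setdefault(sort, State())
        for kind, x in symbols:
            table = st.consts if kind == "c" else st.vars
            st = table.setdefault(x, State())
        st.reduce = (label, sort)  # the reduction lives on the last node

    def parse(self, sort, tokens, i=0):
        """Yield (tree, next_index) for every way to parse a `sort`
        expression starting at tokens[i]."""
        root = self.roots.get(sort)
        if root is not None:
            yield from self._walk(root, tokens, i, [])

    def _walk(self, st, tokens, i, children):
        if st.reduce is not None:
            label, _ = st.reduce
            yield (label, list(children)), i
        if i < len(tokens) and tokens[i] in st.consts:
            yield from self._walk(st.consts[tokens[i]], tokens, i + 1,
                                  children)
        # the backtracking part: try every parse of every variable sort here
        for subsort, nxt in st.vars.items():
            for tree, j in self.parse(subsort, tokens, i):
                yield from self._walk(nxt, tokens, j, children + [tree])
```

For example, after adding rules for the wff variables ph and ps and the
rule wff -> '(' wff '/\' wff ')' (labeled wa, say), parsing the token list
( ph /\ ps ) yields the single complete parse ("wa", [("wph", []),
("wps", [])]), consuming all five tokens.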

The other way to implement math parsing is using a variation on an LR(0)
linear-time parser. This is more complex (assuming you at least want it to
work on set.mm, which is not actually LR(0)), and it only works on a subset
of unambiguous grammars, but it has the advantage of being O(n) instead of
O(n^3) worst case for the backtracking parser. There are two
implementations of this kind of parser: the LRParser in mmj2 and tirix's
grammar parser in metamath-knife. The disadvantage is that it doesn't work
very well as a dynamic parser: mmj2 has to parse the whole file, construct
the parse tables, and then use them on everything else. It's a lot closer
to the "build a Yacc file" approach you mentioned. (As far as I know, no
one has actually implemented something like that, although I have
certainly considered it before.) It is broadly similar to the general
parser approach with the nodes and transitions, but there is a
deterministic rule for selecting the next state given each input token,
and the state graph is expanded in certain places in order to ensure the
correctness of the rule.
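For contrast, here is a naive shift-reduce loop over the same example
grammar. This is emphatically not a real LR(0) table construction (no item
sets, no state graph); it only illustrates the key property that when the
choice between shifting and reducing is deterministic, the parse is a
single linear pass with no backtracking:

```python
# Naive shift-reduce sketch: at each step, reduce if the top of the stack
# matches a rule's right-hand side, otherwise shift the next token. For
# this toy grammar that choice is always deterministic.

RULES = [
    ("wa",  ["(", "wff", "/\\", "wff", ")"]),
    ("wph", ["ph"]),
    ("wps", ["ps"]),
]

def parse(tokens):
    stack, trees, i = [], [], 0  # grammar symbols, parallel parse trees
    while True:
        for label, rhs in RULES:
            n = len(rhs)
            if stack[-n:] == rhs:
                children = trees[-n:]
                del stack[-n:]
                del trees[-n:]
                stack.append("wff")  # every rule here produces a wff
                trees.append((label, children))
                break
        else:
            if i == len(tokens):
                break
            stack.append(tokens[i])
            trees.append(tokens[i])
            i += 1
    return trees[0] if stack == ["wff"] else None
```

On the token list ( ph /\ ps ) this returns the tree
("wa", ["(", ("wph", ["ph"]), "/\\", ("wps", ["ps"]), ")"]), and it
returns None on input it cannot reduce to a single wff.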

Mario

-- 
You received this message because you are subscribed to the Google Groups 
"Metamath" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/metamath/CAFXXJSu-k0G5vJ230sT8eUL02qqR4r-_mLmU6HhMYmfdH%2B93Wg%40mail.gmail.com.