On Mon, Aug 16, 2021 at 11:56 PM Thierry Arnoux <[email protected]>
wrote:
> Hi Mario,
>
> Indeed set.mm's grammar is unambiguous, and as you write, these are not
> true ambiguities, but just prefixes which my initial naive implementation
> did not parse correctly. The concept of "garden-path" was new to me, but
> describes well this problem: do you think I shall rename the keyword?
>
I'm not too fussed about the name of the keyword, although if it is
something specific to your algorithm then it should probably be in the
keyword.
> Unfortunately the way I wrote the parser, these are not detected naturally
> when building the grammar, and I don't see how they could. The first pass
> is simply adds all syntactic axioms to the machine. Each different
> construct is added independently, they don't interfere with each other. In
> the example of my last mail, there are 3 different paths, " *setvar* *e.*
> *setvar* ", " *(* *wff* */\* *wff* *)* ", and " *(* *setvar* *e.* *class*
> *|->* *class* *)* ". Even if the last two share " *(* " as a common
> prefix, they differ afterwards because one is followed by a *setvar*,
> while the other is followed by a *class*, so the state machine goes to
> diverging states at that point.
>
I completely agree that it would have been better if the parser could have
> figured them out purely based on the indication that the grammar is KLR(5).
> The way I initially though about handling that was to let the parser
> consider each possible combination of production against each production,
> and see if there was a conflict. However this would have been difficult to
> implement and computationally very intensive. Giving those hints was the
> best compromise I could find. I'd be happy if you or anyone else wants to
> have a closer look at the algorithm and propose improvements! In any case,
> I'll try to write down detailed explanations so that it is not only in
> emails.
>
I have a description of the KLR algorithm from a discussion with Stan Polu
which I may or may not have posted here before. I'll put it at the bottom
of this message. To summarize, it uses the standard LR(0) parse table
generation algorithm to produce the tree of states. Doing this naively
results in shift-reduce conflicts, and these conflicts are resolved by
compositing the rule to be reduced so that both sides become shifts.
> You have a valid point about the risk of breakage when a new syntax is
> introduced. But these prefix ambiguities (or "garden paths") arose in
> set.mm only with basic constructs like class builders and maps-to
> notations, which we don't introduce anymore, or extremely rarely: df-opab
> was introduced in 1994, df-oprab in 1995, and df-mpt in 2008, and never
> modified since. So the introduction of such garden paths is
> once-in-a-decade event :-)
>
> The meaning of the parser command " ambiguous_prefix X => Y " is "X
> shadows Y", i.e., one may initially mistakenly read "X", although "Y" might
> also be valid for the same prefix. "X" is always a prefix of a valid
> set.mm expression, like " ( x e. A " is a prefix for a mapping-to
> expression " ( x e. A |-> B ) ".
>
> As for the syntax of the parser commands, they could also have been put in
> quoted strings. The (MM) parser unquotes the string if it finds quotes, and
> you can escape quotes, so that the unquoted ( is equivalent with the quoted
> '('. For a semicolon, you could write ';', and '\'' for a single quote. The
> way you write them also works.
>
Is that parsing specific to the ambiguous_prefix command? Because I would
prefer for the outer $j / $t parser to not have to be too smart and know
all the commands (especially since $j commands are specifically specced to
be ignorable if the parser doesn't know the head keyword). Since spaces are
optional inside $j / $t blocks (at least around `;`), you need to be able
to lex properly, and allowing arbitrary tokens would make this difficult.
That is, I'm advocating for a restrictive outer grammar where everything is
identifiers and strings, and then you can do some inner parsing specific to
the ambiguous_prefix command if you want.
===========================================
[This](
https://github.com/digama0/mmj2/blob/master/src/mmj/verify/LRParser.java#L151-L208)
is the new part of the code that distinguishes the KLR parser from LR(0). A
"conflict" is a place where an LR(0) parser would fail outright.
During parse table generation each state is associated with a bunch of
partially read productions that agree on a common prefix, and in this case
you are stuck at the state:
```
-> { o <. setvar , setvar >. | wff }
-> { o class }
```
This is known as a shift-reduce conflict, and usually shifting is the right
answer, so that's built in as a heuristic, which is why lark takes the
first option over the second. But neither choice is "correct", because this
changes the grammar - you are now rejecting a string that should be valid
to the grammar (`{ <. x , y >. }` in this case) - so essentially you are
finding a "closest LALR(1) approximation" to the grammar when you use lark
with this heuristic.
To resolve this, the question is what to do from that state if you read a
`<.`. We haven't actually hit the conflict yet. In the first production
it's clear that we should step to `-> { <. o setvar , setvar >. } | ph }`,
but the second production requires us to look at the `class` nonterminals
that start with `<.`. (In fact, in the first state we also have all
productions for the class nonterminal, like `-> o 1` and `-> o ( class u.
class )` and many others. These are represented in a special way in
LRParser.java to save space.) Let's step through the states that the
example takes. The `shift <.` step takes us to:
```
-> { <. o setvar , setvar >. | wff }
-> <. o class , class >.
all setvar -> o rules
all class -> o rules
```
and `shift x` takes us to:
```
-> x o
```
Since we are now at the end of a production, we can reduce with `setvar ->
x` at this point, and there are no competing productions so this is safe.
This `reduce x` edge pops the stack and acts like a `shift setvar` edge
from the previous step, leading to:
```
-> { <. setvar o , setvar >. | wff }
-> setvar o
```
The `-> setvar o` comes from the `class -> setvar` production. Now we are
stuck, because we can both reduce with this production (which gives us a
`cv` node) and shift a comma to continue with the first production. This is
a shift-reduce conflict, and lark at this point will throw away the reduce
option and shift here, leading to
```
-> { <. setvar , o setvar >. | wff }
all setvar -> o rules
```
which is not correct, as we have lost the ability to parse `{ <. x , y >.
}`.
What I do instead to resolve this is "pre-compositing" the rules. We first
got in trouble at
```
-> { <. setvar o , setvar >. | wff }
-> setvar o
```
which is a "bad state" because of the shift-reduce conflict. We want to
remove the reduce node, and we do so by backing up to see how we got here.
We obtained this state by `shift setvar` applied to
```
-> { <. o setvar , setvar >. | wff }
-> <. o class , class >.
all setvar -> o rules
all class -> o rules
```
and we want to repair this state so that we don't hit the train wreck one
step from here. So we delete the offending rule `-> o setvar` and add the
composition of `class -> setvar` with `class -> <. class , class >.` as a
new "composite rule" which looks like a production `class -> <. setvar ,
class >.`, so that the "before" state instead looks like
```
-> { <. o setvar , setvar >. | wff }
-> <. o class , class >.
-> <. o setvar , class >.
all setvar -> o rules
all class -> o rules except -> o setvar
```
and we `shift setvar` from here instead, getting to
```
-> { <. setvar o , setvar >. | wff }
-> <. setvar o , class >.
```
and we have safely cleared the conflict. (The modified "before" state is
not a real state, it is only used to calculate this new state. This state
is replacing the original shift-reduce bad state as the result of `shift
setvar` applied to
```
-> { <. o setvar , setvar >. | wff }
-> <. o class , class >.
all setvar -> o rules
all class -> o rules
```
.)
To finish the example off, let's make it to the end. The next step is
`shift ,` which takes us to
```
-> { <. setvar , o setvar >. | wff }
-> <. setvar , o class >.
all setvar -> o rules
all class -> o rules
```
(and `shift y` takes us to the simple `-> y o` state, so we plan to reduce
there and come back here with `shift setvar`), and `shift setvar` from here
takes us to
```
-> { <. setvar , setvar o >. | wff }
-> setvar o
```
which is another shift reduce conflict. Again, we analyze the conflict to
find out what to composite. We want to apply the `class -> setvar`
*production here, which was considered one step ago because closure over
`-> <. setvar , o class >.` required us to add the `-> o setvar` production
to the state. So we composite `class -> <. setvar , class >.` and `class ->
setvar` to get a new `class -> <. setvar , setvar >.` production, create a
temporary modified version of the previous state
```
-> { <. setvar , o setvar >. | wff }
-> <. setvar , o class >.
-> <. setvar , o setvar >.
all setvar -> o rules
all class -> o rules except -> o setvar
```
and use it to calculate the new result of `shift setvar`, which is
```
-> { <. setvar , setvar o >. | wff }
-> <. setvar , setvar o >.
```
and we have cleared another shift reduce conflict. There is still one more
to go, though, since the next step is `shift >.` which takes us to
```
-> { <. setvar , setvar >. o | wff }
-> <. setvar , setvar >. o
```
which is again a shift reduce conflict. Now we must backtrack a lot,
because we have to go back 5 steps (the length of the current reduce
candidate) to find out which production required us to put `-> <. setvar ,
setvar >.` into the mix. In fact, five steps ago this production was not
even `-> <. setvar , setvar >.` at all but rather `-> <. class , class >.`
but this makes no difference, we need two productions to do the
compositing. This is the first state I posted, which looks like
```
-> { o <. setvar , setvar >. | wff }
-> { o class }
all class -> o rules
```
where among the `class` rules is `-> o <. class , class >.`. The reason the
class rules are there is because of the `-> { o class }` rule, so we
composite `class -> { class }` with `class -> <. setvar , setvar >.` to
get the temporary state
```
-> { o <. setvar , setvar >. | wff }
-> { o class }
-> { o <. setvar , setvar >. }
all class -> o rules except -> o <. class , class >.
```
and now shift 5 steps forward along `<. setvar , setvar >.` to get the
repaired state
```
-> { <. setvar , setvar >. o | wff }
-> { <. setvar , setvar >. o }
```
Now we have finally cleared the last hurdle, as we can clearly now either
`shift |` or `shift }` depending on what we see next to parse both `{ <. x
, y >. | ph }` and `{ <. x , y >. }`. For the purpose of the example let's
say we shift `}` so we get to state
```
-> { <. setvar , setvar >. } o
```
and a reduce is unambiguous.
But what are we reducing anyway? I've been talking about compositing rules,
and what I haven't been showing here is that each production is associated
to a partial expression tree. You can imagine them as lambda expressions.
The original rules from the grammar will have a result like `\e1 e2. (cpr
e1 e2)` for the production `class -> <. class , class >.`, which is to say
that we take the two class expressions in the brackets and put them into
the two arguments of a `cpr` node. The arguments aren't always in parse
order, for example I think `wal` takes its arguments in the order `wal ph
x` (because the `$f` variable declaration of `vx` comes after `wph`), so
the production `wff -> A. setvar wff` has result `\e1 e2. (wal e2 e1)`.
Now compositing rules has the effect of a composition of these two
expressions. In the first part we composited `class -> <. class , class >.`
with `class -> setvar`, with associated results `\e1 e2. (cpr e1 e2)` and
`\e1. (cv e1)`, so we insert the `cv` expression in for `e1` of the `cpr`
expression to get a new result `\e1 e2. (cpr (cv e1) e2)` for the new
production `class -> <. setvar , class >.`. Similarly, the production
`class -> <. setvar , setvar >.` is formed by inserting `cv` in for `e2` in
this new production, resulting in `\e1 e2. (cpr (cv e1) (cv e2))`. And
finally, we composited this expression with the `class -> { class }`
production with result `\e1. (csn e1)`, and this composition yields, for
the composite rule `class -> { <. setvar , setvar >. }`, the result `\e1
e2. (csn (cpr (cv e1) (cv e2)))`. This is what we reduce with.
So for the example `{ <. x , y >. }`, we first reduce using `setvar -> x :=
vx`, then `setvar -> y := vy`, then `class -> { <. setvar , setvar >. } :=
\e1 e2. (csn (cpr (cv e1) (cv e2)))` to get the final parse tree `(csn (cpr
(cv vx) (cv vy)))`.
--
You received this message because you are subscribed to the Google Groups
"Metamath" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/metamath/CAFXXJSta23XJ1nhGbKe4eH1cXwFg0D2KT%3DTdhWr%2BhB32ZmAM6Q%40mail.gmail.com.