On Fri, May 13, 2022 at 8:15 PM 'Philip White' via Metamath <
[email protected]> wrote:

> Thanks for the depth of the responses!
>
> A way that I'm used to writing parsers is to first write a lexer that
> operates on the part of the syntax that falls into the "regular" class
> of grammars. The main metamath syntax fits this neatly, including
> comments. However, once you try to understand the contents of those
> comments, the only part of the syntax that remains regular is single
> characters as tokens, I think. To illustrate this, note that "latexdef"
> is an ordinary symbol token in the main metamath syntax, but it is a
> keyword in $t comment syntax.
>

You should parse the comment twice with two entirely separate parsers.
Using a single multi-mode parser is going to be very messy, especially
because one parser has absolute veto power over the other.


> I'm glad you wrote down the grammar for comments; I was looking for
> that at one point. The grammar does use characters as tokens, so it
> doesn't solve the issue (I know you weren't saying that it did).
>

I'm not sure what you mean by "solve the issue". That's simply what the
grammar is; at the very least it allows semicolons with no surrounding
spaces, so the grammar has to account for that.

> My first pass at this has been to parse the main metamath syntax, and
> then run a totally separate parser on the string contents of each
> comment.
>

Good. Do that. It will be good for your sanity.
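
For concreteness, here is a rough sketch of that two-pass structure (in
Python, with made-up names; a real tool should reuse its own metamath
tokenizer rather than a regex). The outer pass only extracts comment
bodies; the inner tokenizer is specific to $t comments, where ';' need
not be surrounded by spaces:

    import re

    # Rough sketch: the outer pass only pulls out the text of each
    # $( ... $) comment; the inner tokenizer handles $t comments.
    COMMENT_RE = re.compile(r"\$\(\s(.*?)\s\$\)", re.DOTALL)

    def comment_bodies(source):
        """First pass: the raw text of every comment, nothing else."""
        return (m.group(1) for m in COMMENT_RE.finditer(source))

    def t_comment_tokens(body):
        """Second pass: tokens of a $t comment, splitting ';' off adjacent
        words.  Quoted strings and /* ... */ comments are ignored here,
        which a real $t parser cannot afford to do."""
        return re.findall(r";|[^\s;]+", body)

    # e.g. collect the token streams of all $t comments in a database
    with open("set.mm") as f:
        for body in comment_bodies(f.read()):
            if body.lstrip().startswith("$t"):
                tokens = t_comment_tokens(body)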

> However, this won't hold up now that I'm starting to parse
> the markup of ordinary comments, since math mode requires switching
> back to the ordinary metamath tokens.
>

Oh... the math markup syntax is a whole separate mess. The issue there is
that the parser is a bunch of layered regexes, so it has no nice grammar at
all, at least if you care about the edge cases. The recent discussion
https://groups.google.com/g/metamath/c/kdbrroGzJnc/m/7PCMZGgSAQAJ
chronicles my attempt to write a fully accurate markup parser.

Ignoring the edge cases, I think the most reasonable way to understand
markup parsing is as a language with a string-quoting operator. The body
text is parsed at the character level, but a backtick starts a quoted math
section, and you use the metamath tokenizer (split on whitespace) to get the
tokens out (after replacing `` with `). I don't think a CFG would help here.
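
As an illustration only (a sketch, not the behavior of any existing tool),
the quoting view looks roughly like this: scan the comment character by
character, and when a math section opens, collect everything up to the
closing backtick, un-double any `` escapes, and split on whitespace to get
the math tokens:

    def split_markup(comment):
        """Rough sketch: split a markup comment into ('text', str) and
        ('math', [tokens]) segments.  A doubled backtick `` stands for a
        literal ` inside math mode; the known edge cases are ignored."""
        segments = []
        i = 0
        while i < len(comment):
            j = comment.find("`", i)
            if j == -1:
                segments.append(("text", comment[i:]))
                break
            if j > i:
                segments.append(("text", comment[i:j]))
            # scan for the closing backtick, skipping `` escapes
            k = j + 1
            chunk = []
            while k < len(comment):
                if comment[k] == "`":
                    if k + 1 < len(comment) and comment[k + 1] == "`":
                        chunk.append("`")          # `` -> literal `
                        k += 2
                        continue
                    break                          # closing backtick
                chunk.append(comment[k])
                k += 1
            segments.append(("math", "".join(chunk).split()))
            i = k + 1
        return segments

    # split_markup("the term ` ( A + B ) ` reduces ...") gives
    # [('text', 'the term '), ('math', ['(', 'A', '+', 'B', ')']),
    #  ('text', ' reduces ...')]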


> A solution that seems to me more elegant (though I haven't tried it
> yet) is to have a single parser, but multiple lexing functions. The
> parser calls the appropriate lexer. Each lexer reads the
> regular part of its respective "sub-syntax", and the parser decides
> when to switch between sub-languages.
>

The different parsers really have nothing in common. They are just
completely separate languages, one hiding out in the legal comment syntax
of the other.

> At several points in this process, I've found myself wishing that the
> minimalism of the main metamath syntax was preserved in the comment
> syntax. For example, the $t comment section could have used $latexdef
> and $htmldef as keywords, and instead of C-style comments, allow actual
> nested comments. The property of the syntax being a sequence of
> space-delimited symbols plus keywords that start with $ is refreshingly
> simple, and it would have been nice to work with it in comments.
> Anyway, I don't mean to criticize from an armchair the folks who
> implemented this system; it would be rather difficult to change at this
> point, I would think.
>

I wasn't involved in the original decision-making here, but if I had to
guess, the reason is that we don't want the $ to confuse a metamath parser.
Keep in mind that for the vast majority of metamath parsers, comments are
just comments; no parsing is required. In particular, nested comments would
impose a burden on simplistic verifiers, so using a different style of
comment seems like a good idea to me. (I wouldn't mind spacing things out a
bit more though, to keep the same simple lexer.)


> Another question about comments: what is a good heuristic for deciding
> when a comment is attached to a statement? My first guess has been to
> attach the immediately preceding comment to each theorem or axiom, if
> there is one. This seems to yield good results, but I noticed that
> set.mm sometimes has comments on the same line, directly after the
> statement (this is especially common for constant and variable
> declarations). I doubt that attaching a comment immediately after a
> statement would be good, unless I make the parser be
> whitespace-sensitive so that I can distinguish when it is on the same
> line as the declaration.
>

The official definition is that a markup comment immediately precedes a
labeled statement ($f, $a, and $p; I don't think you can comment $e). $c and
$v statements don't have tooling-recognized comments, so you should read
those comments as analogous to single-line non-doc comments. If you would
like to parse these comments, one reasonable definition is that the comment
should come immediately after the $c or $v statement without an intervening
newline. (Again, "unless I make the parser be whitespace-sensitive so that
I can distinguish when it is on the same line as the declaration." sounds
like you are trying to make one parser to rule them all and this will not
end well. Markup parsing is substantially more heuristic than everything
else, likely because it arose organically from unofficial formatting
conventions.)
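
If it helps, here is a sketch of that heuristic in code (Python, with
hypothetical item types; your parser will have its own representation,
and "line" here means the line on which the item starts):

    from dataclasses import dataclass
    from typing import Optional

    # Hypothetical item types; a real parser would supply its own, with
    # line numbers recorded while tokenizing.
    @dataclass
    class Comment:
        text: str
        line: int                     # line on which the comment starts

    @dataclass
    class Statement:
        keyword: str                  # "$a", "$p", "$f", "$c", "$v", ...
        label: str
        line: int                     # line on which the statement starts
        comment: Optional[str] = None

    def attach_comments(items):
        """Walk comments and statements in file order.  A comment just
        before a $f/$a/$p statement documents it; a comment starting on
        the same line as a $c/$v statement is kept as a non-doc note.
        ("Same line" is judged against the statement's first line, which
        is a simplification.)"""
        pending = None
        prev_stmt = None
        for item in items:
            if isinstance(item, Comment):
                if (prev_stmt is not None
                        and prev_stmt.keyword in ("$c", "$v")
                        and item.line == prev_stmt.line):
                    prev_stmt.comment = item.text   # trailing note
                else:
                    pending = item                  # candidate doc comment
            else:
                if item.keyword in ("$f", "$a", "$p") and pending:
                    item.comment = pending.text
                pending = None
                prev_stmt = item
        return items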

Mario
