Not all comments are markup comments. It is up to you how to handle this: you could parse every comment as markup and give up when it doesn't parse, or only try to parse comments in particular positions as markup. The latter is what mm-knife does: comments immediately before a labeled statement, as well as header comments, are parsed as markup; all others are ignored.
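The positional heuristic above can be sketched as follows. The item representation and function names here are assumptions for illustration, not mm-knife's actual code, and the sketch covers only the "immediately before a labeled statement" half of the heuristic (header comments would need a separate check):

```python
# A comment counts as markup only when the next statement is labeled.
LABELED_KEYWORDS = {"$f", "$a", "$p"}  # statements that can carry markup

def markup_comments(items):
    """items: ("comment", text) or ("statement", keyword) pairs in file
    order.  Returns the texts of comments treated as markup, i.e. those
    immediately followed by a labeled statement."""
    result = []
    for i, (kind, value) in enumerate(items):
        if kind != "comment":
            continue
        nxt = items[i + 1] if i + 1 < len(items) else None
        if nxt is not None and nxt[0] == "statement" and nxt[1] in LABELED_KEYWORDS:
            result.append(value)
    return result
```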
On Wed, May 18, 2022 at 9:31 PM 'Philip White' via Metamath
<[email protected]> wrote:

> Maybe I'm running into issues that were mentioned in the email thread
> you linked to (https://groups.google.com/g/metamath/c/kdbrroGzJnc/m/7PCMZGgSAQAJ),
> but what's the right thing to do about this comment text that shows up
> in set.mm:
>
> @( The cardinality of a set is a cardinal number. @)
> cardel @p |- ( card ` A ) e. Card @=
>   ( ccrd cfv ccdn wcel wceq cardidm elcard mpbir ) ABCZDEJBCJFAGJHI @.
> @( [16-Jan-2005] @) @( [23-Sep-2004] @)
>
> It's a theorem with backticks in it that got commented out, but the
> backticks weren't doubled. My parser thinks that this backtick should
> start math mode, and that seems right? I wouldn't mind sending a PR to
> fix the places that break this rule, if that is desirable.
>
> On Sat, 14 May 2022 00:55:40 -0400
> Mario Carneiro <[email protected]> wrote:
>
> > On Fri, May 13, 2022 at 8:15 PM 'Philip White' via Metamath
> > <[email protected]> wrote:
> >
> > > Thanks for the depth of the responses!
> > >
> > > A way that I'm used to writing parsers is to first write a lexer
> > > that operates on the part of the syntax that falls into the
> > > "regular" class of grammars. The main metamath syntax fits this
> > > neatly, including comments. However, once you try to understand the
> > > contents of those comments, the only part of the syntax that
> > > remains regular is single characters as tokens, I think. To
> > > illustrate this, note that "latexdef" is an ordinary symbol token
> > > in the main metamath syntax, but it is a keyword in $t comment
> > > syntax.
> >
> > You should parse the comment twice with two entirely separate parsers.
> > Using a single multi-mode parser is going to be very messy, especially
> > because one parser has absolute veto power over the other.
> >
> > > I'm glad you wrote down the grammar for comments; I was looking for
> > > that at one point.
> > > The grammar does use characters as tokens, so it
> > > doesn't solve the issue (I know you weren't saying that it did).
> >
> > I'm not sure what you mean by "solve the issue". That's what the
> > grammar is; it plainly allows no spaces around at least semicolons, so
> > the grammar has to account for that.
> >
> > > My first pass at this has been to parse the main metamath syntax, and
> > > then run a totally separate parser on the string contents of each
> > > comment.
> >
> > Good. Do that. It will be good for your sanity.
> >
> > > However, this won't hold up now that I'm starting to parse
> > > the markup of ordinary comments, since math mode requires switching
> > > back to the ordinary metamath tokens.
> >
> > Oh... the math markup syntax is a whole separate mess. The issue
> > there is that the parser is a bunch of layered regexes, so it has no
> > nice grammar at all, at least if you care about the edge cases. The
> > recent discussion
> > https://groups.google.com/g/metamath/c/kdbrroGzJnc/m/7PCMZGgSAQAJ
> > chronicles my attempt to write a fully accurate markup parser.
> >
> > Ignoring the edge cases, I think the most reasonable way to understand
> > markup parsing is just as having a string-quoting operator. The body
> > text is parsed on a character level, but `foo` starts a quoted math
> > section, and you use the metamath parser (split on space) to get the
> > tokens out (after replacing `` with `). I don't think using a CFG
> > grammar here would help.
> >
> > > A solution that seems to me more elegant (though I haven't tried it
> > > yet) is to have a single parser, but multiple lexing functions. The
> > > parser calls the appropriate lexer. Each lexer reads the regular
> > > part of its respective "sub-syntax", and the parser decides when to
> > > switch between sub-languages.
> >
> > The different parsers really have nothing in common.
> > It's just
> > completely separate languages with one hiding out in the legal
> > comment syntax of the other.
> >
> > > At several points in this process, I've found myself wishing that the
> > > minimalism of the main metamath syntax was preserved in the comment
> > > syntax. For example, the $t comment section could have used
> > > $latexdef and $htmldef as keywords, and instead of C-style
> > > comments, allow actual nested comments. The property of the syntax
> > > being a sequence of space-delimited symbols plus keywords that
> > > start with $ is refreshingly simple, and it would have been nice to
> > > work with it in comments. Anyway, I don't mean to criticize from an
> > > armchair the folks who implemented this system; it would be rather
> > > difficult to change at this point, I would think.
> >
> > I wasn't involved in the original decision making here, but if I had to
> > guess, the reason is that we don't want the $ to confuse a metamath
> > parser. Keep in mind that for the vast majority of metamath parsers,
> > comments are just comments; no parsing is required. In particular,
> > nested comments would impose a burden on simplistic verifiers, so
> > using a different style of comment seems like a good idea to me. (I
> > wouldn't mind spacing things out a bit more though, to keep the same
> > simple lexer.)
> >
> > > Another question about comments: what is a good heuristic for
> > > deciding when a comment is attached to a statement? My first guess
> > > has been to attach the immediately preceding comment to each
> > > theorem or axiom, if there is one. This seems to yield good
> > > results, but I noticed that set.mm sometimes has comments on the
> > > same line, directly after the statement (this is especially
> > > common for constant and variable declarations).
> > > I doubt that
> > > attaching a comment immediately after a statement would be good,
> > > unless I make the parser be whitespace-sensitive so that I can
> > > distinguish when it is on the same line as the declaration.
> >
> > The official definition is that a markup comment immediately precedes
> > a labeled statement ($f, $a, and $p; I don't think you can comment $e).
> > $c and $v statements don't have tooling-recognized comments, so you
> > should read those comments as analogous to single-line non-doc
> > comments. If you would like to parse these comments, one reasonable
> > definition is that the comment should come immediately after the $c
> > statement without an intervening newline. (Again, "unless I make the
> > parser be whitespace-sensitive so that I can distinguish when it is
> > on the same line as the declaration" sounds like you are trying to
> > make one parser to rule them all, and this will not end well. Markup
> > parsing is substantially more heuristic than everything else, likely
> > because it arose organically from unofficial formatting conventions.)
> >
> > Mario
>
> --
> You received this message because you are subscribed to the Google Groups
> "Metamath" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/metamath/20220518190056.20061d14%40ithilien.

--
You received this message because you are subscribed to the Google Groups
"Metamath" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/metamath/CAFXXJSsacv8-DxUDWPFYF9fhZmDmbhymWQt_MDFPdrgV8%2BXrmw%40mail.gmail.com.
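The "string-quoting operator" view of markup parsing described in the thread (a backtick opens a math section, a doubled backtick stands for a literal backtick, and the math text is tokenized by splitting on whitespace) can be sketched as follows. This is a minimal illustration under those assumptions only; the function name is invented, and the edge cases mentioned in the thread are ignored:

```python
def split_markup(comment):
    """Split a markup comment into ("text", string) and ("math", tokens)
    parts.  A backtick opens a math section, `` inside it is an escaped
    literal backtick, and math tokens are whitespace-separated."""
    parts = []
    pos = 0
    while True:
        open_tick = comment.find("`", pos)
        if open_tick == -1:
            if pos < len(comment):
                parts.append(("text", comment[pos:]))
            return parts
        if open_tick > pos:
            parts.append(("text", comment[pos:open_tick]))
        # Find the closing backtick, skipping doubled (escaped) ones.
        i = open_tick + 1
        while True:
            close = comment.find("`", i)
            if close == -1:
                raise ValueError("unterminated math section")
            if comment[close:close + 2] == "``":
                i = close + 2  # `` is an escaped literal backtick
                continue
            break
        math = comment[open_tick + 1:close].replace("``", "`")
        parts.append(("math", math.split()))
        pos = close + 1
```

On the theorem quoted in the thread, this would turn a comment like "The cardinality ` ( card `` A ) ` is a cardinal." into one text part, the math tokens `(`, `card`, `` ` ``, `A`, `)`, and a closing text part, which is exactly why the undoubled backtick in the commented-out proof trips up a strict parser: it opens a math section that never closes.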
