Thanks for the depth of the responses!

A way that I'm used to writing parsers is to first write a lexer that
operates on the part of the syntax that falls into the "regular" class
of grammars. The main metamath syntax fits this neatly, including
comments. However, once you try to understand the contents of those
comments, I think the only tokenization that remains regular is the
one that treats single characters as tokens. To illustrate this, note
that "latexdef"
is an ordinary symbol token in the main metamath syntax, but it is a
keyword in $t comment syntax.

I'm glad you wrote down the grammar for comments; I was looking for
that at one point. The grammar does use characters as tokens, so it
doesn't solve the issue (I know you weren't saying that it did). I was
able to figure out the $t definition types through reading the book and
trial and error, but I haven't even tried to think about $j comments
yet (not sure if I will).

My first pass at this has been to parse the main metamath syntax, and
then run a totally separate parser on the string contents of each
comment. However, this won't hold up now that I'm starting to parse
the markup of ordinary comments, since math mode requires switching
back to the ordinary metamath tokens.

A solution that seems more elegant to me (though I haven't tried it
yet) is to have a single parser but multiple lexing functions, with
the parser calling whichever lexer is appropriate at each point. Each
lexer reads the regular part of its respective "sub-syntax", and the
parser decides when to switch between sub-languages.
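
For concreteness, here is a rough sketch of the shape I have in mind,
in Python (all the names are made up for illustration, and I haven't
tested this):

    class Scanner:
        """A single cursor over the input, shared by several lexers."""

        def __init__(self, text):
            self.text, self.pos = text, 0

        def _skip(self, pred):
            while self.pos < len(self.text) and pred(self.text[self.pos]):
                self.pos += 1

        def lex_metamath(self):
            """One whitespace-delimited token of the main syntax."""
            self._skip(str.isspace)
            start = self.pos
            self._skip(lambda c: not c.isspace())
            return self.text[start:self.pos] or None

        def lex_fancy(self):
            """One token of the $t/$j sub-syntax: a quoted string, ';',
            or a keyword.  (C-style comment skipping omitted here.)"""
            self._skip(str.isspace)
            if self.pos >= len(self.text):
                return None
            c = self.text[self.pos]
            if c == ';':
                self.pos += 1
                return ';'
            if c in '\'"':  # strings have no escapes
                end = self.text.index(c, self.pos + 1)
                tok, self.pos = self.text[self.pos:end + 1], end + 1
                return tok
            start = self.pos
            self._skip(lambda ch: not ch.isspace() and ch not in ';\'"')
            return self.text[start:self.pos]

    def parse_comment(sc):
        """The parser, not the lexer, decides which sub-language is
        active and calls the matching lexing function."""
        tok = sc.lex_metamath()
        if tok in ('$t', '$j'):
            while (tok := sc.lex_fancy()) not in (None, '$)'):
                pass  # dispatch on the fancy tokens here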

At several points in this process, I've found myself wishing that the
minimalism of the main metamath syntax was preserved in the comment
syntax. For example, the $t comment section could have used $latexdef
and $htmldef as keywords, and instead of C-style comments, allow actual
nested comments. The property of the syntax being a sequence of
space-delimited symbols plus keywords that start with $ is refreshingly
simple, and it would have been nice to work with it in comments.
Anyway, I don't mean to criticize the folks who implemented this
system from an armchair; it would be rather difficult to change at
this point, I would think.

Another question about comments: what is a good heuristic for deciding
when a comment is attached to a statement? My first guess has been to
attach the immediately preceding comment to each theorem or axiom, if
there is one. This seems to yield good results, but I noticed that
set.mm sometimes has comments on the same line, directly after the
statement (this is especially common for constant and variable
declarations). I doubt that attaching comments that immediately
follow a statement would be a good idea, unless I make the parser
whitespace-sensitive so that I can distinguish when a comment is on
the same line as the declaration.
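
Concretely, the heuristic I have now looks roughly like this (the
node types are hypothetical, just to show the shape):

    def attach_comments(items):
        """Attach each comment to the statement that directly follows
        it.  'items' is the parsed database as a list of Comment and
        Statement nodes in source order."""
        pending = None
        for item in items:
            if isinstance(item, Comment):
                pending = item      # a later comment replaces an earlier one
            else:
                item.doc = pending  # may be None
                pending = None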

- Philip

On Thu, 12 May 2022 20:16:01 -0400
Mario Carneiro <[email protected]> wrote:

> Parsing comments as a metamath verifier is not very hard, and does not
> require any special tokenization. This is what Benoit is referring to.
> However, if you want to parse $t or $j comments, you will need a
> specialized parser for those, and it is true that the tokenization
> rules are different, kind of C style. The most accurate $t/$j parser
> I am aware of is
> https://github.com/david-a-wheeler/metamath-knife/blob/360e7b1/src/parser.rs#L809-L933
> , but I think you can do the same thing with a more traditional parser
> generator style. The summary is: fancy comments are a subset of the
> regular comment grammar. (Regular comments, parsed with the standard
> metamath lexer (i.e. everything is whitespace-separated), have the form
> '$(' token* '$)' where no token contains an embedded occurrence of
> '$)' or '$('.) The first token in a fancy comment must be '$t' or
> '$j'. The remainder of the comment is lexed as follows:
> 
> fancy-comment ::= '$(' ws ('$t' | '$j') (ws token)* ws '$)'
> ws ::= (' ' | '\t' | '\n' | '\r' | '/*' .* '*/')*
> string ::= '[^']*' | "[^"]*"
> keyword ::= [^\s;'"]+
> element ::= string | keyword
> token ::= element | ';'
> 
> In other words, inside fancy comments there are C style comments /*
> like this */, counted as part of whitespace that can appear between
> anything, and then what remains is separated into strings (which can
> be delimited by ' or " but have no escapes), "keywords" (any other
> unquoted thing) and semicolon (which delimits commands). The output
> of the lexing phase is the initial '$t' or '$j' and the sequence of
> tokens. It can be further separated into commands using the grammar:
> 
> command ::= keyword (string | keyword)* ';'
> fancy-comment ::= ('$t' | '$j') command*
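> 
> A hedged sketch of this lexing-plus-grouping phase in Python (the
> function names here are mine, not from metamath-knife):
> 
>     import re
> 
>     # One alternative per token class; whitespace and /* ... */
>     # comments are matched so they can be skipped.
>     TOKEN = re.compile(
>         r"(?P<ws>(?:\s|/\*.*?\*/)+)"
>         r"|(?P<str>'[^']*'|\"[^\"]*\")"   # strings have no escapes
>         r"|(?P<semi>;)"
>         r"""|(?P<kw>[^\s;'"]+)""",        # any other unquoted run
>         re.DOTALL)
> 
>     def lex_fancy(body):
>         """Yield the tokens of a $t/$j comment body (the text between
>         '$t'/'$j' and the closing '$)')."""
>         pos = 0
>         while pos < len(body):
>             m = TOKEN.match(body, pos)
>             if m is None:
>                 raise ValueError(f"unterminated string at offset {pos}")
>             pos = m.end()
>             if m.lastgroup != 'ws':
>                 yield m.group(0)
> 
>     def split_commands(tokens):
>         """Group the token stream into ';'-terminated commands."""
>         cmd = []
>         for tok in tokens:
>             if tok == ';':
>                 yield cmd
>                 cmd = []
>             else:
>                 cmd.append(tok)
>         if cmd:
>             raise ValueError("command not terminated by ';'")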
> 
> At this point you can start interpreting the commands. Each command
> has an initial keyword which specifies the grammar of the rest. The
> '$t' and '$j' commands differ in that '$t' is a closed class - there
> is an explicit list of enumerated valid commands - while '$j' is
> designed as extensible and parsers should ignore any unrecognized
> commands. The '$t' commands and their grammars are:
> 
> sum ::= string ('+' string)*
> typesetting-command ::=
>   ('htmldef' | 'althtmldef' | 'latexdef') string 'as' sum ';' |
>   'htmlvarcolor' sum ';' |
>   'htmltitle' sum ';' |
>   'htmlhome' sum ';' |
>   'exthtmltitle' sum ';' |
>   'exthtmlhome' sum ';' |
>   'exthtmllabel' sum ';' |
>   'htmldir' sum ';' |
>   'althtmldir' sum ';' |
>   'htmlbibliography' sum ';' |
>   'exthtmlbibliography' sum ';' |
>   'htmlcss' sum ';' |
>   'htmlfont' sum ';' |
>   'htmlexturl' sum ';'
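> 
> As a concrete (and hedged) illustration of interpreting these, a
> handler for the command lists produced above might look like the
> following; 'unquote', 'parse_sum', and 'eval_t_command' are names I
> made up:
> 
>     def unquote(tok):
>         """Strip the delimiting quotes from a string token."""
>         assert len(tok) >= 2 and tok[0] in '\'"' and tok[-1] == tok[0]
>         return tok[1:-1]
> 
>     def parse_sum(toks):
>         """sum ::= string ('+' string)* -- concatenate the pieces."""
>         out = [unquote(toks[0])]
>         rest = toks[1:]
>         while rest:
>             if rest[0] != '+' or len(rest) < 2:
>                 raise ValueError("malformed sum")
>             out.append(unquote(rest[1]))
>             rest = rest[2:]
>         return ''.join(out)
> 
>     def eval_t_command(cmd):
>         """Interpret one ';'-delimited $t command (a token list)."""
>         head = cmd[0]
>         if head in ('htmldef', 'althtmldef', 'latexdef'):
>             if cmd[2] != 'as':
>                 raise ValueError(f"expected 'as' after {head}")
>             return (head, unquote(cmd[1]), parse_sum(cmd[3:]))
>         if head in ('htmltitle', 'htmlhome', 'htmlvarcolor'):  # etc.
>             return (head, parse_sum(cmd[1:]))
>         raise ValueError(f"unknown $t command {head!r}")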
> 
> The list of '$j' commands is open ended as mentioned, but some
> commands you will find in set.mm are:
> 
> extra-command ::=
>   'syntax' string ('as' string)? ';' |
>   'unambiguous' string ';' |
>   'primitive' string* ';' |
>   'justification' string 'for' string ';' |
>   'equality' string 'from' string string string ';' |
>   'definition' string 'for' string ';' |
>   'congruence' string* ';' |
>   'bound' string ';' |
>   'natded_init' string string string ';' |
>   ('natded_assume' | 'natded_weak' | 'natded_cut') string* ';' |
>   ('natded_true' | 'natded_imp' | 'natded_and' | 'natded_or' |
>    'natded_not') string+ 'with' string* ';' |
>   ('free_var' | 'free_var_in') string 'with' string* ';' |
>   ('condequality' | 'notfree') string 'from' string ';' |
>   'condcongruence' string* ';' |
>   'restatement' string 'of' string ';' |
>   'garden_path' element+ '=>' element+ ';' |
>   ...
> 
> I don't know any verifier or proof assistant that recognizes all of
> these; they were added at different times for different tools and
> set.mm has historically been pretty okay with adding various kinds of
> metadata to support specialized analyses.
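> 
> In practice, then, a $j consumer can dispatch on the head keyword
> and skip anything unknown.  A minimal sketch, with a caller-supplied
> handler table:
> 
>     def eval_j_command(cmd, handlers):
>         """Run the handler registered for this command's head
>         keyword, and silently ignore unregistered commands, since
>         $j is open-ended by design."""
>         handler = handlers.get(cmd[0])
>         if handler is not None:
>             handler(cmd[1:])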
> 
> 
> 
> On Thu, May 12, 2022 at 7:21 PM Benoit <[email protected]> wrote:
> 
> > The EBNF in the Metamath book differs slightly from metamath.c in
> > that "$( $( $)" returns an error in metamath.c but is accepted by
> > the EBNF grammar and by mmverify.py. I opened an issue there (
> > https://github.com/metamath/metamath-book/issues/233).
> >
> > I clarified readc() in mmverify.py a bit, and I wrote a similar
> > function for my OCaml verifier, which can forbid the above example,
> > like metamath.c does.  I copy both below:
> > -----------------------------------
> >     def readc(self) -> StringOption:
> >         """Read the next token once included files have been
> >         expanded and comments have been skipped.
> >         """
> >         tok = self.readf()
> >         while tok == '$(':
> >             # Note that we use 'read' in this while-loop, and not
> >             # 'readf', since inclusion statements within a comment
> >             # are still comments so should be skipped.
> >             # The following line is not necessary but makes things
> >             # clearer; note the similarity with the first three
> >             # lines of 'readf'.
> >             tok = self.read()
> >             while tok and tok != '$)':
> >                 tok = self.read()
> >             if not tok:
> >                 raise MMError("Unclosed comment at end of file.")
> >             assert(tok == '$)')
> >             tok = self.read()
> >         vprint(70, "Token once comments skipped:", tok)
> >         return tok
> > -----------------------------------
> > (** Read a comment (and ignore its content, so does not need to be
> > passed a [parse_env] as argument).  Comments should not nest, so
> > opening-comment tokens are ignored. *)
> > let read_comment ic =
> >   Printf.printf "Reading a comment.\n";
> >   let rec read_comment_content ic =
> >     match read_token ic with
> >     | Some CloseComm -> Printf.printf "End of comment\n"
> >     | Some OpenComm ->
> >         Printf.printf "Comments should not nest.  Ignoring.\n";
> >         read_comment_content ic
> >     | Some _ -> read_comment_content ic
> >     | None ->
> >         Printf.printf "File ended while reading comment.  Ignoring.\n"
> >   in
> >   read_comment_content ic
> > -----------------------------------
> >
> > BenoƮt
> >
> >
> > On Friday, May 13, 2022 at 12:39:25 AM UTC+2 Philip White wrote:
> >  
> >> I'm working on implementing comment parsing in my verifier. It
> >> doesn't seem to lend itself well to elegant implementations,
> >> however, since the token structure of comments seems to be
> >> different than that of non-comment text. In addition, the token
> >> rules of $t comments and ordinary comments seem to be different
> >> (for example, normal comments don't have C-style comments, but
> >> $t comments do).
> >>
> >> I can certainly figure this out by handling all the special cases
> >> and bolting together a few different parsers, but I'm wondering if
> >> anyone has found a nice technique for this.
> >>
> >> - Philip
> >>  
> 
