Parsing comments as a metamath verifier is not very hard, and does not
require any special tokenization. This is what Benoit is referring to.
However, if you want to parse $t or $j comments, you will need a
specialized parser for those, and it is true that the tokenization rules
are different, somewhat C-style. The most accurate $t/$j parser I am aware
of is
https://github.com/david-a-wheeler/metamath-knife/blob/360e7b1/src/parser.rs#L809-L933
, but I think you can do the same thing with a more traditional parser
generator style. The summary is: fancy comments are a subset of the regular
comment grammar. (Regular comments, parsed with the standard metamath lexer
(i.e. whitespace separated everything) have the form '$(' token* '$)' where
no token contains an embedded occurrence of '$)' or '$('.) The first token
in a fancy comment must be '$t' or '$j'. The remainder of the comment is
lexed as follows:
fancy-comment ::= '$(' ws ('$t' | '$j') (ws token)* ws '$)'
ws ::= (' ' | '\t' | '\n' | '\r' | '/*' .* '*/')*
string ::= '[^']*' | "[^"]*"
keyword ::= [^;]+
element ::= string | keyword
token ::= element | ';'
In other words, inside fancy comments there are C-style comments /* like
this */, counted as whitespace that can appear between any two tokens,
and then what remains is separated into strings (which can be delimited by
' or " but have no escapes), "keywords" (any other unquoted thing), and
semicolons (which delimit commands). The output of the lexing phase is the
initial '$t' or '$j' and the sequence of tokens. It can be further
separated into commands using the grammar:
command ::= keyword (string | keyword)* ';'
fancy-comment ::= ('$t' | '$j') command*
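To make the two phases concrete, here is my own illustrative sketch in Python of a lexer and command splitter following the grammar above (not code from any of the verifiers mentioned; names and error handling are my own choices):

```python
import re

# One alternative per token class from the fancy-comment grammar.
# /* ... */ comments count as whitespace; strings have no escapes;
# a keyword is any other unquoted run of non-delimiter characters.
TOKEN_RE = re.compile(r"""
      (?P<ws>\s+|/\*.*?\*/)          # whitespace, incl. C-style comments
    | (?P<str>'[^']*'|"[^"]*")       # quoted string, no escapes
    | (?P<semi>;)                    # command terminator
    | (?P<kw>[^\s;'"]+)              # keyword
""", re.VERBOSE | re.DOTALL)

def lex_fancy(body):
    """Lex the body of a fancy comment (the text between the initial
    '$t'/'$j' and the closing '$)').  Yields (kind, text) pairs where
    kind is 'string', 'keyword' or ';'."""
    pos = 0
    while pos < len(body):
        m = TOKEN_RE.match(body, pos)
        if m is None:  # e.g. an unterminated quoted string
            raise ValueError("lex error at offset %d" % pos)
        pos = m.end()
        kind = m.lastgroup
        if kind == 'ws':
            continue
        elif kind == 'semi':
            yield (';', ';')
        elif kind == 'str':
            yield ('string', m.group(0)[1:-1])  # strip the quotes
        else:
            yield ('keyword', m.group(0))

def split_commands(tokens):
    """Group tokens into commands: keyword (string | keyword)* ';'."""
    cmd = []
    for kind, text in tokens:
        if kind == ';':
            if cmd:
                yield cmd
            cmd = []
        else:
            cmd.append((kind, text))
    if cmd:
        raise ValueError("unterminated command (missing ';')")
```

For example, `list(split_commands(lex_fancy("htmlcss 'style.css'; /* note */")))` yields one command, `[('keyword', 'htmlcss'), ('string', 'style.css')]`, with the trailing comment absorbed as whitespace.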
At this point you can start interpreting the commands. Each command has an
initial keyword which specifies the grammar of the rest. The '$t' and '$j'
commands differ in that '$t' is a closed class - there is an explicit list
of enumerated valid commands - while '$j' is designed as extensible and
parsers should ignore any unrecognized commands. The '$t' commands and
their grammars are:
sum ::= string ('+' string)*
typesetting-command ::=
('htmldef' | 'althtmldef' | 'latexdef') string 'as' sum ';' |
'htmlvarcolor' sum ';' |
'htmltitle' sum ';' |
'htmlhome' sum ';' |
'exthtmltitle' sum ';' |
'exthtmlhome' sum ';' |
'exthtmllabel' sum ';' |
'htmldir' sum ';' |
'althtmldir' sum ';' |
'htmlbibliography' sum ';' |
'exthtmlbibliography' sum ';' |
'htmlcss' sum ';' |
'htmlfont' sum ';' |
'htmlexturl' sum ';'
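As an illustration of how a '$t' interpreter might consume these, here is a hypothetical checker of my own for the 'htmldef' family, assuming commands arrive as lists of (kind, text) pairs produced by the lexing phase described above, with the trailing ';' already removed:

```python
# Hypothetical validator for ('htmldef' | 'althtmldef' | 'latexdef')
# string 'as' sum, where sum ::= string ('+' string)*.

def parse_sum(toks):
    """Parse string ('+' string)* and return the concatenated text."""
    if not toks or toks[0][0] != 'string':
        raise ValueError("expected a string")
    parts, rest = [toks[0][1]], toks[1:]
    while rest:
        if rest[0] != ('keyword', '+'):
            raise ValueError("expected '+' between strings")
        if len(rest) < 2 or rest[1][0] != 'string':
            raise ValueError("expected a string after '+'")
        parts.append(rest[1][1])
        rest = rest[2:]
    return ''.join(parts)

def parse_xdef(cmd):
    """Return (command, symbol, definition) for an htmldef-family command."""
    if (not cmd or cmd[0][0] != 'keyword'
            or cmd[0][1] not in ('htmldef', 'althtmldef', 'latexdef')):
        raise ValueError("not an htmldef-family command")
    if len(cmd) < 4 or cmd[1][0] != 'string' or cmd[2] != ('keyword', 'as'):
        raise ValueError("expected: %s string 'as' sum" % cmd[0][1])
    return (cmd[0][1], cmd[1][1], parse_sum(cmd[3:]))
```

The other '$t' commands all take a plain sum, so they reduce to a keyword check followed by parse_sum.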
The list of '$j' commands is open-ended, as mentioned, but some commands you
will find in set.mm are:
extra-command ::=
'syntax' string ('as' string)? ';' |
'unambiguous' string ';' |
'primitive' string* ';' |
'justification' string 'for' string ';' |
'equality' string 'from' string string string ';' |
'definition' string 'for' string ';' |
'congruence' string* ';' |
'bound' string ';' |
'natded_init' string string string ';' |
('natded_assume' | 'natded_weak' | 'natded_cut') string* ';' |
('natded_true' | 'natded_imp' | 'natded_and' | 'natded_or' |
'natded_not') string+ 'with' string* ';' |
('free_var' | 'free_var_in') string 'with' string* ';' |
('condequality' | 'notfree') string 'from' string ';' |
'condcongruence' string* ';' |
'notfree' string 'from' string ';' |
'restatement' string 'of' string ';' |
'garden_path' element+ '=>' element+ ';' |
...
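Since '$j' is extensible, a consumer can pick out only the commands it understands and skip the rest. A minimal sketch of my own (the handler table and the 'syntax' example are hypothetical, and commands are assumed to be lists of (kind, text) pairs as above):

```python
# Dispatch '$j' commands to handlers; unrecognized commands are
# silently ignored rather than rejected, as the format intends.

def process_j_commands(commands, handlers):
    """commands: one token list per ';'-terminated command.
    handlers: dict mapping a leading keyword to a function of the rest."""
    for cmd in commands:
        if not cmd or cmd[0][0] != 'keyword':
            raise ValueError("a command must start with a keyword")
        handler = handlers.get(cmd[0][1])
        if handler is not None:
            handler(cmd[1:])
        # else: unknown '$j' command -- ignore it

# Usage: only the 'syntax' command is handled; the unknown one is skipped.
seen = []
process_j_commands(
    [[('keyword', 'syntax'), ('string', 'wff')],
     [('keyword', 'some_future_command'), ('string', 'x')]],
    {'syntax': lambda rest: seen.append(rest)})
```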
I don't know any verifier or proof assistant that recognizes all of these;
they were added at different times for different tools and set.mm has
historically been pretty okay with adding various kinds of metadata to
support specialized analyses.
On Thu, May 12, 2022 at 7:21 PM Benoit <[email protected]> wrote:
> The EBNF in the Metamath book differs slightly from metamath.c in that "$(
> $( $)" returns an error in metamath.c but is accepted by the EBNF grammar
> and by mmverify.py. I opened an issue there (
> https://github.com/metamath/metamath-book/issues/233).
>
> I clarified readc() in mmverify.py a bit, and I wrote a similar function
> for my OCaml verifier, which can forbid the above example, like
> metamath.c. I copy both below:
> -----------------------------------
> def readc(self) -> StringOption:
>     """Read the next token once included files have been expanded and
>     comments have been skipped.
>     """
>     tok = self.readf()
>     while tok == '$(':
>         # Note that we use 'read' in this while-loop, and not 'readf',
>         # since inclusion statements within a comment are still comments,
>         # so should be skipped.
>         # The following line is not necessary but makes things clearer;
>         # note the similarity with the first three lines of 'readf'.
>         tok = self.read()
>         while tok and tok != '$)':
>             tok = self.read()
>         if not tok:
>             raise MMError("Unclosed comment at end of file.")
>         assert(tok == '$)')
>         tok = self.read()
>     vprint(70, "Token once comments skipped:", tok)
>     return tok
> -----------------------------------
> (** Read a comment (and ignore its content, so does not need to be passed a
> [parse_env] as argument). Comments should not nest, so opening-comment
> tokens are ignored. *)
> let read_comment ic =
>   Printf.printf "Reading a comment.\n";
>   let rec read_comment_content ic =
>     match read_token ic with
>     | Some CloseComm -> Printf.printf "End of comment\n"
>     | Some OpenComm ->
>       Printf.printf "Comments should not nest. Ignoring.\n";
>       read_comment_content ic
>     | Some _ -> read_comment_content ic
>     | None -> Printf.printf "File ended while reading comment. Ignoring.\n"
>   in
>   read_comment_content ic
> -----------------------------------
>
> BenoƮt
>
>
> On Friday, May 13, 2022 at 12:39:25 AM UTC+2 Philip White wrote:
>
>> I'm working on implementing comment parsing in my verifier. It doesn't
>> seem to lend itself well to elegant implementations, however, since the
>> token structure of comments seems to be different from that of
>> non-comment text. In addition, the token rules of $t comments and
>> ordinary comments seem to be different (for example, normal comments
>> don't have the C-style comments, but $t comments do).
>>
>> I can certainly figure this out by handling all the special cases and
>> bolting together a few different parsers, but I'm wondering if anyone
>> has found a nice technique for this.
>>
>> - Philip
>>
> --
> You received this message because you are subscribed to the Google Groups
> "Metamath" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/metamath/2770d75f-6d32-4ab5-be5f-0634296feeaan%40googlegroups.com.
>