On Sat, Dec 3, 2022 at 2:10 PM sgoto <[email protected]> wrote:

> Oh, it would be really neat to make nearley able to parse set.mm! I had
> to disable mine, because it ran out of memory and it was also taking too
> long
>

 I've had a go at this myself.  I'm going to report back now because my
Christmas leave has just ended so it's going to be much slower going from
here.  It is distressingly easy to blow nearley's memory usage up - I think
it's if you give it something with an infinite number of possible
interpretations.  So it's not immediately obvious how to tell the
difference between a bug in my parser spec, and literally running out of
memory.  That can happen too, I've got a good parse of demo0.mm, but my
json parse tree is one hundred times bigger than the original file!
Multiply the size of set.mm by a hundred and that's four gig, and by
default nodejs is limited to a little over one gig I think.  Yes, that's
probably why we're both running out of memory even after debugging our
parser spec to the point where it can parse demo0.mm.  There's just too
much in each individual moo token, and rather a lot of arrays inside arrays
too actually.

The solution, which I haven't started to implement, is to massage the parse
tree as it's being generated.  This will end up as a more pleasing tree
anyway.  Nearley itself does this as it parses our parser spec

https://github.com/kach/nearley/blob/master/lib/nearley-language-bootstrapped.ne#L101

That's the JavaScript function you see on the right hand side, after each
bit of EBNF-like code on the left hand side.

So that's my plan.  I've still got a lot of work to do on my .mm parser
however, but you can see it here

https://github.com/Antony74/mmparsers

The project says 'parsers' (plural) because I also intend to add a parser
for .mmp files.

The format of a .mm file is formally specified by some EBNF code given at
the back of the Metamath book.  I'm trying to produce a semi-canonical
parser by cleaving as close to this spec as I can.

(by the way, I think there's a bracket missing from the EBNF:
https://github.com/metamath/metamath-book/pull/241)

I say semi-canonical because I don't want to skip the comments and the
whitespace, there's useful information in the comments, and I've got a
prettier plugin for formatting .mm files, so I'm interested in whitespace
too.  For some operations performed on a .mm file it's likely to be
desirable to preserve whitespace.  Finally, for performance reasons it's
impractical to use a parser without a lexer.  This deviates considerably
from the appearance of the EBNF.

I also notice that the .mm lexemes COMPRESSED_PROOF_BLOCK, LABEL, and
MATH_SYMBOL, are not lexically distinct.  That means there's some overlap,
you can't tell which of the three you are looking at just from the text,
though it's easy to tell from the context of where it appears in the file.
This is a solvable problem, but I haven't got it right yet - my parser's
not as strict as it should be in terms of what characters it will accept
for these lexemes.

However, the railroad diagram that nearley generates is very nice, and
probably a lot more pleasant to read than the original EBNF.

https://antony74.github.io/mmparsers/src/mmParser/mmParser.html

    Best regards,

        Antony

-- 
You received this message because you are subscribed to the Google Groups 
"Metamath" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/metamath/CAJ48g%2BCGiuQ2F%3D5HjN5HXTHXD76AiEhdfuE8ysh6OT1bcmOY6g%40mail.gmail.com.

Reply via email to