On Sat, Dec 3, 2022 at 2:10 PM sgoto <[email protected]> wrote: > Oh, it would be really neat to make nearley able to parse set.mm! I had > to disable mine, because it ran out of memory and it was also taking too > long >
I've had a go at this myself. I'm going to report back now because my Christmas leave has just ended so it's going to be much slower going from here. It is distressingly easy to blow nearley's memory usage up - I think it's if you give it something with an infinite number of possible interpretations. So it's not immediately obvious how to tell the difference between a bug in my parser spec, and literally running out of memory. That can happen too, I've got a good parse of demo0.mm, but my json parse tree is one hundred times bigger than the original file! Multiply the size of set.mm by a hundred and that's four gig, and by default nodejs is limited to a little over one gig I think. Yes, that's probably why we're both running out of memory even after debugging our parser spec to the point where it can parse demo0.mm. There's just too much in each individual moo token, and rather a lot of arrays inside arrays too actually. The solution, which I haven't started to implement, is to massage the parse tree as it's being generated. This will end up as a more pleasing tree anyway. Nearley itself does this as it parses our parser spec https://github.com/kach/nearley/blob/master/lib/nearley-language-bootstrapped.ne#L101 That's the JavaScript function you see on the right hand side, after each bit of EBNF-like code on the left hand side. So that's my plan. I've still got a lot of work to do on my .mm parser however, but you can see it here https://github.com/Antony74/mmparsers The project says 'parsers' (plural) because I also intend to add a parser for .mmp files. The format of a .mm file is formally specified by some EBNF code given at the back of the Metamath book. I'm trying to produce a semi-canonical parser by cleaving as close to this spec as I can. (by the way, I think there's a bracket missing from the EBNF: https://github.com/metamath/metamath-book/pull/241) I say semi-canonical because I don't want to skip the comments and the whitespace, there's useful information in the comments, and I've got a prettier plugin for formatting .mm files, so I'm interested in whitespace too. For some operations performed on a .mm file it's likely to be desirable to preserve whitespace. Finally, for performance reasons it's impractical to use a parser without a lexer. This deviates considerably from the appearance of the EBNF. I also notice that the .mm lexemes COMPRESSED_PROOF_BLOCK, LABEL, and MATH_SYMBOL, are not lexically distinct. That means there's some overlap, you can't tell which of the three you are looking at just from the text, though it's easy to tell from the context of where it appears in the file. This is a solvable problem, but I haven't got it right yet - my parser's not as strict as it should be in terms of what characters it will accept for these lexemes. However, the railroad diagram that nearley generates is very nice, and probably a lot more pleasant to read than the original EBNF. https://antony74.github.io/mmparsers/src/mmParser/mmParser.html Best regards, Antony -- You received this message because you are subscribed to the Google Groups "Metamath" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/metamath/CAJ48g%2BCGiuQ2F%3D5HjN5HXTHXD76AiEhdfuE8ysh6OT1bcmOY6g%40mail.gmail.com.
