Re: [Metamath] checkmm ported to TypeScript

Antony Bartlett Mon, 29 Aug 2022 06:37:26 -0700

On Mon, May 16, 2022 at 7:41 PM Mario Carneiro <[email protected]> wrote:

> The pop/reverse does seem unnecessary. I would just keep track of the
> position in the array, possibly using a little structure containing the
> array and the current position with some functions to get the current token
> and move to the next. In fact, you can use this interface to parse the next
> token on demand, so that you don't need to precalculate the array of tokens
> (which involves millions of tiny allocations); you just hold on to the
> input string and pull off tokens as you need.
>

On 17/05/2022 16:16, Antony Bartlett wrote:
>>
>> It's verifying set.mm for me in about eight seconds on the MacBook Pro
>> I'm using.  [...]
>>
>> Mario's suggestion could potentially shave off maybe a couple of seconds
>> from that, but I won't know until I try.
>>
>>
It's probably about time I reported back about this.  The fact that
JavaScript is asynchronous is usually one of its strengths.  You load or
download your .mm file, or a chunk of your file.  Use the 'await' or 'then'
keyword, and other parts of your program continue to execute until the file
is ready.  In an .mm file you're typically never more than three tokens
away from the possible end of a file inclusion and needing to load (or
download) an additional file.  That means to avoid having to pre-load all
the .mm data, the gettoken function needs to be asynchronous so we are able
to get a new file should we need one.  There is a slight overhead on making
a function asynchronous which I knew about but have never in my career as a
programmer had a problem with before.  It's a problem when you have
literally millions of calls, as is the case with gettoken because there are
millions (about 30 million was it?) of tokens in set.mm.  The checkmm
verifier slowed down threefold when I tried this - my eight second
verification of set.mm became 24 seconds.

There's a second reason for pre-loading all the .mm data up front, which is
there's no maximum token length in the .mm spec, so the only way of knowing
you have enough data in your buffer to load the next token is to have all
of the data.

I'm aware of one other approach to avoid having to pre-load all the .mm
data, which is to run the parser speculatively, throw an exception if you
run out of data, and where you catch the exception find a point to recover
from, wait for a little more data to be loaded, and try again.  That
requires a very well thought out parser that protects the user of the API
from this unpredictability.  What I have at the moment is a checkmm
verifier that I've deliberately exposed the internals of to make it fully
hackable.  This is good at the moment because a unifier and a html
generator for example both place new and different demands upon a .mm
parser which I have some vague ideas about but don't fully understand yet.
Maybe one day I'll be ready to write a .mm parser that is all things to all
people, very fast, and nicely hidden behind a good api, or maybe checkmm
will always be good enough for my purposes.  Certainly for the moment it's
exactly what I need.

By the way, I stand behind the decision to bundle parser and
verifier together, because I feel a verifier should be small and self
contained, while at the same time there's no reason not to expose the
parser as an API for other programs to use in a context where checkmm is
not being used as a verifier.

When I tried to avoid having a million-token array by lexing the data
twice, the verifier took eight seconds to run.  I couldn't tell if it was
faster or slower than using an array.  When I tried dropping support for
file inclusions so that I only had to lex once, I managed to shave a third
of a second off my eight second running time.

This demonstrates what I never doubted, which is that Mario is right about
a huge token array being an avoidable overhead in a parser/verifier, but
it's not a huge consideration - JavaScript arrays actually seem to be very
fast and efficient at handling a very large number of little strings.
Therefore I'm sticking with the token array for now.

    Best regards,

        Antony

-- 
You received this message because you are subscribed to the Google Groups 
"Metamath" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/metamath/CAJ48g%2BBrs7W0p%2B1_p_Gu9LDcs_Pri-ojk%2B0NrqOFYa6zcDgpQQ%40mail.gmail.com.

Re: [Metamath] checkmm ported to TypeScript

Reply via email to