This discussion is bringing up some interesting topics, including other efforts to apply ML to theorem proving. All very interesting to see.
About serving JSON with JavaScript rendering vs. serving straight HTML -- First I want to say that Mario's comments are persuasive as well as interesting from my point of view. Next, this discussion seems to be rapidly morphing into "JSON versus HTML" rather than "JSON versus plain text", especially given that others are already generating plain text from set.mm for machine-learning purposes. Yes? With that in mind, is there much reason *not* to create JSON *in addition to* HTML?

If disk space is a concern, consider that storing precompressed pages (e.g., with gzip) can be a win-win: the lion's share of requests ask for compressed data anyway, and if the pages are precompressed on the server there is actually less processing to do when serving them. This article shows how to set that up on an nginx server like ours with a handful of lines of shell scripting and server configuration: https://blog.llandsmeer.com/tech/2019/08/29/precompression.html (a rough sketch of the same idea appears after this message).

And for what it's worth, about users turning off JavaScript -- only a tiny percentage of users do. Numbers around 1% get thrown around, but for a more careful analysis see, for example, this blog post from gov.uk -- https://gds.blog.gov.uk/2013/10/21/how-many-people-are-missing-out-on-javascript-enhancement/. They found that out of around 500,000 visits to the gov.uk home page, about 1,000 loaded a "noscript" image, indicating that JavaScript was turned off or not supported. Around 4,000 visits loaded *neither* that image nor the one served when JavaScript was enabled. The post and its comments discuss various reasons this might have happened, including a suggestion that many visitors "have disabled image-loading in their mobile browsers", which would also be an interesting scenario for the metamath site if true. Besides, for those who care enough to visit a site like us.metamath.org, why not just turn JavaScript back on anyway?

This next point may not end up being a big concern, but regarding search engines indexing sites where much of the HTML is generated by JavaScript -- I know that for some years Google Search has been pre-rendering pages during crawl, apparently not to the pixel level but to the document-structure (DOM tree) level, which is the level that matters for concerns about use of JSON. And Google apparently is not alone in this. Here is a fairly detailed article on the subject, somewhat focused on Google: https://www.seoquake.com/blog/render-pages/.

Finally, I find it hard to overstate the potential that can be unlocked by providing suitable (JSON-format) data describing proofs to Web browsers. Browsers these days are really user-interface engines, not just HTML displayers.

-Cris
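[A minimal sketch of the precompression setup the linked article describes. The web root "/var/www/metamath" is a placeholder, and Python is used here only for illustration; the article itself uses a few lines of shell. On the nginx side, the standard "gzip_static on;" directive (included in most distribution builds) makes nginx serve foo.html.gz next to foo.html whenever the client accepts gzip.]

    # precompress.py -- write .gz siblings for static pages so nginx's
    # gzip_static module can serve them without compressing on the fly.
    import gzip
    import shutil
    from pathlib import Path

    WEB_ROOT = Path("/var/www/metamath")   # placeholder; use the real doc root

    def precompress(root: Path) -> None:
        for path in root.rglob("*"):
            if path.suffix not in {".html", ".json", ".css", ".js"}:
                continue
            gz_path = path.with_name(path.name + ".gz")
            # Skip files whose compressed copy is already up to date.
            if gz_path.exists() and gz_path.stat().st_mtime >= path.stat().st_mtime:
                continue
            with open(path, "rb") as src, gzip.open(gz_path, "wb", compresslevel=9) as dst:
                shutil.copyfileobj(src, dst)

    if __name__ == "__main__":
        precompress(WEB_ROOT)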
On Fri, Feb 17, 2023 at 9:49 AM Mario Carneiro <[email protected]> wrote:

>> I do see big downsides for the use cases listed here though:
>>
>> 1. For training a GPT, I fear that the system would focus on "it's a JSON
>> file!" and any tool-specific training would be swamped by thinking it's the
>> same as other JSON files. I think a GPT system that trains on the "Internet
>> as a whole" is more likely to generate some useful information if we
>> minimize distractors. In short, make it REALLY OBVIOUS that this text is
>> regular in some way, yet clearly different from anything else, and expose
>> structure where we reasonably can.
>
> I think this is looking at it the wrong way. Stuff trained on the entire
> internet is going to have a clear upper bound on the precision of the
> answers it can give. I'm not really interested in supplying a poor-quality
> data source just for the purpose of a speculative application like this.
> For projects that are actually interested in doing ML specifically on
> metamath, it is very normal for the researchers to do some data wrangling
> beforehand to prepare things for the model. This kind of glue code is
> usually an ad-hoc Python script, so it helps if we provide a data source
> that is easily manipulable. (I'm working on a paper with Josef Urban
> regarding ATP for metamath, and it benefits significantly from
> metamath-knife as a way of getting information from .mm files, but it's not
> always possible to use mm-knife, and if we are going for something more
> easily digestible from many languages without a dedicated library, JSON is
> a strong choice.)
>
>> 2. For rendering on the client side, there are big advantages to showing
>> static data as straight HTML. For one, people who disable JavaScript can
>> still see it fine. Sure, sometimes it's useful to run arbitrary code sent
>> by strangers, but there are good reasons to minimize that as a
>> requirement :-). It also means that search engines see "what we actually
>> have".
>
> While I have some sympathy for this argument, I also think it is holding
> us back a lot. Catering to the least common denominator of no-JS means that
> our web pages are stuck permanently in web 1.0, with no search
> functionality, intra-page links, subproof expansion, MathML rendering,
> go-to-definition for symbols, or a laundry list of other features.
> Furthermore, doing the rendering client-side means that the cost of
> transmission is lower, which means shorter load times and no need for a
> two-hour processing phase. (Even without making the web pages dynamic, I've
> been considering generating the web pages on demand and using nginx only as
> a cache. I already have a prototype that can serve HTML directly from the
> .mm file, no preprocessing necessary, which means we don't have to worry
> about slow updates and stuff going down due to issues that occurred during
> the massive install script.)
>
> On Fri, Feb 17, 2023 at 4:27 AM Auguste Poiroux <[email protected]> wrote:
>
>> Hi everyone!
>>
>> I am very interested in this discussion topic too. There is an
>> organization currently working on something highly related to this topic,
>> and it is starting to release open-source datasets/models on HuggingFace:
>> https://huggingface.co/hoskinson-center. They provide a math dataset and a
>> first model of 6.7B parameters trained on it. Their math dataset includes
>> several proof databases from different formal systems, including Metamath.
>> They used this code to extract a "human readable" form of the proofs:
>> https://github.com/zhangir-azerbayev/mm-extract. That seems close to the
>> format you propose, David.
>> I think the model they provide can be used to make proof suggestions. I
>> don't know what it is worth, though; I just stumbled across it. It may not
>> be the best, but I think it can be a really good start. The work looks
>> very recent; maybe they will publish some evaluation results for their
>> model soon.
>>
>> Auguste
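[To make Mario's "easily digestible from many languages without a dedicated library" point concrete, the ad-hoc glue code for turning a JSON dump of set.mm into prompt/completion training examples could be as small as the sketch below. The file name "theorems.json" and its fields are invented for illustration; the real schema would be whatever a future exporter settles on.]

    # Sketch of ad-hoc data wrangling over a hypothetical JSON export of set.mm.
    # "theorems.json", "label", "statement", and "proof_steps" are made-up names;
    # the point is only that a JSON dump needs no dedicated library to consume.
    import json

    def load_examples(path: str):
        with open(path, encoding="utf-8") as f:
            db = json.load(f)
        for thm in db["theorems"]:
            # One training example per theorem: the statement as the prompt,
            # the flattened list of proof-step expressions as the completion.
            prompt = f"{thm['label']} {thm['statement']}"
            completion = " ".join(step["expression"] for step in thm["proof_steps"])
            yield {"prompt": prompt, "completion": completion}

    if __name__ == "__main__":
        for example in load_examples("theorems.json"):
            print(json.dumps(example))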
>> On Thursday, February 16, 2023 at 16:05:01 UTC+1, David A. Wheeler wrote:
>>
>>> > On Feb 14, 2023, at 1:46 PM, Jon P <[email protected]> wrote:
>>> >
>>> > I think that's a great idea David! Seems like very low downside, and it
>>> > would be interesting to see if language models could train on the
>>> > datasets.
>>> >
>>> > There's already an option to customise GPT-3 with additional datasets,
>>> > so once this is ready it could be an interesting experiment to try.
>>> >
>>> > https://openai.com/blog/customized-gpt-3/
>>>
>>> Thanks so much for that link. So I looked into creating a
>>> specially-trained model to create metamath proofs. That page points to
>>> this pricing model for GPT-3: https://openai.com/api/pricing/#faq-token
>>>
>>> I did a quick estimate of training costs for the top GPT-3 model on all
>>> of set.mm. It comes out to about $2600. That's less than I expected, and
>>> it's easily affordable in a grant (you could train multiple times), but
>>> it's not an amount I have pocket change for :-(. Inference isn't
>>> expensive; trying to generate axtgcgrrflx is $0.15 (15 cents). See below
>>> for how I created the quick estimate. We could do a subset of set.mm
>>> (e.g., skip mathboxes), though fewer training examples would typically
>>> produce less-good results.
>>>
>>> I can't possibly afford that myself. That said, there are alternatives.
>>> There are many other Generative Pre-trained Transformer (GPT) models that
>>> are less capable but also less expensive. My current personal computer is
>>> useless for ML training, but Kaggle.com provides a nice Jupyter notebook
>>> interface for AI/ML work that's free to use. Here are some pages that
>>> discuss using GPT-2 and Kaggle:
>>>
>>> https://www.kaggle.com/code/changyeop/how-to-fine-tune-gpt-2-for-beginners
>>> https://www.kaggle.com/code/nulldata/fine-tuning-gpt-2-to-generate-netlfix-descriptions/notebook
>>> https://www.kaggle.com/general/116387
>>>
>>> I had also been looking at adding definition checks to metamath-knife,
>>> and maybe trying to help this new JavaScript metamath proof assistant.
>>> Sigh, not enough time in the day :-).
>>>
>>> --- David A. Wheeler
>>>
>>> === Cost estimate ===
>>>
>>> Here's a cost estimate using https://openai.com/api/pricing/
>>>
>>> OpenAI's most powerful GPT-3 model is Davinci. Their charges are:
>>> * Fine-tuned training: $0.0300 / 1K tokens
>>> * Usage: $0.1200 / 1K tokens
>>>
>>> I did a quick estimate of tokens in the entirety of set.mm using:
>>> metamath 'set scroll continuous' 'set width 9999' 'read set.mm' 'show
>>> statement *' 'show proof * /lemmon /renumber' 'quit' | grep -v -- '-------' | wc
>>> That yielded 64218080 words. A token is about 4 characters or 0.75 words,
>>> so for a quick estimate, tokens = 64218080/0.75 = 85624107.
>>> Fine-tuned training on this entire set = (85624107/1000)*$0.03 = $2569.
>>>
>>> Theorem axtgcgrrflx: 942 words, tokens = 942/0.75 = 1256,
>>> cost = (1256/1000)*$0.12 = $0.15
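[For what it's worth, David's arithmetic above checks out; a few lines of Python reproduce it. The prices are the Davinci rates quoted in his cost estimate, and 0.75 words per token is the rule of thumb he uses there.]

    # Back-of-the-envelope check of the fine-tuning cost estimate above.
    WORDS_PER_TOKEN = 0.75        # rule of thumb: one token is about 0.75 words
    TRAIN_RATE = 0.03 / 1000      # Davinci fine-tuned training, dollars per token
    USAGE_RATE = 0.12 / 1000      # Davinci usage, dollars per token

    def tokens(words: int) -> int:
        return round(words / WORDS_PER_TOKEN)

    set_mm_words = 64218080       # word count from the metamath | grep | wc pipeline
    print(f"set.mm tokens:          {tokens(set_mm_words):,}")                    # ~85,624,107
    print(f"fine-tuning cost:       ${tokens(set_mm_words) * TRAIN_RATE:,.0f}")   # ~$2,569

    axtgcgrrflx_words = 942
    print(f"generating axtgcgrrflx: ${tokens(axtgcgrrflx_words) * USAGE_RATE:.2f}")  # ~$0.15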
