Re: [Metamath] Idea: Release ".txt" proofs for set.mm databases ("mpetxt") for AI/ML tools, esp Generative Pre-trained Transformers (GPTs)

David A. Wheeler Sat, 18 Feb 2023 14:16:55 -0800

> On Feb 18, 2023, at 4:08 PM, Mario Carneiro <[email protected]> wrote:
> In case it wasn't clear (and it probably wasn't), in a future world in which 
> the default mechanism is some super fancy client side renderer it is still 
> possible to offer a fallback "static site" of HTML pages which is 
> nevertheless not generated in advance and is being served dynamically 
> server-side. This has a relatively small generation cost even if we don't 
> stick a CDN in front of it, but if we do then I think that's a viable 
> approach to having all the HTML pages with basically zero preprocessing, just 
> a smarter server. I would really love it if the only thing the site build 
> needs to do is download and load the latest set.mm .

I understand, that's just a typical dynamically-generated data file and/or HTML 
with possibly front-end JavaScript. Bog standard stuff. Of course, there are 
pluses & minuses with that approach. In the 2010s almost everyone switched to 
that model, today there's a big press back to statically-regenerated sites 
where practical (with Jekyll, etc.) because attackers routinely pop systems 
that do dynamic response.

There are big concerns if you hook up a program written in C to input sent by 
an attacker. Rigorous input filtering would definitely help, but using 
something like metamath-knife as a basis instead of metamath-exe might be 
better. So doing this would involve a nontrivial amount of software, *along* 
with analysis to ensure that it was unlikely to lead to a vulnerability. 
Statically generating files is a much simpler path from where we are. That's 
not to say we must go that way, but it'd take more effort to do full dynamic 
generation *with* confidence in its security.


> I don't think this is a very competitive option though, since you would have 
> to load all of set.mm.

It's more competitive than you might think. Igor's prover loads the whole 
database in a few seconds. That said, I agree with:

> The advantage of pre-processed json files is that when you have a single 
> theorem you want to look at you don't need to request everything and start up 
> a whole client-side MM verifier first.


> What started this thread was trying to see if we could get "free" training on 
> GPT-3 by presenting something easier for them to consume, *without* 
> controlling them. I don't think generating JSON will do that. The JSON 
> certainly could make it easier for organizations who want to specially train 
> ML systems on the existing databases, and are willing to do "a little 
> cleaning" by processing the JSON. If we pregenerate JSON from the .mm 
> databases, the most likely use cases would be to make it easier for rendering 
> tools & ML labs that want to do a little (but not a lot) of work.
> 
> Has anyone asked for this?

Absolutely not!! I was noticing that Microsoft & Google are really doubling 
down on web scanning into GPTs, and trying to see if we could take advantage of 
that.

--- David A. Wheeler

-- 
You received this message because you are subscribed to the Google Groups 
"Metamath" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/metamath/3E04D159-5A99-485B-8B1E-4C8524CD5D0E%40dwheeler.com.
Re: [Metamath] Idea: Release ".txt" proofs for set.mm databases ("mpetxt") for AI/ML tools, esp Generative Pre-trained Transformers (GPTs)

Reply via email to