David A. Wheeler:
> I do see big downsides for the use cases listed here though:
> 1. For training a GPT, I fear that the system would focus on "it's a JSON 
> file!" ....

> On Feb 17, 2023, at 12:49 PM, Mario Carneiro <[email protected]> wrote:
> I think this is looking at it the wrong way. Stuff trained on the entire 
> internet is going to have a clear upper bound on the precision of the answers 
> it can give. I'm not really interested in supplying a poor quality data 
> source just for the purpose of a speculative application like this. For 
> projects that are actually interested in doing ML specifically on metamath, 
> it is very normal to have some data wrangling by the researchers beforehand 
> to prepare things for the model. This kind of glue code is usually some 
> ad-hoc python script so it helps if we are providing a data source that is 
> easily manipulable.
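(To illustrate the kind of ad-hoc Python glue script being described here, a deliberately naive sketch that pulls `$p` statements out of raw .mm text into JSON-ready records — `extract_theorems` and its regex are my own illustration, not a real parser; it ignores `$( ... $)` comments, scoping braces, and compressed proofs, all of which a real tool like metamath-knife handles properly:

```python
import json
import re

def extract_theorems(mm_text):
    """Pull labeled $p (provable) statements out of raw .mm text.

    Naive sketch of an ad-hoc data-wrangling script: it ignores
    comments, scoping, and compressed-proof details.
    """
    # label $p <math symbols> $= <proof labels> $.
    pattern = re.compile(r"(\S+)\s+\$p\s+(.*?)\s+\$=\s+(.*?)\s+\$\.", re.DOTALL)
    return [
        {"label": label, "statement": stmt.split(), "proof": proof.split()}
        for label, stmt, proof in pattern.findall(mm_text)
    ]

sample = "th1 $p |- t = t $= tt $."
print(json.dumps(extract_theorems(sample)))
# → [{"label": "th1", "statement": ["|-", "t", "=", "t"], "proof": ["tt"]}]
```

Researchers typically write a one-off script like this per project, which is why an easily manipulable source format matters.)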

Sure, ML training normally does a lot of wrangling to make clean data. I've 
done some ML work myself (not on Metamath, sadly). But since my wallet isn't 
deep, I was speculating that by providing easier-to-train-on text we might get 
some more value out of the investments others are making. I think it'd be 
relatively easy to do, and then we could see if it has a benefit.

> (I'm working on a paper with Josef Urban regarding ATP for metamath and it 
> benefits significantly from metamath-knife as a way of getting information 
> from mm files, but it's not always possible to use mm-knife and if we are 
> going for something more easily digestible from many languages without a 
> dedicated library JSON is a strong choice.)

David A. Wheeler:
> 2. For rendering on the client side, there are big advantages to showing 
> static data
> as straight HTML. For one, people who disable JavaScript can still see it 
> fine.
> Sure, sometimes it's useful to run arbitrary code sent by strangers, but 
> there are good
> reasons to minimize that as a requirement :-). It also means that search 
> engines see
> "what we actually have".

> While I have some sympathy for this argument, I also think it is holding us 
> back a lot. Catering to the least common denominator of no-JS means that our 
> web pages are stuck permanently in web 1.0, with no search functionality, 
> intra-page links, subproof expansion, mathml rendering, go-to-definition for 
> symbols, or a laundry list of other features. Furthermore, doing the 
> rendering client-side means that the cost of transmission is lower which 
> means shorter load times and no need for a two-hour processing phase.

What I want is *a* way for people to access the basic information without 
*requiring* client-side JavaScript. I see no reason that it should be the *only* 
way we provide the information; we already provide it in multiple forms. I just 
want to retain that capability. So you don't need to feel held back :-).

Indeed, you can implement rendering with client-side JavaScript today; just 
download the .mm files. The current CORS setting means that any client-side 
JavaScript program can do that; it doesn't even need to be hosted on 
us.metamath.org. So technically people today can view the generated HTML on the 
website, *or* use any client-side JavaScript/WebAssembly. We "just" need 
someone to write the client-side renderer.

Also, almost everything you listed doesn't require client-side JavaScript. 
Intra-page links work (they're just hyperlinks), subproof expansion works (use 
HTML <details>), MathML ships in Chromium & Chrome & Firefox (Firefox's support 
is only partial but it's probably more than enough for us) per 
<https://developer.mozilla.org/en-US/docs/Web/MathML>, go-to-definition is just 
another hyperlink, and so on. Basic searches can be supported via a form that 
queries a search engine (e.g., Google).
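(As a concrete sketch of the subproof-expansion point: a daily pregeneration step could emit plain `<details>`/`<summary>` markup, which browsers expand and collapse with no JavaScript at all. The `render_subproof` generator below is my own hypothetical illustration, not the output of any current Metamath tool:

```python
import html

def render_subproof(label, step_lines):
    """Render one subproof as a collapsible block using plain HTML
    <details>/<summary>, so expansion works with JavaScript disabled.
    Hypothetical markup for illustration only.
    """
    steps = "\n".join(
        "    <li><code>{}</code></li>".format(html.escape(s)) for s in step_lines
    )
    return (
        "<details>\n"
        "  <summary>Subproof: {}</summary>\n"
        "  <ol>\n{}\n  </ol>\n"
        "</details>".format(html.escape(label), steps)
    )

print(render_subproof("2p2e4", ["( 2 + 2 ) = ( 2 + 2 )", "( 2 + 2 ) = 4"]))
```

The resulting page is static HTML, so it also degrades gracefully for search engines and no-JS users.)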

As far as speed goes, JavaScript is often slower than HTML where both can be 
used. In practice JavaScript files are often quite large, and JavaScript has to 
be parsed, examined, executed, and then its execution must cause document tree 
changes which finally get rendered. Web browsers are *really* good at rapidly & 
progressively rendering HTML. On systems I've worked with, HTML (even with 
server roundtrips) is often *MUCH* faster when measuring paint-to-screen times. 
JavaScript *has* gotten faster over the years, thankfully, but the gap is still 
large. Of course, when you need functionalities only provided by JavaScript 
then you need it. Like all technologies there are tradeoffs.

> On Feb 17, 2023, at 3:52 PM, Cris Perdue <[email protected]> wrote:

> It is a tiny percentage of users who turn off JavaScript. 


I'm one of those people who sometimes turns off JavaScript, and many sites work 
fine. The more you learn about security, the more worrying it becomes to "run 
all code sent to your web browser from attackers", especially since all 
browsers' JavaScript engines are large (& thus hard to review & have endless 
vulnerabilities), and websites often have no idea what JavaScript they're 
sending to you (much comes from ad brokers from other ad brokers from anyone 
who wants to make someone run code they created). Let's try to support users who 
are trying to protect themselves.

> Is there much reason not to create JSON in addition to HTML? If disk space is 
> a concern, consider that it can be a win-win to store precompressed pages 
> (i.e. with gzip). 


No *big* reason. Indeed, if we want to make it *easy* for people to generate 
their own renderers or analyze Metamath results without specialized parsing, 
the obvious way to do that would be to generate JSON. I think that'd make a lot 
of sense. That still involves a daily pre-run, but I don't think there's any 
serious problem in doing 1/day regenerations.
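(A minimal sketch of what that daily pre-run could produce, combining the JSON idea with Cris's precompression suggestion. The record shape here is a hypothetical placeholder — the real schema is exactly the "specifics" still to be figured out:

```python
import gzip
import json

# Hypothetical per-theorem record; the actual schema is undecided.
record = {
    "label": "2p2e4",
    "statement": ["|-", "(", "2", "+", "2", ")", "=", "4"],
    "hypotheses": [],
    "proof": ["2cn", "2cn", "2p2e4"],  # placeholder step labels
}

raw = json.dumps(record, separators=(",", ":")).encode("utf-8")
packed = gzip.compress(raw)  # what the pre-run would store as .json.gz

# A web server can send the precompressed bytes as-is with
# "Content-Encoding: gzip"; any consumer can still recover the JSON:
assert json.loads(gzip.decompress(packed)) == record
print(len(raw), len(packed))
```

Storing only the `.json.gz` files and serving them directly is the "win-win": less disk, less bandwidth, and no per-request compression cost.)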

We'd have to figure out specifics, which means figuring out how the JSON would 
be used. If we generate a separate JSON file for each theorem, then it's really 
easy for a renderer to get just what it needs. However, that's a lot of files, 
and compression might not help our storage needs: the filesystem block size is 
4KiB, so anything under 4KiB still occupies a full 4KiB. We could generate a single 
massive JSON file of all theorems; that might be useful for ML training if they 
plan to process the JSON further, but the file would be huge so I think a 
client-side program would be better off loading the .mm database directly 
instead. A massive JSON file might be useful for other purposes, and that 
*would* be best stored compressed.
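(The block-size arithmetic is worth making explicit. The theorem count and average size below are assumptions for illustration, not measurements:

```python
import math

BLOCK = 4096  # the 4 KiB filesystem block size mentioned above

def on_disk(size_bytes):
    """Bytes actually consumed: file size rounded up to whole blocks.
    Assumes size_bytes > 0."""
    return math.ceil(size_bytes / BLOCK) * BLOCK

# Illustrative figures (assumptions, not measurements): ~40k theorems
# averaging ~1 KiB of JSON each.
n, avg = 40_000, 1024
per_theorem = n * on_disk(avg)   # each small file pays a full 4 KiB block
single_file = on_disk(n * avg)   # rounding overhead amortized to one file
print(per_theorem, single_file)  # → 163840000 40960000
```

Under these assumptions, per-theorem files take roughly 4x the disk of one concatenated file, and gzip can't help because sub-block savings don't reduce blocks used.)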

We currently don't have *lots* of extra storage space. If we add a major new 
set of generated data it's likely we'll need more space. We could retire 
mpegif, add more storage ($, which I currently pay, though maybe others could 
chip in), or look for another provider ($?).

What started this thread was trying to see if we could get "free" training on 
GPT-3 by presenting something easier for them to consume, *without* controlling 
them. I don't think generating JSON will do that. The JSON certainly could make 
it easier for organizations who want to specially train ML systems on the 
existing databases, and are willing to do "a little cleaning" by processing the 
JSON. If we pregenerate JSON from the .mm databases, the most likely users 
would be rendering tools & ML labs that want to do a little (but not a lot) of 
work.

--- David A. Wheeler
