> I do see big downsides for the use cases listed here though:
> 1. For training a GPT, I fear that the system would focus on "it's a JSON file!"
> and any tool-specific training would be swamped by thinking it's the same
> as other JSON files.
> I think a GPT system that trains on the "Internet as a whole" is more likely to
> generate some useful information if we minimize distractors.
> In short, make it REALLY OBVIOUS that this text is regular in some way, yet
> clearly different from anything else, and expose structure where we
> reasonably can.
I think this is looking at it the wrong way. A model trained on the entire
internet is going to have a clear upper bound on the precision of the answers
it can give, and I'm not really interested in supplying a poor-quality data
source just for the purpose of a speculative application like this. For
projects that are actually interested in doing ML specifically on metamath,
it is very normal for the researchers to do some data wrangling beforehand to
prepare things for the model. That glue code is usually an ad-hoc Python
script, so it helps if we provide a data source that is easily manipulable;
a sketch of the kind of script I have in mind is below, after my reply.
(I'm working on a paper with Josef Urban regarding ATP for metamath, and it
benefits significantly from metamath-knife as a way of getting information
out of .mm files, but it's not always possible to use metamath-knife, and if
we are going for something easily digestible from many languages without a
dedicated library, JSON is a strong choice.)

> 2. For rendering on the client side, there are big advantages to showing
> static data as straight HTML. For one, people who disable JavaScript can
> still see it fine. Sure, sometimes it's useful to run arbitrary code sent
> by strangers, but there are good reasons to minimize that as a
> requirement :-). It also means that search engines see "what we actually
> have".

While I have some sympathy for this argument, I also think it is holding us
back a lot. Catering to the least common denominator of no-JS means that our
web pages are stuck permanently in web 1.0, with no search functionality,
intra-page links, subproof expansion, MathML rendering, go-to-definition for
symbols, or a laundry list of other features. Furthermore, doing the
rendering client-side means that the cost of transmission is lower, which
means shorter load times and no need for a two-hour processing phase. (Even
without making the web pages dynamic, I've been considering generating the
pages on demand and using nginx only as a cache. I already have a prototype
that can serve HTML directly from the .mm file, no preprocessing necessary,
which means we don't have to worry about slow updates or the site going down
because of problems during the massive install script. A sketch of that setup
is also below.)
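To make the "easily manipulable" point concrete, here is a minimal sketch of
the kind of ad-hoc glue script I mean. The schema is hypothetical (I'm
assuming a top-level "decls" array whose entries carry "label", "type",
"math", and "proof" fields; the real export format is still up for
discussion), but the point is that a researcher can go from the raw JSON to a
training-ready dump in a few lines of Python, no Metamath-specific library
required:

```python
# Sketch only: assumes a hypothetical JSON export of set.mm with a top-level
# "decls" array, where each entry has "label", "type" ("$a"/"$p"/...),
# "math" (the statement as a token list), and optionally "proof".
import json

with open("set.json") as f:
    db = json.load(f)

# Flatten every provable assertion into a prompt/target pair that an ML
# pipeline could consume directly.
examples = []
for decl in db["decls"]:
    if decl.get("type") != "$p":
        continue
    statement = " ".join(decl["math"])
    proof = " ".join(decl.get("proof", []))
    examples.append({"prompt": statement, "completion": proof})

with open("training_pairs.jsonl", "w") as out:
    for ex in examples:
        out.write(json.dumps(ex) + "\n")
```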
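And for the on-demand rendering idea, a toy sketch of the server side. The
renderer here is just a placeholder (the actual prototype does the real HTML
generation from the .mm file, and a production version would use a proper web
framework); what it shows is the shape of the setup: each page is produced at
request time, and the Cache-Control header lets nginx sit in front purely as
a cache.

```python
# Toy sketch of "render on demand, let nginx cache it". The renderer is a
# stand-in; port, paths, and cache lifetime are illustrative only.
from http.server import BaseHTTPRequestHandler, HTTPServer
import html

def render_page(label: str) -> str:
    # Stand-in for the real renderer, which reads the statement and proof
    # out of set.mm and produces full HTML. A placeholder keeps this sketch
    # self-contained and runnable.
    return f"<h1>{html.escape(label)}</h1><p>(rendered from set.mm on demand)</p>"

class MMPageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        label = self.path.strip("/")
        if not label:
            self.send_error(404)
            return
        body = render_page(label).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        # A long max-age lets nginx (or any reverse proxy) act as the cache,
        # so repeat hits never reach this process.
        self.send_header("Cache-Control", "public, max-age=86400")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), MMPageHandler).serve_forever()
```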
On Fri, Feb 17, 2023 at 4:27 AM Auguste Poiroux <[email protected]> wrote:

> Hi everyone!
>
> I am very interested in this discussion topic too. There is an
> organization that is currently working on something highly related to this
> topic and is starting to release open-source datasets/models on
> HuggingFace: https://huggingface.co/hoskinson-center. They provide a math
> dataset and a first model of 6.7B parameters trained on it. In their math
> dataset, they put several proof databases from different formal systems,
> including Metamath. They used this code:
> https://github.com/zhangir-azerbayev/mm-extract
> to extract a "human readable" form of the proofs. This seems to be close
> to the format you propose, David.
> I think the model they provide can be used to make proof suggestions. I
> don't know what this is worth though, I just stumbled across it. It may not
> be the best, but I think it can be a really good start.
> The work looks very recent, maybe they will publish some evaluation
> results of their model soon.
>
> Auguste
>
> On Thursday, February 16, 2023 at 16:05:01 UTC+1, David A. Wheeler wrote:
>
>> > On Feb 14, 2023, at 1:46 PM, Jon P <[email protected]> wrote:
>> >
>> > I think that's a great idea David! Seems like very low downside and it
>> > would be interesting to see if language models could train on the
>> > datasets.
>> >
>> > There's already an option to Customise GPT3 with additional datasets so
>> > once this is ready it could be an interesting experiment to try.
>> >
>> > https://openai.com/blog/customized-gpt-3/
>>
>> Thanks so much for that link. So I looked into creating a
>> specially-trained model to create metamath proofs. That page points to
>> this pricing model for GPT-3:
>> https://openai.com/api/pricing/#faq-token
>>
>> I did a quick estimate of training costs for the top GPT-3 model on
>> all of set.mm. It comes out to about $2600.
>> That's less than I expected, and that's easily affordable in a grant (you
>> could train multiple times), but it's not an amount I have pocket change
>> for :-(.
>> Inference isn't expensive; trying to generate axtgcgrrflx is $0.15 (15
>> cents). See below for how I created the quick estimate.
>> We could do a subset of set.mm (e.g., skip mathboxes, etc.), though fewer
>> training examples would typically produce less-good results.
>>
>> I can't possibly afford that myself. That said, there are alternatives.
>> There are many other Generative Pre-trained Transformer (GPT) models that
>> are less capable but also less expensive. My current personal computer is
>> useless for ML training, but Kaggle.com provides a nice Jupyter notebook
>> interface for AI/ML work that's free to use. Here are some pages that
>> discuss using GPT-2 and Kaggle:
>> https://www.kaggle.com/code/changyeop/how-to-fine-tune-gpt-2-for-beginners
>> https://www.kaggle.com/code/nulldata/fine-tuning-gpt-2-to-generate-netlfix-descriptions/notebook
>> https://www.kaggle.com/general/116387
>>
>> I had also been looking at adding definition checks to metamath-knife,
>> and maybe trying to help this new JavaScript metamath proof assistant.
>> Sigh, not enough time in the day :-).
>>
>> --- David A. Wheeler
>>
>>
>> === Cost estimate ===
>>
>> Here's a cost estimate using https://openai.com/api/pricing/
>>
>> OpenAI's most powerful GPT-3 model is Davinci. Their charges are:
>> * Fine-tuned training: $0.0300 / 1K tokens
>> * Usage: $0.1200 / 1K tokens
>>
>> I did a quick estimate of tokens in the entirety of set.mm using:
>> metamath 'set scroll continuous' 'set width 9999' 'read set.mm' 'show
>> statement *' 'show proof * /lemmon /renumber' 'quit' | grep -v -- '-------'
>> | wc
>> That yielded 64218080 words. A token is about 4 characters or 0.75 words,
>> so for a quick estimate, tokens = 64218080/0.75 = 85624107 tokens.
>> Fine-tuned training of this entire set = (85624107/1000)*$0.03 = $2569.
>>
>> Theorem axtgcgrrflx: 942 words, tokens = 942/0.75 = 1256,
>> cost = (1256/1000)*$0.12 = $0.15
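For what it's worth, David's back-of-the-envelope estimate above is easy to
re-run if the database grows or the pricing changes. Here is a tiny script
using only the figures from the quoted message (the word count from the
metamath dump, the 0.75 words-per-token rule of thumb, and the February 2023
Davinci prices); swap in new numbers as needed:

```python
# Reproduces the quick cost estimate from the quoted message. All constants
# are the figures cited there, not current pricing.
WORDS_PER_TOKEN = 0.75   # OpenAI's rough rule of thumb
TRAIN_PER_1K = 0.03      # Davinci fine-tuned training, $ per 1K tokens
USAGE_PER_1K = 0.12      # Davinci usage, $ per 1K tokens

def tokens(words: int) -> float:
    return words / WORDS_PER_TOKEN

def training_cost(words: int) -> float:
    return tokens(words) / 1000 * TRAIN_PER_1K

def usage_cost(words: int) -> float:
    return tokens(words) / 1000 * USAGE_PER_1K

print(f"set.mm fine-tuning: ${training_cost(64218080):,.0f}")  # ~ $2,569
print(f"axtgcgrrflx usage:  ${usage_cost(942):.2f}")           # ~ $0.15
```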
