Hi everyone! I'm very interested in this discussion topic too. There is an organization currently working on something closely related, and it has started to release open-source datasets/models on HuggingFace: https://huggingface.co/hoskinson-center. They provide a math dataset and a first 6.7B-parameter model trained on it. Their math dataset gathers proof databases from several formal systems, including Metamath. They used this code: https://github.com/zhangir-azerbayev/mm-extract to extract a "human-readable" form of the proofs, which seems close to the format you propose, David. I think the model they provide could be used to make proof suggestions; a minimal sketch follows below. I don't know how good it is, though; I just stumbled across it. It may not be the best, but I think it could be a really good starting point. The work looks very recent, so maybe they will publish evaluation results for their model soon.
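For anyone who wants to experiment, here is a minimal sketch of querying such a model through the HuggingFace transformers library. The model id and the prompt format are assumptions on my part; check the model card on the hub for the intended usage:

from transformers import AutoModelForCausalLM, AutoTokenizer

# NOTE: the model id below is a guess -- verify it on the HuggingFace hub.
model_id = "hoskinson-center/proofGPT-v0.1-6.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# A 6.7B model needs a large GPU; on CPU this will be very slow.
model = AutoModelForCausalLM.from_pretrained(model_id)

# Toy prompt -- the real training format may differ (see the dataset card).
prompt = "theorem 2p2e4: ( 2 + 2 ) = 4\nproof:\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))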
Auguste

On Thursday, 16 February 2023 at 16:05:01 UTC+1, David A. Wheeler wrote:
> > On Feb 14, 2023, at 1:46 PM, Jon P <[email protected]> wrote:
> >
> > I think that's a great idea David! Seems like very low downside, and it
> > would be interesting to see if language models could train on the datasets.
> >
> > There's already an option to customise GPT-3 with additional datasets, so
> > once this is ready it could be an interesting experiment to try.
> >
> > https://openai.com/blog/customized-gpt-3/
>
> Thanks so much for that link. So I looked into creating a specially-trained
> model to create Metamath proofs. That page points to this pricing model for
> GPT-3:
> https://openai.com/api/pricing/#faq-token
>
> I did a quick estimate of training costs for the top GPT-3 model on all of
> set.mm. It comes out to about $2600. That's less than I expected, and it's
> easily affordable in a grant (you could train multiple times), but it's not
> an amount I have pocket change for :-(. Inference isn't expensive; trying to
> generate axtgcgrrflx is $0.15 (15 cents). See below for how I created the
> quick estimate. We could do a subset of set.mm (e.g., skip mathboxes), though
> fewer training examples would typically produce less-good results.
>
> I can't possibly afford that myself. That said, there are alternatives:
> many other Generative Pre-trained Transformer (GPT) models are less capable
> but also less expensive. My current personal computer is useless for ML
> training, but Kaggle.com provides a nice Jupyter notebook interface for
> AI/ML work that's free to use. Here are some pages that discuss using GPT-2
> and Kaggle:
> https://www.kaggle.com/code/changyeop/how-to-fine-tune-gpt-2-for-beginners
> https://www.kaggle.com/code/nulldata/fine-tuning-gpt-2-to-generate-netlfix-descriptions/notebook
> https://www.kaggle.com/general/116387
>
> I had also been looking at adding definition checks to metamath-knife, and
> maybe trying to help this new JavaScript Metamath proof assistant. Sigh, not
> enough time in the day :-).
>
> --- David A. Wheeler
>
> === Cost estimate ===
>
> Here's a cost estimate using https://openai.com/api/pricing/
>
> OpenAI's most powerful GPT-3 model is Davinci. Their charges are:
> * Fine-tuned training: $0.0300 / 1K tokens
> * Usage: $0.1200 / 1K tokens
>
> I did a quick estimate of tokens in the entirety of set.mm using:
> metamath 'set scroll continuous' 'set width 9999' 'read set.mm' 'show statement *' 'show proof * /lemmon /renumber' 'quit' | grep -v -- '-------' | wc
> That yielded 64218080 words. A token is about 4 characters or 0.75 words, so
> for a quick estimate, tokens = 64218080/0.75 = 85624107 tokens. Fine-tuned
> training on this entire set = (85624107/1000)*$0.03 = $2569.
>
> Theorem axtgcgrrflx: 942 words, tokens = 942/0.75 = 1256, cost =
> (1256/1000)*$0.12 = $0.15
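As a quick check, David's back-of-the-envelope arithmetic is easy to reproduce. Here is a small Python sketch using the word counts and the Davinci prices quoted above (the prices may of course have changed since):

WORDS_PER_TOKEN = 0.75       # OpenAI's rule of thumb: 1 token ~= 0.75 words
TRAIN_PRICE_PER_1K = 0.03    # $ per 1K tokens, Davinci fine-tuned training
USAGE_PRICE_PER_1K = 0.12    # $ per 1K tokens, Davinci usage

def tokens(words):
    return words / WORDS_PER_TOKEN

setmm_words = 64_218_080     # word count of the `show proof` dump above
print(f"training: ${tokens(setmm_words) / 1000 * TRAIN_PRICE_PER_1K:,.0f}")
# -> training: $2,569

axtgcgrrflx_words = 942
print(f"inference: ${tokens(axtgcgrrflx_words) / 1000 * USAGE_PRICE_PER_1K:.2f}")
# -> inference: $0.15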
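And for the cheaper GPT-2 route, a fine-tuning run with the HuggingFace transformers library (the kind of thing the Kaggle notebooks linked above walk through) looks roughly like the sketch below. The training file name and the hyperparameters are placeholders; you would first have to dump the set.mm proofs to plain text yourself:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# "setmm_proofs.txt" is a placeholder: a plain-text dump of proofs, e.g.
# produced with `show proof * /lemmon /renumber` as in the estimate above.
dataset = load_dataset("text", data_files={"train": "setmm_proofs.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-setmm", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()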
