Hi everyone! I'm very interested in this discussion topic too. There is an organization currently working on something closely related, and it has started to release open-source datasets/models on HuggingFace: https://huggingface.co/hoskinson-center. They provide a math dataset and a first 6.7B-parameter model trained on it. Their math dataset gathers proof databases from several formal systems, including Metamath. They used this code: https://github.com/zhangir-azerbayev/mm-extract to extract a "human-readable" form of the proofs, which seems close to the format you propose, David. I think the model they provide could be used to make proof suggestions; a minimal sketch follows below. I don't know how good it is, though; I just stumbled across it. It may not be the best, but I think it could be a really good starting point. The work looks very recent, so maybe they will publish evaluation results for their model soon.
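For anyone who wants to experiment, here is a minimal sketch of querying such a model through the HuggingFace transformers library. The model id and the prompt format are assumptions on my part; check the model card on the hub for the intended usage:

from transformers import AutoModelForCausalLM, AutoTokenizer

# NOTE: the model id below is a guess -- verify it on the HuggingFace hub.
model_id = "hoskinson-center/proofGPT-v0.1-6.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# A 6.7B model needs a large GPU; on CPU this will be very slow.
model = AutoModelForCausalLM.from_pretrained(model_id)

# Toy prompt -- the real training format may differ (see the dataset card).
prompt = "theorem 2p2e4: ( 2 + 2 ) = 4\nproof:\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))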
Auguste

On Thursday, 16 February 2023 at 16:05:01 UTC+1, David A. Wheeler wrote:
> > On Feb 14, 2023, at 1:46 PM, Jon P <[email protected]> wrote:
> >
> > I think that's a great idea David! Seems like very low downside, and it
> > would be interesting to see if language models could train on the datasets.
> >
> > There's already an option to customise GPT-3 with additional datasets, so
> > once this is ready it could be an interesting experiment to try.
> >
> > https://openai.com/blog/customized-gpt-3/
>
> Thanks so much for that link. So I looked into creating a specially-trained
> model to create Metamath proofs. That page points to this pricing model for
> GPT-3:
> https://openai.com/api/pricing/#faq-token
>
> I did a quick estimate of training costs for the top GPT-3 model on all of
> set.mm. It comes out to about $2600. That's less than I expected, and it's
> easily affordable in a grant (you could train multiple times), but it's not
> an amount I have pocket change for :-(. Inference isn't expensive; trying to
> generate axtgcgrrflx is $0.15 (15 cents). See below for how I created the
> quick estimate. We could do a subset of set.mm (e.g., skip mathboxes), though
> fewer training examples would typically produce less-good results.
>
> I can't possibly afford that myself. That said, there are alternatives:
> many other Generative Pre-trained Transformer (GPT) models are less capable
> but also less expensive. My current personal computer is useless for ML
> training, but Kaggle.com provides a nice Jupyter notebook interface for
> AI/ML work that's free to use. Here are some pages that discuss using GPT-2
> and Kaggle:
> https://www.kaggle.com/code/changyeop/how-to-fine-tune-gpt-2-for-beginners
> https://www.kaggle.com/code/nulldata/fine-tuning-gpt-2-to-generate-netlfix-descriptions/notebook
> https://www.kaggle.com/general/116387
>
> I had also been looking at adding definition checks to metamath-knife, and
> maybe trying to help this new JavaScript Metamath proof assistant. Sigh, not
> enough time in the day :-).
>
> --- David A. Wheeler
>
> === Cost estimate ===
>
> Here's a cost estimate using https://openai.com/api/pricing/
>
> OpenAI's most powerful GPT-3 model is Davinci. Their charges are:
> * Fine-tuned training: $0.0300 / 1K tokens
> * Usage: $0.1200 / 1K tokens
>
> I did a quick estimate of tokens in the entirety of set.mm using:
> metamath 'set scroll continuous' 'set width 9999' 'read set.mm' 'show statement *' 'show proof * /lemmon /renumber' 'quit' | grep -v -- '-------' | wc
> That yielded 64218080 words. A token is about 4 characters or 0.75 words, so
> for a quick estimate, tokens = 64218080/0.75 = 85624107 tokens. Fine-tuned
> training on this entire set = (85624107/1000)*$0.03 = $2569.
>
> Theorem axtgcgrrflx: 942 words, tokens = 942/0.75 = 1256, cost =
> (1256/1000)*$0.12 = $0.15
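As a quick check, David's back-of-the-envelope arithmetic is easy to reproduce. Here is a small Python sketch using the word counts and the Davinci prices quoted above (the prices may of course have changed since):

WORDS_PER_TOKEN = 0.75       # OpenAI's rule of thumb: 1 token ~= 0.75 words
TRAIN_PRICE_PER_1K = 0.03    # $ per 1K tokens, Davinci fine-tuned training
USAGE_PRICE_PER_1K = 0.12    # $ per 1K tokens, Davinci usage

def tokens(words):
    return words / WORDS_PER_TOKEN

setmm_words = 64_218_080     # word count of the `show proof` dump above
print(f"training: ${tokens(setmm_words) / 1000 * TRAIN_PRICE_PER_1K:,.0f}")
# -> training: $2,569

axtgcgrrflx_words = 942
print(f"inference: ${tokens(axtgcgrrflx_words) / 1000 * USAGE_PRICE_PER_1K:.2f}")
# -> inference: $0.15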
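And for the cheaper GPT-2 route, a fine-tuning run with the HuggingFace transformers library (the kind of thing the Kaggle notebooks linked above walk through) looks roughly like the sketch below. The training file name and the hyperparameters are placeholders; you would first have to dump the set.mm proofs to plain text yourself:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# "setmm_proofs.txt" is a placeholder: a plain-text dump of proofs, e.g.
# produced with `show proof * /lemmon /renumber` as in the estimate above.
dataset = load_dataset("text", data_files={"train": "setmm_proofs.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-setmm", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()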
