> On Feb 14, 2023, at 1:46 PM, Jon P <[email protected]> wrote:
> 
> I think that's a great idea David! Seems like very low downside and it would 
> be interesting to see if language models could train on the datasets. 
> 
> There's already an option to Customise GPT3 with additional datasets so once 
> this is ready it could be an interesting experiment to try.
> 
> https://openai.com/blog/customized-gpt-3/

Thanks so much for that link. So I looked into creating a specially-trained
model to create Metamath proofs. That page points to this pricing model
for GPT-3:
https://openai.com/api/pricing/#faq-token

I did a quick estimate of training costs for the top GPT-3 model on
all of set.mm. It comes out to about $2600.
That's less than I expected, and it's easily affordable in a grant
(you could train multiple times), but it's not an amount I have pocket
change for :-(. Inference isn't expensive; trying to generate axtgcgrrflx
is $0.15 (15 cents). See below for how I created the quick estimate.
We could do a subset of set.mm (e.g., skip mathboxes, etc.), though fewer
training examples would typically produce weaker results.
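
For concreteness, OpenAI's fine-tuning endpoint takes a JSONL file of
prompt/completion pairs. Here's a rough, untested sketch of how I'd generate
one from set.mm; the statement-to-proof framing, file names, and separators
are just my assumptions about how to frame the task, not something I've run:

  import json

  # pairs: iterable of (statement_text, proof_text) tuples extracted from the
  # 'show statement' / 'show proof' output; how best to split them is an open question.
  def write_finetune_file(pairs, path="setmm_finetune.jsonl"):
      with open(path, "w") as out:
          for statement, proof in pairs:
              # prompt/completion is the JSONL format OpenAI's fine-tuning docs describe;
              # the separator and " END" stop marker follow their general guidance.
              record = {"prompt": statement + "\n\nProof:\n",
                        "completion": " " + proof + " END"}
              out.write(json.dumps(record) + "\n")

  # Toy example with one made-up pair; a real run would loop over every theorem.
  write_finetune_file([("axtgcgrrflx $p |- ...", "lemmon-style proof steps ...")])

The resulting file would then be uploaded with OpenAI's fine-tuning tooling.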

I can't possibly afford that myself. That said, there are alternatives.
There are many other Generative Pre-trained Transformer (GPT) models that
are less capable but also less expensive. My current personal computer is
useless for ML training, but Kaggle.com provides a nice Jupyter notebook
interface for AI/ML work that's free to use. Here are some pages that
discuss using GPT-2 and Kaggle:
https://www.kaggle.com/code/changyeop/how-to-fine-tune-gpt-2-for-beginners
https://www.kaggle.com/code/nulldata/fine-tuning-gpt-2-to-generate-netlfix-descriptions/notebook
https://www.kaggle.com/general/116387
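
If we went the GPT-2 route, the fine-tuning itself is fairly standard
Hugging Face boilerplate. Here's a minimal sketch of what a Kaggle notebook
cell might look like, assuming the statements/proofs have already been dumped
to a plain-text file (the file name and hyperparameters are placeholders I
haven't tuned or tested):

  from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                            DataCollatorForLanguageModeling, Trainer, TrainingArguments)

  tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
  model = GPT2LMHeadModel.from_pretrained("gpt2")

  # Assumes the statements/proofs were dumped to one big plain-text file.
  train_dataset = TextDataset(tokenizer=tokenizer,
                              file_path="setmm_proofs.txt", block_size=512)
  collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

  args = TrainingArguments(output_dir="gpt2-setmm", num_train_epochs=1,
                           per_device_train_batch_size=2, save_steps=5000)
  Trainer(model=model, args=args, data_collator=collator,
          train_dataset=train_dataset).train()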

I had also been looking at adding definition checks to metamath-knife, and
maybe trying to help this new JavaScript Metamath proof assistant.
Sigh, not enough time in the day :-).

--- David A. Wheeler



=== Cost estimate ===

Here's a cost estimate using https://openai.com/api/pricing/

OpenAI's most powerful GPT-3 model is Davinci. Their charges are:
* Fine-tuned training: $0.0300 / 1K tokens,
* Usage: $0.1200 / 1K tokens.

I did a quick estimate of tokens in the entirety of set.mm using:

  metamath 'set scroll continuous' 'set width 9999' 'read set.mm' \
    'show statement *' 'show proof * /lemmon /renumber' 'quit' | \
    grep -v -- '-------' | wc
That yielded 64218080 words. A token is about 4 characters or 0.75 words, so
for a quick estimate, tokens = 64218080/0.75 = 85624107 tokens.
Fine-tuned training of this entire set = (85624107/1000)*$0.03 = $2569.

Theorem axtgcgrrflx: 942 words, tokens = 942/0.75 = 1256,
cost = (1256/1000)*$0.12 = $0.15
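
Here's the same arithmetic as a few lines of Python, in case anyone wants to
rerun it with different prices or word counts (keeping in mind that the
0.75 words/token rule is itself just OpenAI's rough approximation):

  WORDS_PER_TOKEN = 0.75          # OpenAI's rule of thumb: 1 token ~ 0.75 words
  TRAIN_PRICE_PER_1K = 0.03       # Davinci fine-tuned training, $ per 1K tokens
  USAGE_PRICE_PER_1K = 0.12       # Davinci fine-tuned usage, $ per 1K tokens

  def tokens(words):
      return words / WORDS_PER_TOKEN

  print(round(tokens(64218080)))                              # ~85624107 tokens in set.mm
  print(round(tokens(64218080) / 1000 * TRAIN_PRICE_PER_1K))  # ~$2569 to fine-tune
  print(round(tokens(942) / 1000 * USAGE_PRICE_PER_1K, 2))    # ~$0.15 for axtgcgrrflx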
