Re: Is AI running out of training data?

Alan Grayson Thu, 12 Dec 2024 23:29:42 -0800

On Thursday, December 12, 2024 at 7:38:11 PM UTC-7 Brent Meeker wrote:

Magic is always the explanation of those who can't understand.


Brent


*There's plenty of magic, under a different name, in physics. Another 
pitfall is relegating hidden knowledge, aka occult knowledge, such as the 
Chakras in Yoga, to de facto magic or someone's overactive imagination. AG *

On 12/12/2024 1:39 PM, 'Cosmin Visan' via Everything List wrote:

Magic!

On Thursday, 12 December 2024 at 20:00:58 UTC+2 John Clark wrote:

*The number of "tokens" (words or parts of words) used to train LLMs is 100 
times larger than it was in 2020; the largest models now use tens of 
trillions. If you consider only text, the entire Internet contains only 
about 3,100 trillion tokens. The amount of text LLMs train on is 
doubling every year, but the amount of human-generated text on the Internet 
is growing at only about 10% a year. If that trend continues, AIs will run 
out of text somewhere around 2028. Does that mean AI progress is about to 
hit a wall? I don't think so, for the following reasons:*
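
A rough sketch of the arithmetic behind that kind of projection, for readers who want to vary the inputs. The 15-trillion-token starting size for 2024 is an assumption standing in for "tens of trillions", and the function name is hypothetical; only the 3,100 trillion stock, the yearly doubling, and the 10% growth rate come from the post.

```python
import math

def years_until_exhaustion(train_tokens, stock_tokens,
                           train_growth=2.0, stock_growth=1.10):
    """Years until a training set that grows by train_growth per year
    catches up with a text stock that grows by stock_growth per year."""
    if train_tokens >= stock_tokens:
        return 0.0
    if train_growth <= stock_growth:
        raise ValueError("the training set never catches up with the stock")
    # Solve train * g_t**t == stock * g_s**t for t.
    return math.log(stock_tokens / train_tokens) / math.log(train_growth / stock_growth)

# Assumed starting point (not from the post): 15 trillion tokens in 2024.
years = years_until_exhaustion(15e12, 3100e12)
print(f"crossover after about {years:.1f} years")
```

The result is sensitive to the starting size and to how much of the stock is treated as usable, neither of which the post pins down, so this naive version should not be expected to land exactly on 2028.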

*For one thing, because of improvements in algorithms, the computing power 
needed for a Large Language Model to achieve the same performance has 
halved about every 8 months.*

*ALGORITHMIC PROGRESS IN LANGUAGE MODELS* <https://arxiv.org/pdf/2403.05812>
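
To get a feel for what an 8-month halving time implies, here is a minimal sketch. It assumes the halving rate stays constant, which is a simplification, and the function name is hypothetical.

```python
def compute_reduction(months, halving_months=8.0):
    """Factor by which the compute needed for a fixed level of performance
    shrinks after `months`, assuming a constant halving time."""
    return 2.0 ** (months / halving_months)

for years in (1, 2, 4):
    factor = compute_reduction(12 * years)
    print(f"after {years} year(s): about {factor:.1f}x less compute")
```

Under that assumption, four years corresponds to six halvings, or a 64-fold reduction.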


*And computer chips specialized for AI rather than general computing, like 
those made by Nvidia and other companies, are improving even faster than 
Moore's Law would predict. Also, specialized data sets, such as 
astronomical and biological data, are growing much more quickly than text 
is; that's how AIs got so good at predicting how proteins fold up.*

*And there is vastly more information if AIs are trained on other types of 
data besides text, and some AIs are already being trained on unlabeled 
images and videos. Yann LeCun, chief AI scientist at Meta, said that 
"although the 10^13 tokens used to train a LLM sounds like a lot (it 
would take a human 170,000 years to read that much), a 4-year-old child 
has absorbed a volume of data 50 times greater than that just by looking at 
objects during his waking hours. We’re never going to get to human-level AI 
by just training on language, that’s just not happening".* 
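
The reading-time figure in that quotation can be sanity-checked with a few lines. The words-per-token ratio, reading speed, and hours per day below are assumptions chosen for illustration; the post does not state them, and the function name is hypothetical.

```python
def reading_years(tokens, words_per_token=0.75,
                  words_per_minute=250, hours_per_day=8):
    """Years a person would need to read `tokens` tokens of text
    at a steady pace, under the assumed rates."""
    words = tokens * words_per_token
    words_per_day = words_per_minute * 60 * hours_per_day
    return words / words_per_day / 365.25

print(f"about {reading_years(1e13):,.0f} years")
```

With these assumed rates the result comes out close to the 170,000 years quoted.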

*And then there's synthetic data. AlphaGeometry was trained to solve 
geometry problems using 100 million computer-generated synthetic examples 
with no human demonstrations, and it ended up being as good at solving 
difficult geometry problems as the very best high school students in the 
entire nation.*

*Solving olympiad geometry without human demonstrations* 
<https://www.nature.com/articles/s41586-023-06747-5>

*AI researchers are starting to change their strategy and have their AIs 
reread their training set many times; because AIs operate in a statistical 
way, rereading improves performance.*


*Scaling Data-Constrained Language Models* 
<https://arxiv.org/pdf/2305.16264>


*Andy Zou at Carnegie Mellon University says "once an AI has got a 
foundational knowledge base that’s probably greater than any single person 
could have, it no longer needs more data to get smarter. It just needs to 
sit and think. I think we’re probably pretty close to that point."*

*John K Clark    See what's on my new list at  Extropolis 
<https://groups.google.com/g/extropolis>*

