On Wed, Nov 26, 2025 at 5:50 PM Matt Mahoney <[email protected]>
wrote:

> ... GB is as much as a human could read over a lifetime, and therefore
> should be enough to train a language model to human level. Of course LLMs
> train on much larger sets, which makes them far more knowledgeable than any
> human.
>

Your benchmark, using only the 1 GB, 20-year-old snapshot of Wikipedia,
would go a long way toward accomplishing that if incentivized under the
Genesis Mission.  By my estimate, $100M would be the minimum underwriting
required.

...
>
> Besides AI, what questions could you answer by a compression contest and
> what data would you use?
>

Don't conflate the lossless compression prize with what people are thinking
of as "AI".  It is forensic epistemology.  Human intervention is essential,
at least at this stage.  You aren't going to get a 120-bit Lagrangian of
the Big Bang localized to the context of Wikipedia's generation circa
2005.  The point is to get people to stop their damn yammering at each
other while we drift toward a rhyme with the Thirty Years' War over our
beliefs, and start operationalizing what they're saying -- and not just
about "the world" devoid of human actors subverting human knowledge with
their "edits".

> I did some work on your Laboratory of the Counties a couple of years ago.
> Have you made any discoveries from this data?
>

That was originally my attempt to get Charlie Smith (Tukey's student; ask
Hinton about his funding of the second connectionist summer) to use his
connections to reform the social sciences.  I had finally, after 20 years,
gotten him to understand that algorithmic information could be a superior
information criterion to all others.  But this was in the midst of the
Trump upset of 2016, and Charlie was, and still is, solidly on "Team Blue".
I, being solidly NOT on "Team Blue", started to lose traction despite the
fact that I was trying to find a neutral ground between Team Red and Team
Blue based on the THIRD connectionist summer -- and thereby avert a Thirty
Years' War.

The biggest part of the problem I've found is that people on Team Blue,
despite all their hair-on-fire hysterics about Team Red pulling out their
guns and Bibles any minute to round up anyone who isn't an Aryan Superman
Hitler Wet Dream, are, when it comes right down to it, certain that Team
Red will not resort to violence to protect individual moral agency against
The Unfriendly AGI known as The Global Economy.  They are certain, as
apparently are you, that Everything Is Under Control (as Robert Anton
Wilson wryly titled his critique).  So I'd really like to wake you guys
up, if that is at all possible.  Killing 25% of the population just so
people can feel like they have some control over their local communities
should not be necessary.

As far as my own work on that dataset goes, yes, I've made some
discoveries, but they're mainly about how to properly estimate the
algorithmic information of a model vs. residual errors, based on
instrument precision and systematic tracking of the Jacobians into the
compressed representation and back out to its reconstruction.

Maybe the most important discovery is that my intuition that one could
extract dynamics from that dataset, despite its being primarily spatial in
nature, has now been vindicated by bioinformatics in the form of virtual
time used to infer cellular development trajectories with differential
equations.  I have a model of county dynamics that uses information
geometry to discover the state space and then backs that out to impute the
>90% missing data in the full time-series panel, so that I can then
discover the differential equations.  I haven't released that code yet,
but it is actually something the social sciences haven't gotten around to
doing yet.
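For concreteness, here is a generic sketch of what the final step -- "discover the differential equations" -- can mean in practice: sparse regression of numerically estimated derivatives onto a small candidate library (SINDy-style).  Everything below, the function names and the logistic toy system included, is illustrative, not the unreleased county code:

```python
import numpy as np

def discover_ode(t, x, threshold=0.05):
    """SINDy-style sketch: fit dx/dt to a candidate library, prune small terms."""
    dxdt = np.gradient(x, t)                               # numerical derivative
    library = np.column_stack([np.ones_like(x), x, x**2])  # candidates: 1, x, x^2
    coef, *_ = np.linalg.lstsq(library, dxdt, rcond=None)
    coef[np.abs(coef) < threshold] = 0.0                   # sparsify
    return dict(zip(["1", "x", "x^2"], coef))

# Toy check: sample the solution of logistic growth, dx/dt = x - x^2,
# and see whether the regression recovers exactly those two terms.
t = np.linspace(0.0, 5.0, 500)
x = 1.0 / (1.0 + 9.0 * np.exp(-t))  # logistic solution with x(0) = 0.1
model = discover_ode(t, x)
```

On clean data the regression recovers coefficients near +1 for x and -1 for x^2 and prunes the rest; the hard part, of course, is getting from a >90%-missing panel to a state space clean enough for this step to work.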

>
> -- Matt Mahoney, [email protected]
>
> On Wed, Nov 26, 2025, 4:24 PM James Bowery <[email protected]> wrote:
>
>> What is canonical human knowledge?
>>
>> You may recall me suggesting the Wikipedia change log as the Hutter Prize
>> corpus which would have proven impractical in 2005 (as Marcus pointed out
>> and as you may have as well).
>>
>> That would have been to assist in forensic epistemology:
>>
>> Discovering not only the rampant generators of bias that were becoming
>> obvious in Wikipedia back in 2005, but also the morphisms between various
>> "schools of thought":  A Rosetta Stone of Human Knowledge (which was also a
>> reason I suggested including the other language versions of Wikipedia).
>>
>> Canonical human knowledge would include, of course, identities latent in
>> the data as generators of bias.  But it would also include scientific
>> discovery which sometimes arises when people get past inappropriate use of
>> symbols in technical languages, many of which are represented in Wikipedia.
>>
>> See https://spasim.org/docs/leibniz_quine_etter_identity.html for
>> something Tom Etter* and I were working toward when we were both basically
>> booted from Silicon Valley by H-1b fraud.  Language isn't "just" language.
>>
>> *I didn't know until years after Tom's death that he and Solomonoff were
>> friends and arrived early at the Dartmouth Summer of AI together.  I didn't
>> even hear about Solomonoff or Algorithmic Information Theory and related
>> concepts from Tom.  I only hired Tom to do this kind of work because I saw a
>> problem with mathematical foundations going back to my work at VIEWTRON
>> where I was the future's architect responsible for establishing what might
>> have been _the_ computer network protocol we all live with today.  That's
>> why I went to work at HP's "Internet Chapter 2" project despite their story
>> of what they were doing making little sense to me.
>>
>> On Wed, Nov 26, 2025 at 10:35 AM Matt Mahoney <[email protected]>
>> wrote:
>>
>>> Besides language model evaluation, what are some examples of questions
>>> you want to answer using lossless data compression?
>>>
>>> -- Matt Mahoney, [email protected]
>>>
>>> On Tue, Nov 25, 2025, 10:39 PM James Bowery <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Nov 24, 2025 at 9:05 AM Matt Mahoney <[email protected]>
>>>> wrote:
>>>>
>>>>> ...
>>>>> Which raises the even bigger problem that as you mentioned,
>>>>> motivation, ego, and money drive science. Scientists who should know 
>>>>> better
>>>>> still want to prove themselves right...
>>>>>
>>>>
>>>> This holds also for scientists who want to prove that it is hopeless to
>>>> hold them to account with an objective model selection criterion.
>>>>
>>>> Not only is that motivation enormous, it requires almost no motivation
>>>> at all since those in power can't be held to account by those without power
>>>> -- so, even if they are so foolish as to engage the powerless in argument,
>>>> they can make BS arguments and respond to any counter-argument with more BS.
>>>> This is being automated with LLMs on a mass scale now that Turing's BS test
>>>> has been passed.
>>>>
>>>>
>>>>> Suppose you want to answer the question of whether covid-19 vaccines
>>>>> are safe and effective...
>>>>>
>>>>
>>>> That's not what large models are for.  Models either answer an enormous
>>>> range of questions effectively because they have an effective world
>>>> model, or they are narrow, pre-programmed small models that do
>>>> simulations based on human expert specifications, merely encoding prior
>>>> expert knowledge in simulation algorithms.
>>>>
>>>> The data set is huge.
>>>>
>>>> As I said, there is a huge difference between the data that go into
>>>> a climate model and the data that go into macrosocial psychology models such
>>>> as those upon which you base your argument in the OP.
>>>>
>>>>
>>>>> ...Do you trust the US CDC? Do you trust the Chinese CDC? Do you
>>>>> trust Turkmenistan, the only country to report zero cases throughout the
>>>>> pandemic? Who gets to decide which data to include?
>>>>>
>>>>
>>>> Data and models are in different categories therefore data selection
>>>> criteria and model selection criteria are in different categories.  I
>>>> addressed this in the README at
>>>> https://github.com/jabowery/HumesGuillotine
>>>>
>>>>
>>>>> How do you convince people who believe that the moon landing was fake?
>>>>>
>>>>
>>>> You don't.  What you do is convince decisionmakers to take information
>>>> criteria for model selection seriously enough to apply algorithmic
>>>> information theory.
>>>>
>>>> As to the uncomputability of proving one has found the best possible
>>>> scientific model for a given dataset leading to a potentially bottomless
>>>> pit of resources being poured down the science rat hole:  Precisely!
>>>> That's why funding authorities need criteria that hold those receiving the
>>>> science funding objectively accountable and in such a manner that they
>>>> don't have to worry about leaked evaluation datasets.
>>>>
>>>> -- Matt Mahoney, [email protected]
>>>>>
>>>>> On Sun, Nov 23, 2025, 10:30 AM James Bowery <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> There are, of course, an infinite number of "arguments" one can come
>>>>>> up with to expand what Nick Szabo calls the "Argument Surface" and that 
>>>>>> is
>>>>>> where the real "problem for statistics about people" arises -- not in the
>>>>>> choice of language ambiguity.  People who are not motivated to get rid of
>>>>>> motivated reasoning will not be motivated to solve problems like the 
>>>>>> choice
>>>>>> of language ambiguity -- as just one example of many.  I will grant,
>>>>>> however, that particular redoubt is only for the elect who, like you and 
>>>>>> me,
>>>>>> have been involved with judging the Hutter Prize.  IIRC, even Shane Legg
>>>>>> sets forth that argument as a reason to avoid the Algorithmic Information
>>>>>> Criterion -- and you can't get much more authoritative than that unless 
>>>>>> you
>>>>>> go to Hutter himself or, in the hypothetical case, Solomonoff.  I did
>>>>>> express concern to Marcus at one time, when Solomonoff was still living 
>>>>>> and
>>>>>> shortly after the Hutter Prize had been announced, that Solomonoff might
>>>>>> "torpedo" the Hutter Prize with that argument (if I recall the exact
>>>>>> wording).  Marcus reassured me that Solomonoff would do no such thing.
>>>>>> IIRC, shortly thereafter Solomonoff posted something like that argument to
>>>>>> his
>>>>>> blog.  IIRC Marcus objected to using the ALIC for global warming despite
>>>>>> the Biden administration setting the value of addressing that issue at
>>>>>> around $10T/year -- and I can see merit in that objection given the scale
>>>>>> of the data.
>>>>>>
>>>>>> But it all comes down to "incentives" when we are addressing the
>>>>>> "motivated reasoning" problem and that's why I posted my Congressional
>>>>>> testimony about the "incentives" regarding rocket technology -- which you
>>>>>> commented on but did not seem to get the point I was trying to make about
>>>>>> incentives.
>>>>>>
>>>>>> Once we're in the realm of macrosocial psychological dynamical
>>>>>> models, the incentives are so great as to beggar the imagination.  This 
>>>>>> is
>>>>>> far greater even than Biden's rNPV of $10T/year and the macrosocial
>>>>>> psychology data is many orders of magnitude smaller than climate data.
>>>>>> That said, there is room for your concern about choice of language in
>>>>>> conjunction with the identification "noise" regarding which, as I've 
>>>>>> often
>>>>>> pointed out:  "one man's noise is another man's cyphertext".
>>>>>>
>>>>>> So we have two "argument surfaces" here:
>>>>>>
>>>>>> How much of the macrosocial dataset is "*noise*" as opposed to
>>>>>> inadequately motivated forensic epistemology "decyphering" that noise?
>>>>>>
>>>>>> How much of the wiggle room for *choice of language *can be squeezed
>>>>>> out by forensic epistemology motivated by an rNPV of $10T/year, ie: well 
>>>>>> in
>>>>>> excess of $100T, with let's say only 1% of that amount going to ALIC
>>>>>> research: >$1T?
>>>>>>
>>>>>> First of all, recognize that the exploit you regard as decisive
>>>>>> is minuscule compared to the argument surface presently not only
>>>>>> tolerated
>>>>>> but exploited by the academy, think tanks and punditry.  At present there
>>>>>> is virtually nothing BUT macrosocial psychological "argument surface", 
>>>>>> e.g.
>>>>>> arguments such as the one to which you appealed for normative alignment 
>>>>>> of
>>>>>> young men to be optimistic lest their pessimism be a self-fulfilling
>>>>>> prophecy.
>>>>>>
>>>>>> Secondly, forensic epistemology is precisely about *presuming* criminal
>>>>>> behavior such as that to which you appeal as a reason for despair.  With
>>>>>> >$1T at stake there will be enormous motivation to suss out issues
>>>>>> regarding "language choice" and I can easily demonstrate that none of the
>>>>>> existing authorities have been sufficiently motivated to reduce that 
>>>>>> aspect
>>>>>> of the argument surface:
>>>>>>
>>>>>> As I've pointed out before, not only is there an entirely different
>>>>>> theoretical basis for addressing that reason (really excuse) to support
>>>>>> avoidance of scientific accountability by our policy makers (ie: NiNOR
>>>>>> Complexity), but there are obvious, at-hand, techniques to reduce that
>>>>>> argument surface.   For example, a GPU provides an "instruction set", ie
>>>>>> "language", that is radically different from a CPU.  So are we to now 
>>>>>> throw
>>>>>> up our hands in despair and let those in power get away with "Well gee 
>>>>>> who
>>>>>> could have KNOWN???" when things don't go "according to projections"?
>>>>>> Really?  Why am I the ONLY person to have addressed the *obvious* fact
>>>>>> that a GPU's "instruction set" is describable as a relatively tiny
>>>>>> procedure in a canonical instruction set and that procedure's algorithmic
>>>>>> length should be used?
>>>>>>
>>>>>> Could it be that, perhaps, I'm the only sufficiently MOTIVATED person
>>>>>> among those who have been taking information criteria remotely seriously?
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 20, 2025 at 5:27 PM Matt Mahoney <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> On Thu, Nov 20, 2025, 10:11 AM James Bowery <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 19, 2025 at 11:19 AM Matt Mahoney <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Algorithmic information or compression is great for evaluating
>>>>>>>>> language models but not for everything....
>>>>>>>>>
>>>>>>>>> I could try compressing world population data by fitting it to a
>>>>>>>>> polynomial,
>>>>>>>>>
>>>>>>>>
>>>>>>>> Do you understand the difference between statistics and dynamics?
>>>>>>>>
>>>>>>>
>>>>>>> No, it's the difference between compressing text and compressing
>>>>>>> video. You can't accurately measure the compression of a tiny signal in 
>>>>>>> a
>>>>>>> sea of noise.
>>>>>>>
>>>>>>> This becomes a problem for statistics about people. It only takes a
>>>>>>> few bits of Kolmogorov complexity for social scientists to construct 
>>>>>>> models
>>>>>>> that favor one group over another, and those bits can be hidden in the
>>>>>>> choice of language ambiguity.
>>>>>>>
>>>>>>> I think it would be great if we could answer political questions
>>>>>>> objectively. So how would you solve the problem?
>>>>>>>
>>>>>>>
>>>>>>> -- Matt Mahoney, [email protected]
>>>>>>>
>>>>>> *Artificial General Intelligence List
> <https://agi.topicbox.com/latest>* / AGI / see discussions
> <https://agi.topicbox.com/groups/agi> + participants
> <https://agi.topicbox.com/groups/agi/members> + delivery options
> <https://agi.topicbox.com/groups/agi/subscription> Permalink
> <https://agi.topicbox.com/groups/agi/T504adacb23f3c455-Mc57014a65cb216787498d88b>
>

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T504adacb23f3c455-Mf6b38c57f3e3c96ae6925b8c
Delivery options: https://agi.topicbox.com/groups/agi/subscription
