What is canonical human knowledge? You may recall my suggesting the
Wikipedia change log as the Hutter Prize corpus, which would have proven
impractical in 2005 (as Marcus pointed out, and as you may have as well).
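The model-selection idea behind the Hutter Prize, operationally, is two-part (MDL) coding: prefer the model that minimizes model-description bits plus residual bits. A minimal sketch, assuming a crude universal integer code and toy near-linear data (both of which are this note's inventions, not the Prize's actual scoring):

```python
import math

def residual_bits(residuals):
    # Crude universal code for signed integers: ~2*log2(|r|+1)+2 bits
    # each (an assumption of this sketch, not a standard from the thread).
    return sum(2 * math.log2(abs(r) + 1) + 2 for r in residuals)

def two_part_cost(data, params, predict):
    # Total description length = bits to state the parameters
    # plus bits to encode what the model failed to predict.
    model_bits = residual_bits(params)
    resid = [y - round(predict(i)) for i, y in enumerate(data)]
    return model_bits + residual_bits(resid)

# Toy data: a linear trend with a small periodic perturbation.
data = [3 * i + 5 + (1 if i % 4 == 0 else 0) for i in range(50)]

# Model A: a single constant (the rounded mean).
mean = round(sum(data) / len(data))
cost_const = two_part_cost(data, [mean], lambda i: mean)

# Model B: an integer-parameter least-squares line.
n = len(data)
sx, sy = sum(range(n)), sum(data)
sxx = sum(i * i for i in range(n))
sxy = sum(i * y for i, y in enumerate(data))
slope = round((n * sxy - sx * sy) / (n * sxx - sx * sx))
icept = round((sy - slope * sx) / n)
cost_lin = two_part_cost(data, [slope, icept], lambda i: slope * i + icept)

assert cost_lin < cost_const  # the line compresses the data better
```

Here the linear model's two integer parameters pay for themselves many times over in residual savings; on data that is mostly noise they would not, which is the "tiny signal in a sea of noise" concern raised later in the thread.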
That would have been to assist in forensic epistemology: discovering not
only the rampant generators of bias that were becoming obvious in
Wikipedia back in 2005, but the morphisms between various "schools of
thought" -- a Rosetta Stone of Human Knowledge (which was also a reason I
suggested including the other language versions of Wikipedia).

Canonical human knowledge would, of course, include identities latent in
the data as generators of bias. But it would also include scientific
discovery, which sometimes arises when people get past inappropriate use
of symbols in technical languages, many of which are represented in
Wikipedia. See https://spasim.org/docs/leibniz_quine_etter_identity.html
for something Tom Etter* and I were working toward when we were both
basically booted from Silicon Valley by H-1B fraud. Language isn't "just"
language.

*I didn't know until years after Tom's death that he and Solomonoff were
friends and had arrived early at the Dartmouth Summer of AI together. I
didn't even hear about Solomonoff or Algorithmic Information Theory and
related concepts from Tom. I hired Tom to do this kind of work only
because I saw a problem with mathematical foundations going back to my
work at VIEWTRON, where I was the future's architect responsible for
establishing what might have been _the_ computer network protocol we all
live with today. That's why I went to work at HP's "Internet Chapter 2"
project despite their story of what they were doing making little sense
to me.

On Wed, Nov 26, 2025 at 10:35 AM Matt Mahoney <[email protected]>
wrote:

> Besides language model evaluation, what are some examples of questions
> you want to answer using lossless data compression?
>
> -- Matt Mahoney, [email protected]
>
> On Tue, Nov 25, 2025, 10:39 PM James Bowery <[email protected]> wrote:
>
>> On Mon, Nov 24, 2025 at 9:05 AM Matt Mahoney <[email protected]>
>> wrote:
>>
>>> ...
>>> Which raises the even bigger problem that, as you mentioned,
>>> motivation, ego, and money drive science. Scientists who should know
>>> better still want to prove themselves right...
>>
>> This holds also for scientists who want to prove that it is hopeless
>> to hold them to account with an objective model selection criterion.
>>
>> Not only is that motivation enormous, it requires almost no motivation
>> at all, since those in power can't be held to account by those without
>> power -- so, even if they are so foolish as to engage the powerless in
>> argument, they can make BS arguments and respond to any
>> counter-argument with more BS. This is being automated with LLMs on a
>> mass scale now that Turing's BS test has been passed.
>>
>>> Suppose you want to answer the question of whether covid-19 vaccines
>>> are safe and effective...
>>
>> That's not what large models are for. Large models either answer an
>> enormous range of questions effectively because they have an effective
>> world model, or they are narrow, pre-programmed small models that do
>> simulations based on human expert specifications -- merely encoding
>> prior expert knowledge in simulation algorithms.
>>
>> The data set is huge.
>>
>> As I said, there is a huge difference between the data that go into a
>> climate model and the data that go into macrosocial psychology models
>> such as those upon which you base your argument in the OP.
>>
>>> ...Do you trust the US CDC? Do you trust the Chinese CDC? Do you
>>> trust Turkmenistan, the only country to report zero cases throughout
>>> the pandemic? Who gets to decide which data to include?
>>
>> Data and models are in different categories; therefore data selection
>> criteria and model selection criteria are in different categories. I
>> addressed this in the README at
>> https://github.com/jabowery/HumesGuillotine
>>
>>> How do you convince people who believe that the moon landing was
>>> fake?
>>
>> You don't.
>> What you do is convince decisionmakers to take information criteria
>> for model selection seriously enough to apply algorithmic information
>> theory.
>>
>> As to the uncomputability of proving one has found the best possible
>> scientific model for a given dataset leading to a potentially
>> bottomless pit of resources being poured down the science rat hole:
>> precisely! That's why funding authorities need criteria that hold
>> those receiving the science funding objectively accountable, and in
>> such a manner that they don't have to worry about leaked evaluation
>> datasets.
>>
>>> -- Matt Mahoney, [email protected]
>>>
>>> On Sun, Nov 23, 2025, 10:30 AM James Bowery <[email protected]>
>>> wrote:
>>>
>>>> There are, of course, an infinite number of "arguments" one can
>>>> come up with to expand what Nick Szabo calls the "Argument Surface",
>>>> and that is where the real "problem for statistics about people"
>>>> arises -- not in the choice-of-language ambiguity. People who are
>>>> not motivated to get rid of motivated reasoning will not be
>>>> motivated to solve problems like the choice-of-language ambiguity --
>>>> as just one example of many. I will grant, however, that that
>>>> particular redoubt is only for the elect who, like you and me, have
>>>> been involved with judging the Hutter Prize. IIRC, even Shane Legg
>>>> sets forth that argument as a reason to avoid the Algorithmic
>>>> Information Criterion -- and you can't get much more authoritative
>>>> than that unless you go to Hutter himself or, in the hypothetical
>>>> case, Solomonoff. I did express concern to Marcus at one time, when
>>>> Solomonoff was still living and shortly after the Hutter Prize had
>>>> been announced, that Solomonoff might "torpedo" the Hutter Prize
>>>> with that argument (if I recall the exact wording). Marcus reassured
>>>> me that Solomonoff would do no such thing. IIRC, shortly thereafter
>>>> Solomonoff posted something like that argument to his blog.
>>>> IIRC, Marcus objected to using the ALIC for global warming, despite
>>>> the Biden administration setting the value of addressing that issue
>>>> at around $10T/year -- and I can see merit in that objection given
>>>> the scale of the data.
>>>>
>>>> But it all comes down to "incentives" when we are addressing the
>>>> "motivated reasoning" problem, and that's why I posted my
>>>> Congressional testimony about the "incentives" regarding rocket
>>>> technology -- which you commented on but did not seem to get the
>>>> point I was trying to make about incentives.
>>>>
>>>> Once we're in the realm of macrosocial psychological dynamical
>>>> models, the incentives are so great as to beggar the imagination.
>>>> This is far greater even than Biden's rNPV of $10T/year, and the
>>>> macrosocial psychology data is many orders of magnitude smaller than
>>>> climate data. That said, there is room for your concern about choice
>>>> of language, in conjunction with the identification of "noise"
>>>> regarding which, as I've often pointed out: "one man's noise is
>>>> another man's cyphertext".
>>>>
>>>> So we have two "argument surfaces" here:
>>>>
>>>> How much of the macrosocial dataset is "*noise*", as opposed to
>>>> inadequately motivated forensic epistemology "decyphering" that
>>>> noise?
>>>>
>>>> How much of the wiggle room for *choice of language* can be squeezed
>>>> out by forensic epistemology motivated by an rNPV of $10T/year, ie:
>>>> well in excess of $100T, with, let's say, only 1% of that amount
>>>> going to ALIC research: >$1T?
>>>>
>>>> First of all, recognize that the exploit you regard as decisive is
>>>> minuscule compared to the argument surface presently not only
>>>> tolerated but exploited by the academy, think tanks, and punditry.
>>>> At present there is virtually nothing BUT macrosocial psychological
>>>> "argument surface", e.g.
>>>> arguments such as the one to which you appealed for normative
>>>> alignment of young men to be optimistic lest their pessimism be a
>>>> self-fulfilling prophecy.
>>>>
>>>> Secondly, forensic epistemology is precisely about *presuming*
>>>> criminal behavior such as that to which you appeal as a reason for
>>>> despair. With >$1T at stake there will be enormous motivation to
>>>> suss out issues regarding "language choice", and I can easily
>>>> demonstrate that none of the existing authorities have been
>>>> sufficiently motivated to reduce that aspect of the argument
>>>> surface:
>>>>
>>>> As I've pointed out before, not only is there an entirely different
>>>> theoretical basis for addressing that reason (really, excuse) to
>>>> support avoidance of scientific accountability by our policy makers
>>>> (ie: NiNOR Complexity), but there are obvious, at-hand techniques to
>>>> reduce that argument surface. For example, a GPU provides an
>>>> "instruction set", ie a "language", that is radically different from
>>>> a CPU's. So are we now to throw up our hands in despair and let
>>>> those in power get away with "Well, gee, who could have KNOWN???"
>>>> when things don't go "according to projections"? Really? Why am I
>>>> the ONLY person to have addressed the *obvious* fact that a GPU's
>>>> "instruction set" is describable as a relatively tiny procedure in a
>>>> canonical instruction set, and that that procedure's algorithmic
>>>> length should be used?
>>>>
>>>> Could it be that, perhaps, I'm the only sufficiently MOTIVATED
>>>> person among those who have been taking information criteria
>>>> remotely seriously?
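The instruction-set point above is an instance of the invariance theorem of algorithmic information theory, stated here for reference in standard notation (the constant names are this note's, not the thread's):

```latex
% Invariance theorem: for universal machines U and V and any string x,
K_U(x) \le K_V(x) + c_{U,V}
% where c_{U,V} is the length of a U-program that emulates V -- e.g. the
% algorithmic length of a GPU-instruction-set interpreter expressed in a
% canonical (CPU) instruction set. Crucially, c_{U,V} does not depend on
% the data x, so the choice of "language" shifts measured complexity by
% at most a bounded constant.
```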
>>>>
>>>> On Thu, Nov 20, 2025 at 5:27 PM Matt Mahoney <[email protected]>
>>>> wrote:
>>>>
>>>>> On Thu, Nov 20, 2025, 10:11 AM James Bowery <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> On Wed, Nov 19, 2025 at 11:19 AM Matt Mahoney <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Algorithmic information or compression is great for evaluating
>>>>>>> language models but not for everything....
>>>>>>>
>>>>>>> I could try compressing world population data by fitting it to a
>>>>>>> polynomial,
>>>>>>
>>>>>> Do you understand the difference between statistics and dynamics?
>>>>>
>>>>> No, it's the difference between compressing text and compressing
>>>>> video. You can't accurately measure the compression of a tiny
>>>>> signal in a sea of noise.
>>>>>
>>>>> This becomes a problem for statistics about people. It only takes a
>>>>> few bits of Kolmogorov complexity for social scientists to
>>>>> construct models that favor one group over another, and those bits
>>>>> can be hidden in the choice of language ambiguity.
>>>>>
>>>>> I think it would be great if we could answer political questions
>>>>> objectively. So how would you solve the problem?
>>>>>
>>>>>> <https://agi.topicbox.com/groups/agi/T504adacb23f3c455-Md49fd5f054dbc9f5d8062388>
>>>>>
>>>>> -- Matt Mahoney, [email protected]
>>>>
> *Artificial General Intelligence List <https://agi.topicbox.com/latest>*
> / AGI / see discussions <https://agi.topicbox.com/groups/agi> +
> participants <https://agi.topicbox.com/groups/agi/members> +
> delivery options <https://agi.topicbox.com/groups/agi/subscription>
> Permalink
> <https://agi.topicbox.com/groups/agi/T504adacb23f3c455-Me4956f20775449907a85e761>

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T504adacb23f3c455-M89c00dac545766029563b68a
Delivery options: https://agi.topicbox.com/groups/agi/subscription
