https://youtube.com/clip/UgkxKIfQA8UpbuzBp2ahwAxXkIPxqJtGGRRn?si=hXSAi7XfM9lbg0nf
One way of viewing this is simply that by introducing noise into gradient descent one can avoid local minima. Another way of viewing it is that so-called "misinformation" can teach critical thinking, so long as there is enough countervailing information to overcome the misinformation.

This latter view is associated with the *specious* view that "bias" about what IS the case in language models arises from "bias" in the population that generated the *randomly* curated training data, because the population's "bias" is *not* random. This view is specious for the same reason that science is not democratic: science engages in critical thinking that takes into account the broadest range of data practical in order to find a consistent, canonical body of knowledge, a world model of maximum parsimony.

The fact that a population may be "voting", in a sense, for misinformation becomes merely more phenomena to be modeled as data. The model of that population models the bias itself as knowledge. This is why lossless compression of Wikipedia will distill not only knowledge about the topics of the articles but also the biases of the editors, since modeling those biases lets the compressor represent the corpus more parsimoniously.

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Ta0524e111294e22e-M31a197f019fde93c9c91b542
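The noise-avoids-local-minima framing can be sketched numerically. The double-well loss, step size, and noise schedule below are illustrative assumptions, not anything from a particular paper: plain gradient descent started in the shallower basin stays trapped there, while the same update with annealed Gaussian noise (a Langevin-style perturbation) can hop the barrier into the deeper basin.

```python
import numpy as np

# Toy 1-D loss with a shallow local minimum near x = +0.96 and a
# deeper global minimum near x = -1.04 (an illustrative double well).
def f(x):
    return (x**2 - 1.0)**2 + 0.3 * x

def grad(x):
    return 4.0 * x * (x**2 - 1.0) + 0.3

def noisy_gd(x0, lr=0.05, sigma=0.5, steps=2000, seed=0):
    """Gradient descent with annealed Gaussian noise.

    sigma=0 recovers plain (deterministic) gradient descent. With
    sigma > 0 the iterate can jump the barrier at x = 0 and find the
    deeper basin. Returns the best (lowest-loss) point visited.
    """
    rng = np.random.default_rng(seed)
    x = x0
    best_x = x0
    for t in range(steps):
        temp = sigma / (1.0 + 0.01 * t)  # decay the noise over time
        x = x - lr * grad(x) + temp * np.sqrt(lr) * rng.standard_normal()
        if f(x) < f(best_x):
            best_x = x
    return best_x

if __name__ == "__main__":
    stuck = noisy_gd(1.0, sigma=0.0)   # plain GD: trapped near x = +0.96
    shaken = noisy_gd(1.0, sigma=0.5)  # noisy GD: can escape the basin
    print(f"plain GD:  x={stuck:.3f}  f={f(stuck):.3f}")
    print(f"noisy GD:  x={shaken:.3f}  f={f(shaken):.3f}")
```

With zero noise the run is fully deterministic and converges to the nearby local minimum; the noisy run is the stochastic analogue of training on a messier, "noisier" information diet.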
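The compression point can be made concrete with ideal (Shannon) code lengths. The tiny three-symbol corpus below is a made-up stand-in for Wikipedia: a model that captures the corpus's skew (its "bias") assigns it strictly fewer bits than a model that ignores the skew.

```python
import math
from collections import Counter

def code_length_bits(corpus, model):
    """Ideal code length of `corpus` under probability `model`: sum of -log2 p."""
    return sum(-math.log2(model[sym]) for sym in corpus)

# A deliberately skewed corpus standing in for a biased set of articles.
corpus = "aaaaaaabbc" * 100

# Uniform model: pretends every symbol is equally likely (ignores the bias).
alphabet = sorted(set(corpus))
uniform = {sym: 1.0 / len(alphabet) for sym in alphabet}

# Empirical model: learns the skew from the corpus itself (models the bias).
counts = Counter(corpus)
empirical = {sym: n / len(corpus) for sym, n in counts.items()}

print(f"uniform model:   {code_length_bits(corpus, uniform):.1f} bits")
print(f"empirical model: {code_length_bits(corpus, empirical):.1f} bits")
```

By Gibbs' inequality the empirical model's code length is minimal among fixed symbol models, so any lossless compressor that approaches it has, in effect, learned the skew: the bias is absorbed into the model as knowledge, and the saving in bits is exactly the parsimony gained.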