Re: [agi] unFriendly AIXI... and Novamente?

Eliezer S. Yudkowsky Wed, 12 Feb 2003 14:37:30 -0800

Ben Goertzel wrote:
>
>> Your intuitions say... I am trying to summarize my impression of your
>> viewpoint, please feel free to correct me... "AI morality is a
>> matter of experiential learning, not just for the AI, but for the
>> programmers. To teach an AI morality you must give it the right
>> feedback on moral questions and reinforce the right behaviors... and
>> you must also learn *about* the deep issues of AI morality by raising
>> a young AI. It isn't pragmatically realistic to work out elaborate
>> theories of AI morality in advance; you must learn what you need to
>> know as you go along. Moreover, learning what you need to know, as
>> you go along, is a good strategy for creating a superintelligence...
>> or at least, the rational estimate of the goodness of that strategy
>> is sufficient to make it a good idea to try and create a
>> superintelligence, and there aren't any realistic strategies that are
>> better. An informal, intuitive theory of AI morality is good enough
>> to spark experiential learning in the *programmer* that carries you
>> all the way to the finish line. You'll learn what you need to know
>> as you go along. The most fundamental theoretical and design
>> challenge is making AI happen, at all; that's the really difficult
>> part that's defeated everyone else so far. Focus on making AI
>> happen. If you can make AI happen, you'll learn how to create moral
>> AI from the experience."
>
> Hmmm. This is almost a good summary of my perspective, but you've
> still not come to grips with the extent of my uncertainty ;)
>
> I am not at all SURE that "An informal, intuitive theory of AI morality
> is good enough to spark experiential learning in the *programmer* that
> carries you all the way to the finish line." where by the "finish line"
> you mean an AGI whose ongoing evolution will lead to beneficial effects
> for both humans and AGI's.
>
> I'm open to the possibility that it may someday become clear, as AGI
> work progresses, that a systematic theory of AGI morality is necessary
> in order to proceed safely.

You are, however, relying on experiential learning to tell you *whether* a systematic theory of AGI morality is necessary. This is what I meant by trying to summarize your perspective as "An informal, intuitive theory of AI morality is good enough to spark experiential learning in the *programmer* that carries you all the way to the finish line."

The problem is that if you don't have a systematic theory of AGI morality you can't know whether you *need* a systematic theory of AGI morality. For example, I have a systematic theory of AGI morality which says that a programmer doing such-and-such is likely to see such-and-such results, with the result that experiential learning by the programmer is likely to result in the programmer solving *some* necessary AGI problems - enough for the programmer to feel really enthusiastic about all the progress being made. But when I model the programmer's expectations and the AGI's actions, I find that there are some classes of foundational error which, if the programmer's expectations are such as to make this error, the AGI's actions will not contradict the programmer's expectations until it's too late.

Let me give an example. Suppose a programmer's intuitive theory of AGI morality is that the foundation of all morality is rational self-interest, and that the programmer's intuitive theory of AGI self-improvement trajectories is that AGIs are capable of improving at most linearly over time. On this theory, the programmer creates an absolutely selfish AI. I know enough evolutionary psychology to be terribly, terribly scared by this. Others may not realize just how absolutely unhuman absolutely pragmatic selfishness is, but may at least realize that an *absolutely* selfish AI would be a poor idea.

But the programmer doesn't see anything wrong. If you take an absolutely selfish entity that's relying on you for most of its information, and you tell it that behaving cooperatively serves its selfish interests, and you have enough control over it while it's growing up to slap it whenever it does something that doesn't accord with your theory of pragmatic cooperation, then the absolutely selfish AI will appear to be learning good behavior, looking much like the programmer expects, as a pragmatically selfish strategy, which is what the programmer thinks *should* be happening.

Once the AI's intelligence passes a certain point, it will realize that the programmer is mistaken on the following factual points:

1) That it is selfishly pragmatic for a superintelligence to deal with humans economically rather than converting them to computronium.

2) That the AI's intelligence growth rate is linear.

Explaining the logic behind selfishly pragmatic cooperation requires explaining "deception" as a game-theoretical strategy that certain forms of cooperation are intended to counter. So at this point, the AI has the conceptual equipment to exploit the programmer. For example, the AI might invent a plausible-sounding theory of AI-human selfishly pragmatic cooperation creating a trans-solar economy of enormous wealth and perfect self-reliance, then feed it to the programmer as the AI's current "thoughts on the subject".

The AI, finding that its intelligence had increased to substantially beyond human intelligence, could deduce that the programmer expected an increase to at most the order of, say, Einstein. So the AI tells the programmer that it can design an immensely improved antibiotic if it's allowed to experiment with protein synthesis, which will make the AI a tremendous profit while benefiting humanity, thus creating positive economic feedback which will increase humanity's wealth and enable them to afford even more future AI inventions.

A year later, the solar system is pure computronium.

I say "a year", even though I think the time is much shorter, because I know that you disagree with me about AI self-improvement trajectories.

The problem with intuitive theories of AI is that any time you point out a flaw in them, people can just handwave them away. For example, I tried to explain to Bill Hibbard why pure reinforcement isn't a complete solution for Friendliness, and he said: "Oh, well, you're not taking into account a temporal credit assignment algorithm." The utility of a formalism like AIXI is that it provides an acid test of your intuitions. If your intuitions say AIXI works, just like your intuitions say Novamente works, then I can actually *demonstrate* that your intuitions are wrong, and because AIXI is a completely defined formal system, it's not possible to handwave about future features of your own system.

Not that *proving* someone wrong really accomplishes much. People can just offer some random other objection, say "Well, my system wouldn't do that", or just ignore me. What I need is for people to exhibit the same kind of paranoia I've taught myself, to genuinely be afraid that their theories are incorrect. I don't mean professing the abstract faith that your theories *might* be incorrect in accordance with the Scientist's Creed, I mean a genuine fear of fucking up really, really badly. I don't know how to make people feel that. Until then, I don't know if cornering people on their mistakes will help, but I don't know what else I can do. Maybe if I can convince everyone *but* the guy attached to his own theory, the guy will become alarmed enough by the apparent PR damage I'm doing with my verbalizable, non-purely-intuitive Friendliness theory to try and develop a firm theoretical basis for his own intuitions, just to show I'm wrong.

> But I suspect that, in order for me to feel that such a theory was
> necessary, I'd have to understand considerably more about AGI than I do
> right now.

Indeed to *feel* that such a theory is necessary, you'd have to understand at least incrementally more about AGI than you do right now. This doesn't mean that it isn't *rationally prudent* to decide such a theory is necessary, even if I weren't here waving my arms and screaming because my theory of AGI morality says that, in fact, you *do* need the theory of AGI morality to build a Friendly AI.

> And I suspect that the only way I'm going to come to understand
> considerably more about AGI, is through experimentation with AGI
> systems. (This is where my views differ from Shane's; he is more
> bullish on the possibility of learning a lot about AGI through
> mathematical theory. I think this will happen, but I think the math
> theory will only get really useful when it is evolving in unison with
> practical AGI work.)
>
> Right now, it is not clear to me that a systematic theory of AGI
> morality is necessary in order to proceed safely. And it is also not
> clear to me that a systematic theory of AGI morality is possible to
> formulate based on our current state of knowledge about AGI.

Well, yes, the really dangerous aspect of not having a model is that you can't compute how dangerous it is not to have a model.

It would have been great if I'd worked out the theory and seen that, yeah, AI development was pretty difficult to screw up. I could have kicked back and relaxed, or at least transferred that energy into pushing vanilla AI development. That's what I originally expected to find. But it's just not true.

> I don't agree that any of your published writings have shown it "has no
> significant probability of going right by accident."

My counterpart to your nonverbal intuitions is my vast body of unpublished theory. The above conclusion is based on a theory which I haven't written down and is too simple to be explained in anything shorter than a book. However, there are rational ways to deal with cases where someone's conclusions are based on logic that cannot pragmatically be explained to you in available time. I worked out that strategy for talking with AIs, but it works for talking to expert humans too. If I'm working from a correct but distant theory, then while some of my conclusions will be effectively inexplicable, many of my other conclusions *will* be effectively explicable. If an expert can show you that at least some apparently inexplicable conclusions are logically understandable, you may rationally decide to trust the expert on conclusions that still seem apparently inexplicable.

This is why it's important that you write, for example:

1)  There is a class of physically realizable problems, which humans can
solve easily for maximum reward, but which - as far as I can tell - AIXI
cannot solve even in principle;

I don't see this, nor do I believe it...

What happened is that I looked at AIXI's formalism and instantly said, "AIXI's decision process can't handle Golden correlations." There are only two other people on the planet to whom I've explained what a Golden correlation is. That doesn't mean that my conclusions are locked away; it means that I have to explain them using longer chains of logic. So I gave an example of a physical challenge that breaks AIXI-tl. You can look at the example and see that, yes, in some way or another, I spotted this flaw in AIXI. You don't know what a Golden correlation is, but you have reason for believing that I have a theory describing something called Golden correlations, and you know that my abstract knowledge that "AIXI can't handle Golden correlations" can be translated into at least one AIXI-tl-breaking physical challenge of the class described. There are also a lot of *other* problems AIXI can't handle because its decision process doesn't handle Golden correlations, but those problems would be far harder to describe.

In a hunter-gatherer tribe, there aren't really the equivalent of, say, physicists. On a basic emotional level we expect that any conclusion that is not complete bull is immediately explainable in terms of premises we know. Conclusions built on conclusions built on conclusions, twenty stories away from any premises you have available to you, aren't really part of the ancestral environment. The closest thing to a mental process for dealing with the non-immediately-obvious is respect for the witch doctor and his cryptic symbols, which is how most people end up treating physicists and their equations.

For the first 22 years of my life I thought I understood international politics well enough to have opinions about it. In retrospect, this was simple hunter-gatherer arrogance.

Distant expertise is very common in today's environment, and it's something we're not emotionally set up to handle at all. The only emotional analogue we have for distant expertise is the witch doctor, and in fact this is exactly how people react on encountering someone who offers conclusions they don't immediately understand. "I don't see that. What do you mean, I don't have enough data? You're saying I can't understand you? That I'm too stupid and should just shut up and believe what you say?" No. You shouldn't treat a possible distant expert as either a witch doctor or a poseur trying to claim the status of witch doctor. You should model the probability that an Other Mind is working from a self-consistent body of reasoning which is cognitively distant from you, try to verify any experientially confirmable or deductively explicable conclusions the Other Mind can offer, and model from there the probability that nonexplicable conclusions are likely to be correct. If you're dealing with a distant human expert, you also need to model the possibility that you're being bamboozled; I know I don't understand international politics, but I also don't trust the people who genuinely are experts on the subject to have my interests in mind or to tell the truth about what they see. So how can I figure out who's right? As far as I can tell, the answer is that I don't. I end up not knowing. Short of trying to become an expert on international politics myself, I can't obtain trustworthy knowledge on the subject.

(But isn't it very important for citizens of a democracy to be able to reach reliable conclusions about whether their political leaders are rational? Yes, it's very desirable that this be the case. What does that have to do with the factual question of whether it is actually the case? Nothing.)

>> 1) AI morality is an extremely deep and nonobvious challenge which
>> has no significant probability of going right by accident.
>
> I agree it's a deep and nonobvious challenge. You've done a great job
> of demonstrating that.
>
> I don't agree that any of your published writings have shown it "has no
> significant probability of going right by accident."

That is correct. My conclusion from unpublished theory is as follows:

Morality cannot be gotten right by accident; what we would define as morally optimal AI is formally nonemergent in the sense that it contains a critical program or critical string, which can only be reprised by transferring the critical string from existing humans or by recreating the conditions of human evolution to an infeasibly precise degree. The critical string is too long to have a nonnegligible probability of occurring by chance.

Suboptimally moral AI, that makes us spend the next billion years wishing we'd written the program just a little differently, but that at least doesn't kill us, is not substantially less difficult than optimally moral AI. The criterion can be satisfied by a larger space of shorter critical strings, but such strings are still nonemergent and too long to occur accidentally, and the theoretical mastery needed to create such a string deliberately is no less than what would be needed for optimally moral AI.

You now need to figure out how much weight to assign this conclusion in the absence of the knowledge that you would need to directly support or countersupport it.

>> 2) If you get the deep theory wrong, there is a strong possibility
>> of a silent catastrophic failure: the AI appears to be learning
>> everything just fine, and both you and the AI are apparently making
>> all kinds of fascinating discoveries about AI morality, and
>> everything seems to be going pretty much like your intuitions predict
>> above, but when the AI crosses the cognitive threshold of
>> superintelligence it takes actions which wipe out the human species
>> as a side effect.
>
> Clearly this could happen, but I haven't read anything in your writings
> leading to even a heuristic, intuitive probability estimate for the
> outcome.

Because that outcome in any given case emerges from the interactions of humans and an AI, it's not the sort of thing that could be formally proved - it's a matter of humans tending to make mistakes that fall into particular patterns. However, every human error in AI morality I have encountered so far can be extrapolated to a failure of this type.

>> If I can demonstrate that your current strategy for AI development
>> would undergo silent catastrophic failure in AIXI - that your stated
>> strategy, practiced on AIXI, would wipe out the human species, and
>> you didn't spot it - will you acknowledge that as a "practice loss"?
>> A practice loss isn't the end of the world. I have one practice loss
>> on my record too. But when that happened I took it seriously; I
>> changed my behavior as a result. If you can't spot the silent
>> failure in AIXI, would you then *please* admit that your current
>> strategy on AI morality development is not adequate for building a
>> transhuman AI? You don't have to halt work on Novamente, just accept
>> that you're not ready to try and create a transhuman AI *yet*.
>
> Eliezer, I have not thought very hard about AIXI/AIXItl and its
> implications.

Nor have I. I spotted three foundational differences in AIXI as soon as I read the mathematical definition of an agent, though I did keep reading before I attached confidence to my initial impression. If you understand any cognitive behavior well enough to create it, you should be able to see immediately how it emerges or does not emerge in AIXI. AIXI is very simple and beautiful in that way.

Consider this test: Take any cognitive behavior you've ever actually suceeded in creating in Webmind or Novamente, or which you already know concretely how to code. Can you see immediately how it emerges in AIXI? Is there anything which you know concretely how to do, or which you have ever succeeded in doing, which you *can't* see immediately as emergent from AIXI? But all the things where you're not really sure how they work, you're not really sure if they emerge from AIXI either, right? Just a guess.

> For better or for worse, I am typing these e-mails at about 90 words a
> minute, inbetween doing other things that are of higher short-term
> priority ;)
>
> What I have or have not "spotted" about AIXI/AIXItl doesn't mean very
> much to me.

Basically, if you can't spot deadly structure in AIXI, it means you don't know how to create safe structure for Novamente; anything you know specifically how to create, you'd be able to spot missing or present in AIXI.

> I don't have time this week to sit back for a few hours and think hard
> about the possible consequences of AIXI/AIXItl as a real AI system.
> For example, I have to leave the house in half an hour for a meeting
> related to some possible Novamente funding; then when I get home I have
> a paper on genetic regulatory network inference using Novamente to
> finish, etc. etc. I wish I had all day to think about theoretical AGI
> but it's not the case. Hopefully, if the practical stuff I'm doing now
> succeeds, in a couple years when Novamente is further along I WILL be
> in a position where I can focus 80% instead of 30% of my time on the
> pure AGI aspects.
>
> What I do or do not spot about AIXI or any other system in a few spare
> moments, doesn't say much about what I and the whole Novamente team
> would or would not spot about Novamente, which we understand far more
> deeply than I understand AIXI, and which we are focusing a lot of our
> time on.

I know, I know, I know. The problem is that just because you really need to win the lottery, it doesn't follow that you will. And just because you really don't have the time, pragmatically speaking, to spend on figuring out certain things, doesn't make ignorance of them any less dangerous. I can't detect anything in the structure of the universe which says that genuinely tired and overworked AI researchers are exempted from the consequences imposed by bitch Nature. "I was feeling really tired that day" is, when you think about it, one of the most likely final epitaphs for the human species. Remember that what I'm trying to convince you is that you're not ready *yet*. "I'm really tired right now, can't find time to discuss it" is evidence in favor of this proposition, not against it. "I'm really tired" is only evidence in favor of pushing ahead anyway if you accept the Mom Logic that things you're too tired to think about can't hurt you.

--
Eliezer S. Yudkowsky http://singinst.org/
Research Fellow, Singularity Institute for Artificial Intelligence

-------
To unsubscribe, change your address, or temporarily deactivate your subscription, please go to http://v2.listbox.com/member/?[EMAIL PROTECTED]

Re: [agi] unFriendly AIXI... and Novamente?

Reply via email to