So it looks like, to really create any kind of system like this, some sort of separate, black-box programming capability must be created, wherein we 'hard-wire' in the reward system, separate from the main AI unit, so that the AI cannot in any way change it to reward itself.

  The only problem with that is that we DO need it to be changed, in at least subtle ways.
  So how do we deal with that duality?  It needs to change, based on past history, but if we allow the machine to have full control, then it will sooner or later find the magic button, or loop (morphine for free, with no side effects), that will put it in a permanent state of pleasure.

  So how do we do something like that?  Is it possible to have a second, small AI in the second box that can modify the reward system, but not modify it so much that the first AI reaches that state?
  I'm not sure that that is really possible, as it would greatly restrict the ability to change, and that second AI would need to know almost as much as the first AI.
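
To make that duality a bit more concrete, here is a toy sketch in
Python (purely my own invention, every name in it is made up) of the
most restrictive version I can picture: the reward module lives in its
own box, and the only thing the main AI can do is request tiny,
budgeted adjustments to it.

# Toy sketch (all names invented): a reward module sealed in its own
# box.  The main AI can only request small, budgeted adjustments to
# the reward weights; it can never set the reward value directly, and
# it cannot drift the weights beyond a fixed lifetime budget.

class SealedRewardModule:
    MAX_STEP = 0.01         # largest single adjustment allowed
    MAX_TOTAL_DRIFT = 0.25  # how far any weight may ever move

    def __init__(self, weights):
        self._weights = dict(weights)  # e.g. {"curiosity": 1.0, "social": 1.0}
        self._initial = dict(weights)

    def reward(self, observed_outcomes):
        """Compute reward from external observations only.  (This
        assumes the observations come from sensors the AI cannot
        spoof, which is of course the hard part.)"""
        return sum(self._weights[k] * v
                   for k, v in observed_outcomes.items()
                   if k in self._weights)

    def request_adjustment(self, key, delta):
        """The only mutation allowed: a small, budgeted nudge."""
        delta = max(-self.MAX_STEP, min(self.MAX_STEP, delta))
        proposed = self._weights[key] + delta
        if abs(proposed - self._initial[key]) > self.MAX_TOTAL_DRIFT:
            return False    # outside the drift budget; request refused
        self._weights[key] = proposed
        return True

The drift budget there is really the 'second small AI' reduced to a
dumb rule, which is exactly why it feels too restrictive to do the
whole job.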

And the other thought you had, about some sort of external control, has problems as well... We can't really tell the AI, by a button or even by words or a signal, that it is doing a good job, because eventually it will find that 'button' as well, and realize that it can send the 'good job' signal to itself.  Once it finds out how to do that, it can tell itself when and how to change, and short-circuit the loop.

This isn't really what I had discussed earlier, though; earlier I had talked more about what the modifications to motivation actually were, not where they were physically coming from.

But any reinforcement that comes from within the bot, or that is easily simulated by a signal, can be taken over and used.

James

Richard Loosemore <[EMAIL PROTECTED]> wrote:
Hank Conn wrote:
> Although I understand, in vague terms, what idea Richard is attempting
> to express, I don't see why having "massive numbers of weak constraints"
> or "large numbers of connections from [the] motivational system to
> [the] thinking system." gives any more reason to believe it is reliably
> Friendly (without any further specification of the actual processes)
> than one with "few numbers of strong constraints" or "a small number of
> connections between the motivational system and the thinking system".
> The Friendliness of the system would still depend just as strongly on
> the actual meaning of the connections and constraints, regardless of
> their number, and just giving an analogy to an extremely reliable
> non-determinate system (Ideal Gas) does nothing to explain how you are
> going to replicate this in the motivational system of an AGI.
>
> -hank

Hank,

There are three things in my proposal that can be separated, and perhaps
it will help clear things up a little if I explicitly distinguish them.

The first is a general principle about "stability" in the abstract,
while the second is about the particular way in which I see a
motivational system being constructed so that it is stable. The third
is how we take a stable motivational system and ensure it is
Stable+Friendly, not just Stable.



About stability in the abstract. A system can be governed, in general,
by two different types of mechanism (they are really two extremes on a
continuum, but that is not important): the fragile, deterministic type,
and the massively parallel weak constraints type. A good example of the
fragile deterministic type would be a set of instructions for getting to
a particular place in a city which consisted of a sequence of steps
you should take, along named roads, from a special starting point. An
example of a weak constraint version of the same thing would be to give
a large set of clues to the position of the place (near a pond, near a
library, in an area where Dickens used to live, opposite a house painted
blue, near a small school, etc.).

The difference between these two approaches would be the effects of a
disturbance on the system (errors, or whatever): the fragile one only
needs to be a few steps out and the whole thing breaks down, whereas the
multiple constraints version can have enormous amounts of noise in it
and still be extremely accurate. (You could look on the Twenty
Questions game in the same way: 20 vague questions can serve to pin
down most objects in the world of our experience).
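
To make the contrast concrete, here is a toy sketch in Python (just an
illustration with made-up numbers, not part of the proposal): one
function replays an exact sequence of steps, where a single corrupted
step ruins the endpoint, while the other averages a couple of hundred
noisy clues and still lands close to the target.

import random

TARGET = (40.0, 60.0)

def follow_directions(steps, corrupt_one=False):
    """Fragile scheme: replay an exact sequence of (dx, dy) steps.
    Corrupting a single step throws the endpoint far off."""
    x, y = 0.0, 0.0
    for i, (dx, dy) in enumerate(steps):
        if corrupt_one and i == 2:
            dx, dy = dy, -dx  # one wrong turn
        x, y = x + dx, y + dy
    return x, y

def combine_clues(n_clues=200, noise=15.0):
    """Weak-constraint scheme: every clue is a rough, noisy hint at
    the target; averaging many of them still pins it down."""
    xs = [TARGET[0] + random.gauss(0, noise) for _ in range(n_clues)]
    ys = [TARGET[1] + random.gauss(0, noise) for _ in range(n_clues)]
    return sum(xs) / n_clues, sum(ys) / n_clues

steps = [(10, 0), (10, 0), (10, 30), (10, 30)]
print(follow_directions(steps))                    # (40.0, 60.0), exact
print(follow_directions(steps, corrupt_one=True))  # nowhere near it
print(combine_clues())                             # close to (40, 60)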

What is the significance of this? Goal systems in conventional AI have
an inherent tendency to belong to the fragile/deterministic class. Why
does this matter? Because it would take very little for the AI system
to change from its initial design (with friendliness built into its
supergoal) to one in which the friendliness no longer dominated. There
are various ways that this could happen, but the most important one, for
my purposes, is where the meaning of "Be Friendly" (or however the
supergoal is worded) starts to depend on interpretation on the part of
the AI, and that interpretation starts to get distorted. You know the
kind of scenario that people come up with: the AI is told to be
friendly, but it eventually decides that because people are unhappy much
of the time, the only logical way to stop all the unhappiness is to
eliminate all the people. Something stupid like that. If you trace
back the reasons why the AI could have come to such a dumb conclusion,
you eventually realize that it is because the motivation system was so
fragile that it was sensitive to very, very small perturbations -
basically, one wrong turn in the logic and the result could be
absolutely anything. (In much the same way that one small wrong step or
one unanticipated piece of road construction could ruin a set of
directions that told you how to get to a place by specifying that you go
251 steps east on Oxford Street, then 489 steps north on.... etc.).

The more you look at those conventional goal systems, the more they look
fragile. I cannot give all the arguments here because they are
extensive, so maybe you can take my word for it. This is one reason
(though not the only one) why efforts to mathematically prove the
validity of one of those goal systems under recursive self-improvement
are just a complete joke.

Now, what I have tried to argue is that there are other ways to ensure
the stability of a system: the multiple weak constraints idea is what
was behind my original mention of an Ideal Gas. The P, V and T of an
Ideal Gas are the result of many constraints (the random movements of
vast numbers of constituent particles), and as a result the P, V and T
are exquisitely predictable.
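
As a back-of-the-envelope illustration (ordinary statistics, nothing
specific to my proposal): the relative fluctuation of an average over
N independent contributions shrinks like 1/sqrt(N), which is why a
quantity built out of vast numbers of random events can be so
predictable.

import random, statistics

def relative_fluctuation(n_particles, trials=100):
    """Spread of the mean of n_particles independent contributions,
    relative to the mean itself."""
    means = [statistics.fmean(random.random() for _ in range(n_particles))
             for _ in range(trials)]
    return statistics.stdev(means) / statistics.fmean(means)

for n in (10, 1_000, 100_000):
    print(n, relative_fluctuation(n))
# The spread drops roughly as 1/sqrt(n); at the particle counts of a
# real gas it is utterly negligible, even though every individual
# contribution is random.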

The question becomes: can you control/motivate the behavior of an AI
using *some* variety of motivational system that belongs in the "massive
numbers of weak constraints" category? If you could find any way of
doing this you could perhaps make the thing as reliable as an Ideal Gas.

So that is Part One of the two part argument: if you can find a
motivational system of that sort, you are (almost) home and dry. My
initial claim is that this is possible in principle.



Part two of the argument is then about *how* to do that.

To understand how this is feasible, imagine an AI system in which the
"concepts" (basic units of knowledge) grow from primitive seed concepts
(like Momma, Food, Drink, Warmth, Friends, Play, Curiosity .... ). At
the beginning the primitive seeds are coupled together by an equally
primitive motivational system that basically hardwires the ways that the
system can get pleasure.

What happens after that is that new concepts develop in abundance.
Glossing over an immensely complex process, we can note two things:

1) The system is not governed by a strict goal stack; it just improvises
the things it "wants" to do within broad parameters defined by the goal
system. What this means is that there is an attentional focus: a set
of concepts and models that are currently most active and that define
what the system is thinking about and trying to do at any given moment.
This attentional focus wanders around and occasionally gets pushed in
one direction or another by the motivational system. The motivational
system is not smart (there is no knowledge base there); it just pulls
things in one direction or another (a toy sketch of this appears just
after point 2).

2) The way that it pulls things around depends on two different kinds of
connections in the system. The concepts are mutually linked because
they represent the learned structure of the world. But quite separate
from these regular links there is another set that grows from the
primitive concepts like branches on a tree: these cross-connect the
system, so that every concept is linked back, ultimately, to primitives.
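
Here is a toy sketch in Python of point (1), with everything invented
purely for illustration: the attentional focus mostly wanders by free
association, and the motivational system occasionally nudges it toward
concepts that connect back to the primitives.

import random

relevance_to_primitives = {  # how strongly each concept links back
    "weather": 0.2, "lunch": 0.8, "warm_house": 0.7,
    "novel_gadget": 0.6, "old_friend": 0.9,
}
concepts = list(relevance_to_primitives)

def next_focus(nudge_probability=0.3):
    """Mostly free association; now and then a motivational nudge
    biases the choice toward motivationally relevant concepts."""
    if random.random() < nudge_probability:
        weights = [relevance_to_primitives[c] for c in concepts]
        return random.choices(concepts, weights=weights, k=1)[0]
    return random.choice(concepts)

for _ in range(10):
    print(next_focus())  # a wandering, occasionally nudged focus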


Now, one set of primitives are the primitive motivations and the plans
and actions the system can take in order to satisfy those primitives.
As time goes on, the sophistication of those plans can become immense,
but they still connect back. For example, in childhood, if you were
cold you would snuggle up to a warm object, but as an adult you plan for
cold situations by getting a warm house (and you plan to earn money to
pay for the house and for fuel, etc. etc.). When you look at all the
things we know how to do, you find that all of those concepts (plans,
schemas, models, etc) are linked back to primitive motivations, through
intermediaries.
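
To make that structure concrete, here is a minimal toy sketch (the
particular concepts and functions are invented for illustration only):
a concept space with ordinary semantic links, plus a second set of
motivational links that let any plan be traced back to the primitives
it ultimately serves.

from collections import defaultdict

PRIMITIVES = {"warmth", "food", "curiosity", "social"}

regular_links = defaultdict(set)       # learned structure of the world
motivational_links = defaultdict(set)  # what a concept ultimately serves

def link(a, b):
    """Ordinary link: part of the learned structure of the world."""
    regular_links[a].add(b)
    regular_links[b].add(a)

def motivates(child, parent):
    """Record that `child` (a plan, schema, model) serves `parent`."""
    motivational_links[child].add(parent)

def trace_to_primitives(concept):
    """Follow motivational links back until primitives are reached."""
    found, visited, frontier = set(), set(), [concept]
    while frontier:
        c = frontier.pop()
        if c in visited:
            continue
        visited.add(c)
        if c in PRIMITIVES:
            found.add(c)
        else:
            frontier.extend(motivational_links[c])
    return found

# The adult's elaborate plans still connect back to childhood primitives:
link("warm_house", "fireplace")
motivates("snuggle_up", "warmth")
motivates("warm_house", "warmth")
motivates("earn_money", "warm_house")
motivates("take_job", "earn_money")
print(trace_to_primitives("take_job"))  # {'warmth'}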

What this means, in practice, is that in the adult, the motivational
system has its fingers in all of the stuff that goes on in the thinking
part of the system, because of the historical development of that stuff.

Further, when the system tries to decide what to do, it considers
candidate plans and runs them through its internal simulator to see
whether they are consistent with its primitive motivational goals (in
other words, it thinks about and has a gut feeling about whether a
particular proposed behavior is really consistent with its motivational
goals). If, at any stage, this modelling turns up any inconsistencies,
alarm bells go off and the system is forced to confront the problem
(you'll recall the example I gave originally about a person who wanted
to sell their mother for money, and the consequences this had inside
their thinking system).
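
Here is a cartoon of that checking step (again, something I am
inventing purely as an illustration, not a specification): each
candidate plan's predicted effects are scored against the whole set of
primitive motivations at once, and a strong violation of any one of
them trips the alarm instead of being quietly traded away.

VIOLATION_THRESHOLD = -0.5

def check_plan(predicted_effects, motivations):
    """predicted_effects maps motivation -> predicted impact in [-1, 1].
    Any strong violation is flagged rather than traded away."""
    violated = [m for m in motivations
                if predicted_effects.get(m, 0.0) < VIOLATION_THRESHOLD]
    return len(violated) == 0, violated

# "Sell mother for money": great for one motivation, terrible for others.
effects = {"wealth": 0.9, "attachment": -1.0, "social": -0.8}
ok, violated = check_plan(effects,
                          ["wealth", "attachment", "social", "curiosity"])
print(ok, violated)  # False ['attachment', 'social']: alarm bells go off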

Now, if you look at the overall design of this motivational system you
will notice that it does not look at all like a traditional AI goal
stack (aka planning system), and that it does look like there are
massive numbers of constraints at work.

There are constraints in at least two places: the motivational tree of
connections that reach out into the concept space, and the check and
balance mechanisms that cause plans to be tested against evolving
criteria. Each of these involves such huge numbers of constraints that
it is very hard to shake the system away from the mindset that it has,
once it is fully grown.

This, I claim, is the basic outline of the way in which stability is
achieved in a motivational system.




Finally, there is the question of why such a system would be "friendly"
at all. How to define friendly inside the system, in such a way that
the system is actually doing things that seem to us to be friendly?
After all, it might be utterly stable, but we need stable+(does what we
want).

Not easy, but I note that there is an ideal type of human behavior that
is of the sort we desire. We can all imagine a "saintlike" human who is
extraordinarily fair and reasonable. I believe that that archetype can
actually be understood in terms of the balance of the various
motivational primitives inside them, and the way they developed. That
is where our research efforts should be directed: toward understanding
exactly how such a motivational system is structured.

All I can do on that point is to say that I see it as being possible,
and I see no arguments being advanced as to why that should be an
impossible goal for us to achieve.

At the end of the day, what we could do is to give the system the purest
version of that "saintlike" motivational system (NOT a mother-hen, Nanny
type, I hasten to add, because I know that some people always interpret
it that way), and then accept that, hey, it is going to have trouble
with some of the same moral and ethical dilemmas that we have trouble
with. But even though it, and we, will have those dilemmas, it will be
no more unpredictable and alien in its thoughts than an extremely
(1000x) peaceful, benign version of Mahatma Gandhi (and without some of
the motivational flaws that even Gandhi possessed).

What I have done is to show a way to get there. A complete solution, in
principle, to the Friendliness Problem, in which only the fine details
remain. Perhaps there is a showstopper in the fine details. I don't
believe so, but if anyone else thinks there is, they need to be
specific, and not just say that "it doesn't look like it will work".


Hope that helps.




Richard Loosemore











Thank You
James Ratcliff
http://falazar.com

