Matt Mahoney wrote:
Richard,
Let me make sure I understand your proposal.  You propose to program
friendliness into the motivational structure of the AGI as tens of thousands
of hand-coded soft constraints or rules.  Presumably with so many rules, we
should be able to cover every conceivable situation now or in the future where
the AGI would have to make a moral decision.  Among these rules: the AGI is
not allowed to modify the function that computes its reward signal, nor is it
allowed to create another AGI with a different function.

You argue that the reward function becomes more stable after RSI.  I presume
this is because when there are a large number of AGIs, they will be able to
observe any deviant behavior, then make a collective decision as to whether
the deviant should be left alone, reprogrammed, or killed.  This policing
would be included in the reward function.

Presumably the reward function is designed by a committee of upstanding
citizens who have reached a consensus on what it means to be friendly in every
possible scenario.  Once designed, it can never be changed, because if there
were any mechanism by which all of the AGIs could be updated at once, there
would be a single point of failure.  This is not allowed.  On the other hand,
if the AGIs were updated one at a time (allowed only with human permission),
then the resulting deviant behavior would be noticed by the other AGIs before
they could be updated.  So the reward function remains fixed.

Is this correct?

Well, I am going to assume that Mark is wrong and that you are not trying to be sarcastic, but genuinely mean to pose these questions.

You have misunderstood the design at a very deep level, so none of the above would happen.

The multiple constraints are not explicitly programmed into the system in the form of semantically interpretable statements (like Asimov's laws), nor would there be a simple "reward function", nor would there be a committee of experts who sat down and tried to write out a complete list of all the rules. These are all old-AI concepts (conventional, non-complex AI); they simply do not map onto the system at all.

The AGI has a motivational system that *biases* the cloud of concepts in one direction or another, to make the system have certain goals. The nature of this bias is that, during development, the concepts themselves all grew from simple primitives (so primitive that they are not even ideas, but just sources of influence on the concept-building process), and these simple primitives reach out through the entire web of adult concepts.

This is a difficult idea to grasp, I admit, but the consequence of that type of system design is that, for example, the general idea of "feeling empathy for the needs and aspirations of the entire human race" is not represented in the system as an explicit memory location that says "Rule number 71, as decided by the Committee of World AGI Ethics Experts, is that you must feel empathy for the entire human race". Instead, the thing that we externally describe as "empathy" is just the collective result of a massive number of learned concepts and their connections.

This makes "empathy" a _systemic_ characteristic, intrinsic to the entire system, not a localizable rule.

The empathy feeling, to be sure, is controlled by roots that go back to the motivational system, but these roots would be built in such a way that tampering or malfunction would:

(a) not be able to happen without a huge intervention, which would be easily noticed, and

(b) not cause any catastrophic behavior even if it did go wrong, because the malfunctioning of the motivational system would render the entire system useless.

Notice that in a real human, damage to the empathy component can cause trouble, but that is precisely because we have other, dangerous components, such as our aggression modules, which can take over. No such modules would be present in the AGI, so it would degrade gracefully if the empathy system (for some bizarre reason) were interfered with.
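As a toy illustration of that graceful degradation (again, hypothetical numbers, not the design): competence and motivational coherence are entangled, so corrupting the motives collapses capability rather than redirecting it, and there is no standby aggression module for the damage to hand control to.

# Toy sketch of graceful degradation; everything here is hypothetical.

def coherence(motives):
    """Crude integrity measure: how mutually consistent the motives are."""
    mean = sum(motives) / len(motives)
    variance = sum((m - mean) ** 2 for m in motives) / len(motives)
    return 1.0 / (1.0 + variance)

def effective_capability(motives, plan_quality):
    # Capability is gated on motivational coherence: a damaged system
    # becomes useless, not malevolent.
    return plan_quality * coherence(motives)

healthy  = [1.00, 1.01, 0.99, 1.00]
tampered = [1.00, -5.00, 0.99, 7.00]            # hypothetical tampering
print(effective_capability(healthy, 0.9))       # ~0.90: near full capability
print(effective_capability(tampered, 0.9))      # ~0.05: useless, not redirected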

And to answer your general question: the empathy function would not be constrained to be fixed, because it would be dependent on the wishes of humanity. Or rather, the *nature* of the empathy function would stay the same, but the content (the expression of the empathy) would stay locked to the desires of humanity, in perpetuity.
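A last toy sketch to pin down that distinction (hypothetical names throughout): the *form* of the function is invariant, while the preferences it consults are supplied live, not frozen in at design time by any committee.

# Hypothetical sketch: fixed nature, tracking content.

def empathic_choice(options, human_preference):
    # The *nature*: always pick the option that best serves current human
    # preferences. This rule never changes.
    return max(options, key=human_preference)

# The *content*: a live preference source, consulted in perpetuity, not a
# constant baked in by a design committee.
prefs_today = {"cure disease": 0.9, "build monuments": 0.2}
print(empathic_choice(list(prefs_today), lambda o: prefs_today.get(o, 0.0)))
# -> cure disease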


Hope that answers the questions.



Richard Loosemore





--- Richard Loosemore <[EMAIL PROTECTED]> wrote:

Matt Mahoney wrote:
--- Richard Loosemore <[EMAIL PROTECTED]> wrote:

Derek Zahn wrote:
Richard Loosemore writes:

 > It is much less opaque.
 >
 > I have argued that this is the ONLY way that I know of to ensure that
 > AGI is done in a way that allows safety/friendliness to be guaranteed.
 >
 > I will have more to say about that tomorrow, when I hope to make an
 > announcement.

Cool. I'm sure I'm not the only one eager to see how you can guarantee (read: prove) such specific detailed things about the behaviors of a complex system.
Hmmm... do I detect some skepticism?  ;-)
I remain skeptical.  Your argument applies to an AGI not modifying its own
motivational system.  It does not apply to an AGI making modified copies of
itself.  In fact you say:
Not correct, I am afraid: I specifically emphasize that the AGI is allowed to modify its own motivational system. I don't know how you got the opposite idea. (I haven't had time to review my text, so apologies if it was my fault and I accidentally gave the wrong impression .... but the whole point of this essay was to suggest a way to guarantee friendliness under any circumstances, including self-improvement.)

Also, during the development of the first true AI, we would monitor the connections going from motivational system to thinking system. It would be easy to set up alarm bells if certain kinds of thoughts started to take hold -- just do it by watching for associations with certain key sets of concepts and keywords. While we are designing a stable motivational system, we can watch exactly what goes on, and keep tweaking until it gets to a point where it is clearly not going to get out of the large potential well.
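[Illustration of the kind of alarm bell described above -- a minimal sketch with entirely hypothetical cluster names and thresholds, not the actual monitoring scheme:]

# Toy sketch: fire an alarm when a thought trace's active concepts
# overlap heavily with a flagged cluster. All names are hypothetical.
DEVIANT_CLUSTERS = {
    "reward_tampering": {"reward", "modify", "own", "signal"},
    "deception": {"conceal", "mislead", "operator"},
}

def check_thought_trace(active, threshold=0.75):
    """Return (cluster, overlap) pairs whose overlap crosses the threshold."""
    alarms = []
    for name, cluster in DEVIANT_CLUSTERS.items():
        overlap = len(active & cluster) / len(cluster)
        if overlap >= threshold:
            alarms.append((name, overlap))
    return alarms

print(check_thought_trace({"reward", "modify", "own", "planning"}))
# -> [('reward_tampering', 0.75)]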
I do not see how this illustrates your point above.


You refer to the humans building the first AGI.  Humans, being imperfect,
might not get the algorithm for friendliness exactly right in the first
iteration.  So it will be up to the AGI to tweak the second copy a little more
(according to the first AGI's interpretation of friendliness).  And so on.  So
the goal drifts a little with each iteration.  And we have no control over
which way it drifts.
What an extraordinary statement to make!

The purpose of the essay was to argue that with each iteration it digs itself deeper into the same pattern and cannot drift out into an unfriendly state.
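[Illustration of the attractor claim -- a toy model, not the essay's formalism: treat the friendly state as the bottom of a deep potential well and each generation's self-modification as a noisy step; the basin keeps pulling the state back, so the errors do not accumulate:]

# Toy attractor model (hypothetical numbers): each generation of
# self-modification adds noise, but the well pulls the state back.
import random

random.seed(1)

def well_gradient(x):
    """Gradient of a deep well centered on the friendly state x = 0."""
    return 2.0 * x

x = 0.3  # imperfect first build: slightly off the friendly state
for generation in range(50):
    noise = random.gauss(0.0, 0.05)          # imperfect copying / tweaking
    x = x - 0.2 * well_gradient(x) + noise   # pulled back into the basin
print(f"after 50 generations: x = {x:+.3f}")  # hovers near 0; no cumulative drift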

But you reply to this by simply stating that the opposite is going to be the case, without saying why. Which part of my argument did you decide was wrong, such that you could state the opposite conclusion?



Richard Loosemore







-- Matt Mahoney, [EMAIL PROTECTED]



