Richard,
Let me make sure I understand your proposal.  You propose to program
friendliness into the motivational structure of the AGI as tens of thousands
of hand-coded soft constraints or rules.  Presumably with so many rules, we
should be able to cover every conceivable situation now or in the future where
the AGI would have to make a moral decision.  Among these rules: the AGI is
not allowed to modify the function that computes its reward signal, nor is it
allowed to create another AGI with a different function.

You argue that the reward function becomes more stable after RSI.  I presume
this is because when there are a large number of AGIs, they will be able to
observe any deviant behavior, then make a collective decision as to whether
the deviant should be left alone, reprogrammed, or killed.  This policing
would be included in the reward function.

Presumably the reward function is designed by a committee of upstanding
citizens who have reached a consensus on what it means to be friendly in every
possible scenario.  Once designed, it can never be changed, because any
mechanism by which all of the AGIs could be updated at once would be a single
point of failure.  This is not allowed.  On the other hand,
if the AGIs were updated one at a time (allowed only with human permission),
then the resulting deviant behavior would be noticed by the other AGIs before
they could be updated.  So the reward function remains fixed.

Is this correct?


--- Richard Loosemore <[EMAIL PROTECTED]> wrote:

> Matt Mahoney wrote:
> > --- Richard Loosemore <[EMAIL PROTECTED]> wrote:
> > 
> >> Derek Zahn wrote:
> >>> Richard Loosemore writes:
> >>>
> >>>  > It is much less opaque.
> >>>  >
> >>>  > I have argued that this is the ONLY way that I know of to ensure that
> >>>  > AGI is done in a way that allows safety/friendliness to be
> guaranteed.
> >>>  >
> >>>  > I will have more to say about that tomorrow, when I hope to make an
> >>>  > announcement.
> >>>
> >>> Cool.  I'm sure I'm not the only one eager to see how you can guarantee 
> >>> (read: prove) such specific detailed things about the behaviors of a 
> >>> complex system.
> >> Hmmm... do I detect some skepticism?  ;-)
> > 
> > I remain skeptical.  Your argument applies to an AGI not modifying its own
> > motivational system.  It does not apply to an AGI making modified copies of
> > itself.  In fact you say:
> 
> Not correct, I am afraid:  I specifically emphasize that the AGI is 
> allowed to modify its own motivational system.  I don't know how you got 
> the opposite idea.  (I haven't had time to review my text, so apologies 
> if it was my fault and I did accidentally give the wrong impression .... 
> but the whole point of this essay was to suggest a way to guarantee 
> friendliness under any circumstances, including self-improvement).
> 
> >> Also, during the development of the first true AI, we would monitor the 
> >> connections going from motivational system to thinking system.  It would 
> >> be easy to set up alarm bells if certain kinds of thoughts started to 
> >> take hold -- just do it by associating with certain key sets of 
> >> concepts and keywords.  While we are designing a stable motivational 
> >> system, we can watch exactly what goes on, and keep tweaking until it 
> >> gets to a point where it is clearly not going to get out of the large 
> >> potential well.
> 
> I do not see how this illustrates your point above.
> 
> 
> > You refer to the humans building the first AGI.  Humans, being imperfect,
> > might not get the algorithm for friendliness exactly right in the first
> > iteration.  So it will be up to the AGI to tweak the second copy a little
> > more (according to the first AGI's interpretation of friendliness).  And
> > so on.  So the goal drifts a little with each iteration.  And we have no
> > control over which way it drifts.
> 
> What an extraordinary statement to make!
> 
> The purpose of the essay was to argue that with each iteration it digs 
> itself deeper into the same pattern and cannot drift out into an 
> unfriendly state.
> 
> But you reply to this by just stating that the opposite is going to be 
> the case, without saying why.  Which part of my argument did you decide 
> was wrong, that you could state the opposite conclusion?
> 
> 
> 
> Richard Loosemore
> 
> 
> 
> 


-- Matt Mahoney, [EMAIL PROTECTED]

-----
This list is sponsored by AGIRI: http://www.agiri.org/email
To unsubscribe or change your options, please go to:
http://v2.listbox.com/member/?member_id=8660244&id_secret=48486801-696bc5
