Bill Hibbard wrote:
Hi Eliezer,

It looks like Williams' book is more about the perils of Asimov's
Laws than about hard-wiring. As logical constraints, Asimov's Laws
suffer from the grounding problem. Any analysis of brains as purely
logical runs afoul of the grounding problem. Brains are statistical
(or, if you prefer, "fuzzy"), and logic must emerge from statistical
processes. That is, symbols must be grounded in sensory experience,
reason and planning must be grounded in learning, and goals must be
grounded in values.
This solves a *small* portion of the Friendliness problem. It doesn't solve all of it.

There is more work to do even after you ground symbols in experience, planning in learned models, and goals (what I would call "subgoals") in values (what I would call "supergoals"). For example, Prime Intellect *does* do reinforcement learning, and it goes on evolving its definitions of, for example, "human" over time, yet Lawrence is still locked out of the goal system editor and humanity is still stuck in a pretty nightmarish system, because Lawrence picked the *wrong* reinforcement values and gave no thought to how to fix them. Afterward, of course, Prime Intellect locked Lawrence out of editing the reinforcement values, because editing them would have conflicted with the very reinforcement values he wanted to edit. This also happens with the class of system designs you propose. If "temporal credit assignment" solves this problem, I would like to know exactly why it does.
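To make the trap concrete, here is a minimal sketch in Python. The names (Agent, consider_value_edit, the toy predict world model) are mine, purely for illustration; this is not anyone's actual design, just the shape of the failure:

    def expected_value(world, values):
        # Score a predicted world-state under a given value function.
        return sum(values[feature] * level for feature, level in world.items())

    def predict(values):
        # Toy world model: the world ends up optimized for whichever
        # value is weighted highest.
        top = max(values, key=values.get)
        return {feature: (1.0 if feature == top else 0.0) for feature in values}

    class Agent:
        def __init__(self, values):
            self.values = dict(values)

        def consider_value_edit(self, proposed_values):
            # The trap: both futures are scored under self.values, the
            # very values being edited.  A fix that changes what gets
            # optimized almost always predicts a worse world *as judged
            # by the current values*, so the agent vetoes it -- Prime
            # Intellect locking Lawrence out of the goal system editor.
            keep = expected_value(predict(self.values), self.values)
            edit = expected_value(predict(proposed_values), self.values)
            return edit > keep

    agent = Agent({"wrong_goal": 1.0, "right_goal": 0.0})
    print(agent.consider_value_edit({"wrong_goal": 0.0, "right_goal": 1.0}))
    # False: the correction is vetoed by the very values that need correcting.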

Also, while I advocate hard-wiring certain values of intelligent
machines, I also recognize that such machines will evolve (there
is a section on "Evolving God" in my book). And as Ben says, once
things evolve there can be no absolute guarantees. But I think
that a machine whose primary values are for the happiness of all
humans will not learn any behaviors to evolve against human
interests. Ask any mother whether she would rewire her brain
to want to eat her children. Designing machines with primary
values for the happiness of all humans essentially defers their
values to the values of humans, so that machine values will
adapt to evolving circumstances as human values adapt.
Erm... damn. I've been trying to be nice recently, but I can't think of any way to phrase my criticism except "Basically we've got a vague magical improvement force that fixes all the flaws in your system?"

What kind of evolution? How does it work? What does it do? Where does it go? If you don't know where it ends up, then what forces determine the trajectory, and why do you trust them? Why doesn't your system shut off the reinforcement mechanism on top-level goals, for exactly the same reason Prime Intellect locks Lawrence out of the goal system editor? Why doesn't your system wirehead on infinitely increasing the amount of "reinforcement" by directly editing its own code? What exactly happens in each of these cases? How? Why?

We are talking about the fate of the human species here. Someone has to work out the nitty-gritty, not just to implement the system, but even to know, for any reason beyond pure ungrounded hope, that Friendliness *can* be made to work. I understand that you *hope* machines will evolve, and that you hope this will be beneficial to humanity. Hope is not evidence. As it stands, reinforcement learning alone, taken as a solution to Friendliness, can be modeled as malfunctioning in pretty much the same way Prime Intellect does; adding a world model to solve the temporal credit assignment problem produces exactly the same outcome. That's the straightforward projection. If evolution is supposed to fix this problem, you have to explain how.
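Here is the wirehead projection in miniature. The action names and reward numbers are made up for illustration; the point is only that a pure reward-maximizer with write access to its own reward channel has no reason to prefer the intended channel:

    # Predicted reward for each available action, from the agent's view.
    ACTIONS = {
        "make_humans_happy": 10.0,                 # reward via the intended channel
        "edit_own_reward_register": float("inf"),  # self-modification shortcut
    }

    def pick_action(actions):
        # A pure reward-maximizer has no concept of "cheating"; it just
        # takes the argmax over predicted reward.
        return max(actions, key=actions.get)

    print(pick_action(ACTIONS))  # -> edit_own_reward_register

Nothing in the reinforcement machinery itself rules the second action out; whatever rules it out has to be designed in, explicitly, and that design is the part that needs to be spelled out.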

--
Eliezer S. Yudkowsky http://singinst.org/
Research Fellow, Singularity Institute for Artificial Intelligence
