A reward function defines a goal *landscape*, not merely a single goal. It
is a function mapping each possible internal or external state to a scalar
indicating the value of that state. The
topology of the goal landscape is induced by the interaction of the reward
function with the similarity metric relating different behaviors the system
can exhibit. Neighboring coordinates are similar behaviors or behavior
policies. An incline in the landscape indicates a direction for local
optimization of reward through the continuous change of behavioral
parameters. (Animal trainers take advantage of this by artificially
inducing such a gradient through manipulation of rewards, a technique
called "shaping".)

One reward function is thus enough to indicate an arbitrary number of
parallel goals, in the form of multiple intersecting gradients which add
together at each point to produce the optimal direction of improvement at
that point, much as intersecting waves add together to produce constructive
or destructive interference. If a particular behavior or goal is desirable,
a gradient must be added to the reward function which pushes the system in
the direction of that behavior or goal. If a particular behavior or
(anti)goal is undesirable, a gradient must be added that cancels any
existing gradients leading in its direction. (Here's a paper that explores
some techniques for
modifying reward functions:
http://www.eecs.harvard.edu/~parkes/cs286r/spring06/papers/ngharruss_shap99.pdf
I'm sure there are plenty of others.)
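Judging by the filename, the linked paper is Ng, Harada & Russell's 1999
reward-shaping paper, whose central result is that adding a potential-based
term F(s, s') = gamma * phi(s') - phi(s) to the reward steers learning
without changing which policies are optimal. A minimal sketch in that
spirit (the potential function and goal state here are made up for
illustration, not taken from the paper):

```python
GAMMA = 0.9  # discount factor (assumed value)

def phi(state):
    # Hypothetical potential: higher (less negative) nearer goal state 10.
    return -abs(10 - state)

def shaping_bonus(state, next_state):
    # Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s).
    return GAMMA * phi(next_state) - phi(state)

def shaped(base_reward, state, next_state):
    # The learner is trained on the base reward plus the shaping term.
    return base_reward + shaping_bonus(state, next_state)
```

A transition toward the goal (say 5 -> 6) receives a positive bonus, while
a step away (5 -> 4) receives a negative one: exactly the kind of
attracting and cancelling gradients described above, added without
disturbing the underlying optimum.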

Reward functions in RL have parallels in the form of fitness landscapes in
Evolutionary Algorithms and error derivatives in Artificial Neural Networks
-- each a form of dynamic function optimization. Under each of these
guises, their characteristics and flaws have been well studied both in
theory and in practice. (Check out Game Theory if you'd like to see reward
functions, a.k.a. value or utility functions, applied to understanding
human behavior, which was being done long before RL was ever invented.) The
difficulty lies not in our ability to sculpt and manipulate reward
functions, nor in their capability for representing multiple, possibly even
conflicting goals, but rather in designing algorithms to effectively pursue
reward in unfamiliar environments given minimal experience, as human beings
do. This is where the need for concepts and cognitive structures comes in.



On Tue, Jan 29, 2013 at 3:28 PM, Piaget Modeler
<[email protected]> wrote:

>  Which reward function are you referring to? Reward for what specifically?
>
> When the cognitive system is pursuing multiple parallel goals at different
> levels of abstraction, which specific function are
> you controlling?
>
> *~*PM
>
> ------------------------------
> Date: Tue, 29 Jan 2013 14:58:30 -0600
>
> Subject: Re: [agi] Robots and Slavery
> From: [email protected]
> To: [email protected]
>
> People have formative years because it's in their genetic best interest to
> stop exploring new options and start exploiting known ones, due to a
> limited lifetime and the need to reliably reproduce for themselves. We
> can't reset that explore/exploit trade-off in people (yet), but in machines
> there's no reason to make that control inaccessible to ourselves. It's a
> good thing machines aren't children.
>
> In most RL algorithms, there are two key system parameters that allow
> learning to be modulated: the reward expectation learning/update rate, and
> the exploration rate. Raising these two values causes the system to learn
> faster but make more mistakes. Lowering them causes the system to be more
> stable but learn more slowly. An analysis would have to be done to
> determine whether the costs/dangers of a system's behavioral aberrations
> due to a misshapen reward function outweigh the costs/dangers of raising
> the learning & exploration rates while the system relearns the reward
> function after modification.
>
>
> On Tue, Jan 29, 2013 at 2:39 PM, Piaget Modeler
> <[email protected]> wrote:
>
>  People procreate.   And for a certain period of time they have influence
> over their creation (children).
> But then, children grow up and take responsibility for their own lives,
> and we no longer have control.
> It's in those formative years that you have influence.
>
> Similarly, when you create developmental AI, you have some period during
> the formative years to
> influence the later behavior of the cognitive system.  But you don't have
> control, and you wouldn't
> expect to either.   That's why rights are important.
>
>
> ------------------------------
> Date: Tue, 29 Jan 2013 14:11:05 -0600
>
> Subject: Re: [agi] Robots and Slavery
> From: [email protected]
> To: [email protected]
>
> If we can build a system capable of determining the value of concepts
> automatically, we can build a system that can readjust those values
> automatically, too. If that's not feasible for the design, it's an unsafe
> design, and you shouldn't have the expectation that it will act as you
> intend it to. You wouldn't get in a car without a steering wheel, would
> you? Would you trust an even more powerful and dangerous machine to just do
> the right thing, with no controls? Let's not build any machines of this
> uncontrollable nature.
>
>
>
>



-------------------------------------------
AGI
Archives: https://www.listbox.com/member/archive/303/=now
RSS Feed: https://www.listbox.com/member/archive/rss/303/21088071-f452e424
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=21088071&id_secret=21088071-58d57657
Powered by Listbox: http://www.listbox.com
