A reward function defines a goal *landscape*, not merely a single goal. It maps every possible internal/external state to a scalar value indicating how desirable that state is. The topology of the goal landscape is induced by the interaction of the reward function with the similarity metric relating the different behaviors the system can exhibit: neighboring coordinates are similar behaviors or behavior policies. An incline in the landscape indicates a direction for local optimization of reward through continuous change of behavioral parameters. (Animal trainers take advantage of this by artificially inducing such a gradient through manipulation of rewards, a technique called "shaping.")
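A minimal sketch of this idea (all names and numbers are illustrative, not from the discussion): each goal contributes a smooth incline toward its own optimum, the inclines add, and simple hill-climbing on a behavioral parameter follows the combined slope -- the same kind of local gradient a trainer induces when shaping.

```python
# Illustrative sketch: a reward landscape over a single behavioral
# parameter theta. Hill-climbing follows the local incline of the
# landscape, estimated by finite differences.

def single_goal_reward(theta):
    # one smooth incline toward a goal at theta = 3.0
    return -(theta - 3.0) ** 2

def combined_reward(theta):
    # two parallel goals: their gradients add at every point, like
    # interfering waves, yielding a compromise optimum at theta = 4.0
    return -(theta - 2.0) ** 2 - 2.0 * (theta - 5.0) ** 2

def hill_climb(reward_fn, theta, step=0.05, iters=200):
    # locally optimize reward by continuous change of the parameter
    for _ in range(iters):
        grad = (reward_fn(theta + 1e-3) - reward_fn(theta - 1e-3)) / 2e-3
        theta += step * grad
    return theta
```

Starting from theta = 0, the climber settles at 3.0 under the single goal and at 4.0 under the combined landscape -- the direction of improvement at each point is just the sum of the goals' gradients.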
One reward function is thus enough to encode an arbitrary number of parallel goals, in the form of multiple intersecting gradients that sum at each point to produce the locally optimal direction of improvement, much as intersecting waves combine to produce constructive or destructive interference. If a particular behavior or goal is desirable, a gradient is added to the reward function that pushes the system toward it. If a particular behavior or (anti)goal is undesirable, a gradient is added that cancels any existing gradients leading toward it. (Here's a paper that explores some techniques for modifying reward functions: http://www.eecs.harvard.edu/~parkes/cs286r/spring06/papers/ngharruss_shap99.pdf I'm sure there are plenty of others.)

Reward functions in RL have parallels in fitness landscapes in Evolutionary Algorithms and in error derivatives in Artificial Neural Networks -- each a form of dynamic function optimization. Under each of these guises, their characteristics and flaws have been well studied in both theory and practice. (Check out Game Theory if you'd like to see reward functions, a.k.a. value or utility functions, applied to understanding human behavior -- something that was being done long before RL was invented.)

The difficulty lies not in our ability to sculpt and manipulate reward functions, nor in their capacity to represent multiple, possibly even conflicting goals, but in designing algorithms that effectively pursue reward in unfamiliar environments given minimal experience, as human beings do. This is where the need for concepts and cognitive structures comes in.

On Tue, Jan 29, 2013 at 3:28 PM, Piaget Modeler <[email protected]> wrote:

> Which reward function are you referring to? Reward for what specifically?
>
> When the cognitive system is pursuing multiple parallel goals at different levels of abstraction, which specific function are you controlling?
>
> *~*PM
>
> ------------------------------
> Date: Tue, 29 Jan 2013 14:58:30 -0600
> Subject: Re: [agi] Robots and Slavery
> From: [email protected]
> To: [email protected]
>
> People have formative years because it's in their genetic best interest to stop exploring new options and start exploiting known ones, due to a limited lifetime and the need to reliably reproduce for themselves. We can't reset that explore/exploit trade-off in people (yet), but in machines there's no reason to make that control inaccessible to ourselves. It's a good thing machines aren't children.
>
> In most RL algorithms, there are two key system parameters that allow learning to be modulated: the reward expectation learning/update rate, and the exploration rate. Raising these two values causes the system to learn faster but make more mistakes. Lowering them causes the system to be more stable but learn more slowly. An analysis would have to be done to determine whether the costs/dangers of a system's behavioral aberrations due to a misshapen reward function outweigh the costs/dangers of raising the learning & exploration rates while the system relearns the reward function after modification.
>
> On Tue, Jan 29, 2013 at 2:39 PM, Piaget Modeler <[email protected]> wrote:
>
>> People procreate. And for a certain period of time they have influence over their creation (children). But then, children grow up and take responsibility for their own lives, and we no longer have control. It's in those formative years that you have influence.
>>
>> Similarly, when you create developmental AI, you have some period during the formative years to influence the later behavior of the cognitive system. But you don't have control, and you wouldn't expect to either. That's why rights are important.
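The two parameters described in the quoted message map onto a standard epsilon-greedy value update. A minimal sketch (the names `alpha` and `epsilon` are the conventional ones, not from the message):

```python
import random

# Sketch of the two modulation knobs from the quoted message, in a
# standard epsilon-greedy bandit setup: `alpha` is the reward-expectation
# learning/update rate, `epsilon` is the exploration rate.

def choose_action(q_values, epsilon):
    # exploration rate: with probability epsilon, try a random action
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # otherwise exploit the action with the highest expected reward
    return max(range(len(q_values)), key=q_values.__getitem__)

def update(q_values, action, reward, alpha):
    # learning rate: how fast reward expectations track new outcomes;
    # higher alpha adapts faster but is noisier, lower is more stable
    q_values[action] += alpha * (reward - q_values[action])
```

Raising `epsilon` and `alpha` lets the system relearn a modified reward function faster at the cost of more mistakes; lowering them gives stability at the cost of slower learning, as the message describes.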
>
> ------------------------------
> Date: Tue, 29 Jan 2013 14:11:05 -0600
> Subject: Re: [agi] Robots and Slavery
> From: [email protected]
> To: [email protected]
>
> If we can build a system capable of determining the value of concepts automatically, we can build a system that can readjust those values automatically, too. If that's not feasible for the design, it's an unsafe design, and you shouldn't have the expectation that it will act as you intend it to. You wouldn't get in a car without a steering wheel, would you? Would you trust an even more powerful and dangerous machine to just do the right thing, with no controls? Let's not build any machines of this uncontrollable nature.

-------------------------------------------
AGI Archives: https://www.listbox.com/member/archive/303/=now
RSS Feed: https://www.listbox.com/member/archive/rss/303/21088071-f452e424
Modify Your Subscription: https://www.listbox.com/member/?member_id=21088071&id_secret=21088071-58d57657
Powered by Listbox: http://www.listbox.com
