On 28/02/2014 11:06, Ben Goertzel wrote:
See
http://hplusmagazine.com/2014/02/28/saving-the-world-with-analytical-philosophy/
;-)
There, Matt writes:
"Reinforcement learning is slow because a reward signal transmits fewer
bits to a complex system than updating the code or giving explicit directions.
And self-modification would add no bits at all. MIRI needs to explain why this
will change in the future."
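As a back-of-envelope illustration of Matt's bits-per-step framing (the particular numbers here are my own assumptions, not his): a scalar reward drawn from k distinguishable levels carries at most log2(k) bits per time step, while an explicit direction over N possible actions can carry up to log2(N) bits per step.

```python
import math

def reward_bits_per_step(num_reward_levels=2):
    # A scalar reward with k distinguishable levels carries
    # at most log2(k) bits per time step.
    return math.log2(num_reward_levels)

def direction_bits_per_step(num_actions=16):
    # An explicit direction ("do action a") over N possible actions
    # can carry up to log2(N) bits per time step.
    return math.log2(num_actions)

print(reward_bits_per_step())     # 1.0 (binary reward)
print(direction_bits_per_step())  # 4.0 (16-action repertoire)
```

On these (assumed) figures the direction channel is four times as wide - which is the asymmetry Matt's argument turns on.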
This seems like a debate about what "reinforcement learning" means. Typically, a reward signal is not the only information that a "reinforcement learning agent" gains from its environment while it is learning. Such agents typically also have a conventional sensory array, through which they can receive "explicit directions" and other forms of input. If an agent has previously learned that "explicit directions" offer a useful shortcut to receiving rewards, then it will pay a good deal of attention to them.
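To make that concrete, here is a minimal toy sketch (my own construction, not anything from the post): a two-armed bandit learner whose sensory input includes a "hint" naming the paying arm. Purely from the scalar reward, it learns that the hint is worth following.

```python
import random

random.seed(0)

# Q-values indexed by (hint, action). The agent is never told the
# rule "follow the hint" - it only ever receives a scalar reward.
Q = {(h, a): 0.0 for h in (0, 1) for a in (0, 1)}
alpha, epsilon = 0.1, 0.1

for step in range(5000):
    good_arm = random.randint(0, 1)  # which arm pays off this round
    hint = good_arm                  # the "explicit direction" input
    if random.random() < epsilon:
        action = random.randint(0, 1)          # explore
    else:
        action = max((0, 1), key=lambda a: Q[(hint, a)])  # exploit
    reward = 1.0 if action == good_arm else 0.0
    Q[(hint, action)] += alpha * (reward - Q[(hint, action)])

# After training, the greedy policy follows the hint in both contexts.
print(max((0, 1), key=lambda a: Q[(0, a)]))  # 0
print(max((0, 1), key=lambda a: Q[(1, a)]))  # 1
```

The "supervision" arrives through an ordinary sensory channel; reinforcement is just what taught the agent to heed it.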
If you look on Wikipedia, you will probably see things like: "Reinforcement learning differs from standard supervised learning <http://en.wikipedia.org/wiki/Supervised_learning> in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected."
However, that sort of distinction is (arguably) not a very useful one for machine intelligence enthusiasts. In practice, learning agents typically learn that supervisors are worth following and corrections are worth heeding *via* reinforcement learning. Basically, it isn't true that a "reinforcement learning agent" can't use supervised learning, or make use of correction. Nor is it true that it has an impoverished rate of learning because of its limited scalar reward signal. This whole business seems like a muddle derived from attempting to distinguish the various different types of learning.
Reinforcement learning isn't best seen as an alternative to supervised learning; rather, reinforcement underpins *all* types of learning. It's the "universal currency" of learning, in the same way that fitness is the "universal currency" of evolution, money is the "universal currency" of economics, or utility is the "universal currency" of utilitarians.
Nor does it make very much theoretical difference whether the "goodness scalar" involved in the "universal currency" comes from the environment directly, or is synthesized internally from sensory data and current state using a "utility function".
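A minimal sketch of that last point (my own construction, with hypothetical helper names): the learner's update rule is identical whether the scalar arrives from the environment or is computed internally by a utility function over raw observations.

```python
# The learner only ever consumes a scalar "goodness" signal; where
# that scalar comes from is invisible to the update rule.

def external_reward(env_signal):
    # Reward supplied directly by the environment.
    return env_signal

def internal_reward(sensory_data, utility):
    # Reward synthesized from sensory data by a utility function.
    return utility(sensory_data)

def td_update(value, reward, alpha=0.5):
    # The same simple value update in both cases.
    return value + alpha * (reward - value)

utility = lambda obs: sum(obs) / len(obs)  # a toy internal utility

v = 0.0
v = td_update(v, external_reward(1.0))                   # env-supplied
v = td_update(v, internal_reward([0.8, 1.0], utility))   # synthesized
print(round(v, 2))  # 0.7
```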
Alas, an awful lot of hot air seems to surround this last point.
--
__________
|im |yler http://timtyler.org/ [email protected] Remove lock to reply.
-------------------------------------------
AGI
Archives: https://www.listbox.com/member/archive/303/=now