I've been taking a closer look at transformers. The big advance over LSTMs
was that they relate prediction to long-distance dependencies directly,
rather than passing those dependencies down a long recurrence chain. That's
the whole "attention" shtick. I knew that. Nice.

But something I was less aware of is that decoupling long-distance
dependencies from the recurrence mechanism seems to have liberated
transformers to go wild with directly representing dependencies. And with
multiple layers they seem to be building hierarchies over what they are
"attending" to. So they are basically building grammars.

This paper makes that clear:

Piotr Nawrot et al., "Hierarchical Transformers Are More Efficient Language
Models".
https://youtu.be/soqWNyrdjkw

They show that the middle layers of a language transformer can explicitly
shorten the sequence they work over, reducing the dimensions of the
representation. That's a grammar.
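
As I understand it, the mechanism in that paper (the "Hourglass"
architecture) is roughly: shorten the token sequence before the middle
layers and expand it again afterwards, so the middle layers attend over
merged, higher-level units. A rough sketch of the shorten/expand step,
assuming simple mean-pooling of adjacent tokens (the paper itself explores
several pooling and upsampling choices):

    import numpy as np

    def shorten(X, k=2):
        """Merge each run of k adjacent token vectors into one by mean-pooling.
        X: (n_tokens, d_model). Returns (ceil(n_tokens / k), d_model)."""
        n, d = X.shape
        pad = (-n) % k
        if pad:                                   # pad so the length divides evenly
            X = np.vstack([X, np.zeros((pad, d))])  # crude: padding dilutes the last group, fine for a sketch
        return X.reshape(-1, k, d).mean(axis=1)   # middle layers then attend over this shorter sequence

    def upsample(S, n, k=2):
        """Expand the shortened sequence back to n positions by repeating each merged vector."""
        return np.repeat(S, k, axis=0)[:n]

    X = np.random.default_rng(1).normal(size=(7, 4))
    S = shorten(X)        # (4, 4): each row now stands for a pair of tokens
    Y = upsample(S, 7)    # (7, 4): back to token resolution for the final layers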

The question is whether these grammars are different for each sentence in
the data. If they are different, they might reduce the dimensions of the
representation each time, but not in any way that can be abstracted
universally.

If the grammars generated are different for each sentence, then the
advantage of transformers over attempts to learn grammar explicitly, like
OpenCog's, will be that ignoring the hierarchies created and focusing solely
on the prediction task frees them from the expectation of universal
primitives. They can generate a different hierarchy for each sentence in the
data, and nobody notices. Ignorance is bliss.

Set against that advantage, the disadvantage will be that ignoring the
actual hierarchies created means we can't access those hierarchies for
higher reasoning, or constrain them with world knowledge. Which is indeed
the problem we face with transformers.

And another disadvantage will be the equally well-known one: generating
billions of subjective hierarchies in advance is enormously costly. There is
also a less-known one, which follows from the subjective-hierarchy insight:
generating hierarchies in advance is not only enormously wasteful of effort
but limiting, because there will always be a limit to the number of
subjective hierarchies you can generate in advance.

If all this is true, the next stage in the advance of transformers will be
to find a way to generate only the relevant subjective hierarchies at run
time.

Transformers learn their hierarchies using back-prop to minimize predictive
error over dot products. Those dot products converge on groupings of
elements which share predictions. If there were a way to find these
groupings directly, we might not have to rely on back-prop over dot
products. And we might be able to find only the relevant hierarchies at run
time.
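
To make "elements which share predictions" concrete: from a corpus you can
tabulate what each word predicts to follow it, and group words whose
predicted continuations overlap, with no back-prop and no learned
embeddings. A toy illustration (the corpus and the cosine-over-counts
measure are my own, not anything from the paper):

    from collections import Counter, defaultdict
    from math import sqrt

    corpus = "the cat sat on the mat . the dog sat on the rug . a cat ate . a dog ate .".split()

    # For each word, count what follows it: its "prediction profile".
    follows = defaultdict(Counter)
    for w, nxt in zip(corpus, corpus[1:]):
        follows[w][nxt] += 1

    def shared_prediction(a, b):
        """Cosine similarity between the next-word count vectors of a and b."""
        ca, cb = follows[a], follows[b]
        dot = sum(ca[x] * cb[x] for x in ca)
        na = sqrt(sum(v * v for v in ca.values()))
        nb = sqrt(sum(v * v for v in cb.values()))
        return dot / (na * nb) if na and nb else 0.0

    print(shared_prediction("cat", "dog"))   # high: both predict "sat" and "ate"
    print(shared_prediction("cat", "on"))    # zero: they predict different continuations

In this toy corpus "cat" and "dog" group together because they predict the
same continuations, which is exactly the kind of grouping the dot products
are presumably converging on.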

So the key to improving over transformers would seem to be to leverage
their (implicit) discovery that hierarchy is subjective to each sentence,
and to avoid the burden of generating that infinity of subjective
hierarchies in advance, by finding a method which directly groups elements
that share predictions, without back-prop over dot products. That method
could then generate a hierarchy subjective to each sentence presented to
the system, only at the time the sentence is presented.

If all the above is true, the key question is: what method could directly
group elements of language which share predictions into hierarchies?
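
I don't have the answer, but here is the shape such a method would have to
take: given some measure of "shared predictions" (such as the count-based
one sketched above), build the hierarchy for a single sentence on the fly
by repeatedly merging the adjacent pair that shares predictions most
strongly. A purely illustrative sketch, with the similarity measure left as
a placeholder:

    def build_hierarchy(tokens, similarity):
        """Greedy run-time bracketing: repeatedly merge the adjacent pair with the
        highest 'shared prediction' score until one constituent spans the sentence."""
        nodes = list(tokens)                      # each node is a token or a nested tuple
        while len(nodes) > 1:
            scores = [similarity(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]
            i = max(range(len(scores)), key=lambda j: scores[j])
            nodes[i:i + 2] = [(nodes[i], nodes[i + 1])]   # merge into one constituent
        return nodes[0]

    # placeholder similarity: favour merging small constituents first (a stand-in
    # for a real shared-prediction measure like the count-based one above)
    def toy_similarity(a, b):
        def size(x):
            return 1 if isinstance(x, str) else size(x[0]) + size(x[1])
        return -(size(a) + size(b))

    print(build_hierarchy("the cat sat on the mat".split(), toy_similarity))

With the placeholder it just builds a balanced-ish bracketing; the whole
question is what shared-prediction measure would make the brackets
linguistically meaningful.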
